* [Patch] mm tracepoints update
@ 2009-04-21 22:45 ` Larry Woodman
  0 siblings, 0 replies; 37+ messages in thread
From: Larry Woodman @ 2009-04-21 22:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm, riel, mingo, rostedt

[-- Attachment #1: Type: text/plain, Size: 1348 bytes --]


I've cleaned up the mm tracepoints to track page allocation and
freeing, various types of pagefaults and unmaps, and critical page
reclamation routines.  This is useful for debugging memory allocation
issues and system performance problems under heavy memory loads.


----------------------------------------------------------------------


# tracer: mm
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
         pdflush-624   [004]   184.293169: wb_kupdate: mm_pdflush_kupdate count=3e48
         pdflush-624   [004]   184.293439: get_page_from_freelist: mm_page_allocation pfn=447c27 zone_free=1940910
        events/6-33    [006]   184.962879: free_hot_cold_page: mm_page_free pfn=44bba9
      irqbalance-8313  [001]   188.042951: unmap_vmas: mm_anon_userfree mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
             cat-9122  [005]   191.141173: filemap_fault: mm_filemap_fault primary fault: mm=ffff88024c9d8f40 address=3cea2dd000 pfn=44d68e
             cat-9122  [001]   191.143036: handle_mm_fault: mm_anon_fault mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
-------------------------------------------------------------------------

Signed-off-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>


The patch applies to Ingo's latest tip tree:

[-- Attachment #2: 0001-Merge-mm-tracepoints-into-upstream-tip-tree.patch --]
[-- Type: application/mbox, Size: 20030 bytes --]


* Re: [Patch] mm tracepoints update
  2009-04-21 22:45 ` Larry Woodman
@ 2009-04-22  1:00   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2009-04-22  1:00 UTC (permalink / raw)
  To: Larry Woodman
  Cc: kosaki.motohiro, linux-kernel, linux-mm, riel, mingo, rostedt

> 
> I've cleaned up the mm tracepoints to track page allocation and
> freeing, various types of pagefaults and unmaps, and critical page
> reclamation routines.  This is useful for debugging memory allocation
> issues and system performance problems under heavy memory loads.

In a past thread, Andrew pointed out that a bare page tracer isn't useful.
Can you make a good consumer?


> 
> 
> ----------------------------------------------------------------------
> 
> 
> # tracer: mm
> #
> #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> #              | |       |          |         |
>          pdflush-624   [004]   184.293169: wb_kupdate: mm_pdflush_kupdate count=3e48
>          pdflush-624   [004]   184.293439: get_page_from_freelist: mm_page_allocation pfn=447c27 zone_free=1940910
>         events/6-33    [006]   184.962879: free_hot_cold_page: mm_page_free pfn=44bba9
>       irqbalance-8313  [001]   188.042951: unmap_vmas: mm_anon_userfree mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
>              cat-9122  [005]   191.141173: filemap_fault: mm_filemap_fault primary fault: mm=ffff88024c9d8f40 address=3cea2dd000 pfn=44d68e
>              cat-9122  [001]   191.143036: handle_mm_fault: mm_anon_fault mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
> -------------------------------------------------------------------------
> 
> Signed-off-by: Larry Woodman <lwoodman@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> 
> 
> The patch applies to Ingo's latest tip tree:





* Re: [Patch] mm tracepoints update
  2009-04-22  1:00   ` KOSAKI Motohiro
@ 2009-04-22  9:57     ` Ingo Molnar
  -1 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2009-04-22  9:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, Frédéric Weisbecker, Li Zefan,
	Pekka Enberg, eduard.munteanu
  Cc: Larry Woodman, linux-kernel, linux-mm, riel, rostedt


* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > I've cleaned up the mm tracepoints to track page allocation and 
> > freeing, various types of pagefaults and unmaps, and critical 
> > page reclamation routines.  This is useful for debugging memory 
> > allocation issues and system performance problems under heavy 
> > memory loads.
> 
> In a past thread, Andrew pointed out that a bare page tracer isn't useful.

(do you have a link to that mail?)

> Can you make a good consumer?

These MM tracepoints would be automatically seen by the 
ftrace-analyzer GUI tool for example:

  git://git.kernel.org/pub/scm/utils/kernel/ftrace/ftrace.git

And could also be seen by other tools such as kmemtrace. Beyond, of 
course, embedding in function tracer output.

Here's the list of advantages of the types of tracepoints Larry is 
proposing:

  - zero-copy and per-cpu splice() based tracing
  - binary tracing without printf overhead
  - structured logging records exposed under /debug/tracing/events
  - trace events embedded in function tracer output and other plugins
  - user-defined, per tracepoint filter expressions
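
As a minimal usage sketch of that events interface (an illustration,
not part of Larry's patch: it assumes debugfs is mounted at /debug, as
in the paths used in this thread, and drives the mm_page_allocation
event, whose field is named 'free' in its TP_STRUCT__entry):

	#include <stdio.h>

	/* Write a string to a tracing control file. */
	static int write_str(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fputs(val, f);
		return fclose(f);
	}

	int main(void)
	{
		char line[4096];
		FILE *trace;

		/* Enable just the mm_page_allocation tracepoint. */
		write_str("/debug/tracing/events/mm/mm_page_allocation/enable", "1");
		/* Per-tracepoint filter: log only nearly-exhausted zones. */
		write_str("/debug/tracing/events/mm/mm_page_allocation/filter",
			  "free < 1024");

		/* Dump whatever the ring buffer has collected so far. */
		trace = fopen("/debug/tracing/trace", "r");
		if (!trace)
			return 1;
		while (fgets(line, sizeof(line), trace))
			fputs(line, stdout);
		fclose(trace);
		return 0;
	}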

I think the main review question is: are they properly structured 
and do they expose essential information to analyze behavioral 
details of the kernel in this area?

	Ingo


* Re: [Patch] mm tracepoints update
  2009-04-22  9:57     ` Ingo Molnar
@ 2009-04-22 12:07       ` Larry Woodman
  -1 siblings, 0 replies; 37+ messages in thread
From: Larry Woodman @ 2009-04-22 12:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Frédéric Weisbecker, Li Zefan,
	Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm, riel,
	rostedt

On Wed, 2009-04-22 at 11:57 +0200, Ingo Molnar wrote:
> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > I've cleaned up the mm tracepoints to track page allocation and 
> > > freeing, various types of pagefaults and unmaps, and critical 
> > > page reclamation routines.  This is useful for debugging memory 
> > > allocation issues and system performance problems under heavy 
> > > memory loads.
> > 
> > In a past thread, Andrew pointed out that a bare page tracer isn't useful.
> 
> (do you have a link to that mail?)
> 
> > Can you make a good consumer?

I will work up some good examples of what these are useful for.  I
frequently use the mm tracepoint data in the debugfs trace buffer to
locate customer performance problems associated with memory allocation,
deallocation, paging and swapping, especially on large systems.

Larry

> 
> These MM tracepoints would be automatically seen by the 
> ftrace-analyzer GUI tool for example:
> 
>   git://git.kernel.org/pub/scm/utils/kernel/ftrace/ftrace.git
> 
> And could also be seen by other tools such as kmemtrace. Beyond, of 
> course, embedding in function tracer output.
> 
> Here's the list of advantages of the types of tracepoints Larry is 
> proposing:
> 
>   - zero-copy and per-cpu splice() based tracing
>   - binary tracing without printf overhead
>   - structured logging records exposed under /debug/tracing/events
>   - trace events embedded in function tracer output and other plugins
>   - user-defined, per tracepoint filter expressions
> 
> I think the main review question is: are they properly structured 
> and do they expose essential information to analyze behavioral 
> details of the kernel in this area?
> 
> 	Ingo
> 



* Re: [Patch] mm tracepoints update - use case.
  2009-04-22 12:07       ` Larry Woodman
  (?)
@ 2009-04-22 19:22       ` Larry Woodman
  2009-04-23  0:48           ` KOSAKI Motohiro
  -1 siblings, 1 reply; 37+ messages in thread
From: Larry Woodman @ 2009-04-22 19:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Frédéric Weisbecker, Li Zefan,
	Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm, riel,
	rostedt

[-- Attachment #1: Type: text/plain, Size: 695 bytes --]

On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:
> On Wed, 2009-04-22 at 11:57 +0200, Ingo Molnar wrote:
> > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > > In a past thread, Andrew pointed out that a bare page tracer isn't useful.
> > 
> > (do you have a link to that mail?)
> > 
> > > Can you make a good consumer?
> 
> I will work up some good examples of what these are useful for.  I
> frequently use the mm tracepoint data in the debugfs trace buffer to
> locate customer performance problems associated with memory allocation,
> deallocation, paging and swapping, especially on large systems.
> 
> Larry

Attached is an example of what the mm tracepoints can be used for:



[-- Attachment #2: usecase --]
[-- Type: text/plain, Size: 7275 bytes --]


At Red Hat I use these mm tracepoints in an older kernel version (2.6.18).
The following steps illustrate how the mm tracepoints were used to debug
and ultimately fix a problem.

1.) We had customer complaints about large NUMA systems burning up 100% of
a CPU in system mode when running memory-intensive applications that require
at least half, but not all, of the memory.

---------- top output -------------------------------------------------------
Tasks: 212 total,   2 running, 210 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16334996k total,  8979320k used,  7355676k free,     3280k buffers
Swap:  2031608k total,   129572k used,  1902036k free,   353220k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10723 root      20   0 16.0g 8.0g  376 R  100 51.4   0:17.78 mem
10724 root      20   0 12880 1224  872 R    1  0.0   0:00.06 top
 7822 root      20   0 10868  348  272 S    0  0.0   0:06.00 irqbalance
-----------------------------------------------------------------------------

2.) Using the mm tracepoints I could immediately see that __zone_reclaim() is
being called directly out of the memory allocator, indicating that
zone_reclaim_mode is non-zero (1).  In addition I could see that the priority
was decremented to zero and that 12342 pages had been reclaimed rather than
just enough to satisfy the page allocation request.

-----------------------------------------------------------------------------
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
<mem>-10723 [005]  6976.285610: mm_directreclaim_reclaimzone: reclaimed=12342, priority=0
-----------------------------------------------------------------------------

3.) zone_reclaim_mode is set to 1 in build_zonelists() on NUMA systems with 
sufficient distance between the nodes:

                /*
                 * If another node is sufficiently far away then it is better
                 * to reclaim pages in a zone before going off node.
                 */
                if (distance > RECLAIM_DISTANCE)
                        zone_reclaim_mode = 1;


4.) To verify zone_reclaim_mode was involved, I disabled it with
"echo 0 > /proc/sys/vm/zone_reclaim_mode", and sure enough the problem went
away.

5.) Next, after a reboot, using the mm tracepoints I could see that several
calls were made to shrink_zone() and that it had reclaimed many more pages
than it should have:

-----------------------------------------------------------------------------
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <mem>-10723 [005]   282.776271: mm_pagereclaim_shrinkzone: reclaimed=12342
           <mem>-10723 [005]   282.781209: mm_pagereclaim_shrinkzone: reclaimed=3540
           <mem>-10723 [005]   282.801194: mm_pagereclaim_shrinkzone: reclaimed=7528
-----------------------------------------------------------------------------

6.) In between the shrink_zone() runs, shrink_active_list() and 
shrink_inactive_list() had run several times, each time fulfilling the memory
request from the pagecache.

-----------------------------------------------------------------------------
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <mem>-10723 [005]   282.755691: mm_pagereclaim_shrinkinactive: scanned=32, pagecache, priority=4
           <mem>-10723 [005]   282.755766: mm_pagereclaim_shrinkinactive: scanned=32, pagecache, priority=4
           <mem>-10723 [005]   282.755795: mm_pagereclaim_shrinkinactive: scanned=32, pagecache, priority=4
 ...
           <mem>-10723 [005]   282.755845: mm_pagereclaim_shrinkactive: scanned=32, pagecache, priority=4
           <mem>-10723 [005]   282.755882: mm_pagereclaim_shrinkactive: scanned=32, pagecache, priority=4
           <mem>-10723 [005]   282.755938: mm_pagereclaim_shrinkactive: scanned=32, pagecache, priority=4
-----------------------------------------------------------------------------

7.) This indicates that the direct memory reclaim code path, called directly
from the memory allocator when zone_reclaim_mode is non-zero, could reclaim
far more than SWAP_CLUSTER_MAX pages and consume significant CPU time doing
so:

-----------------------------------------------------------------------------
get_page_from_freelist(..)

                if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
                        unsigned long mark;
                        if (alloc_flags & ALLOC_WMARK_MIN)
                                mark = (*z)->pages_min;
                        else if (alloc_flags & ALLOC_WMARK_LOW)
                                mark = (*z)->pages_low;
                        else
                                mark = (*z)->pages_high;
                        if (!zone_watermark_ok(*z, order, mark,
                                    classzone_idx, alloc_flags))
                                if (!zone_reclaim_mode ||
                                    !zone_reclaim(*z, gfp_mask, order))
                                        continue;
                }

-----------------------------------------------------------------------------

8.) On further investigation I found that the 2.6.18 shrink_zone() was missing
an upstream patch that bails out as soon as SWAP_CLUSTER_MAX pages have been 
reclaimed.

-----------------------------------------------------------------------------
shrink_zone(...)

+               /*
+                * On large memory systems, scan >> priority can become
+                * really large. This is fine for the starting priority;
+                * we want to put equal scanning pressure on each zone.
+                * However, if the VM has a harder time of freeing pages,
+                * with multiple processes reclaiming pages, the total
+                * freeing target can get unreasonably large.
+                */
+               if (nr_reclaimed > swap_cluster_max &&
+                       priority < DEF_PRIORITY && !current_is_kswapd())
+                       break; 
-----------------------------------------------------------------------------

9.) Including this patch in shrink_zone() fixed the problem by terminating
reclaim once enough memory has been reclaimed to satisfy the __alloc_pages()
request on the local node.


This example is relatively simple and does not illustrate the use of
every one of the proposed mm tracepoints.  It shows how they can be used to
quickly drill down into performance and other problems without several
iterations of rebuilding the kernel to add debug code.



* Re: [Patch] mm tracepoints update - use case.
  2009-04-22 19:22       ` [Patch] mm tracepoints update - use case Larry Woodman
@ 2009-04-23  0:48           ` KOSAKI Motohiro
  0 siblings, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2009-04-23  0:48 UTC (permalink / raw)
  To: Larry Woodman
  Cc: kosaki.motohiro, Ingo Molnar, Frédéric Weisbecker,
	Li Zefan, Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm,
	riel, rostedt

> On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:
> > On Wed, 2009-04-22 at 11:57 +0200, Ingo Molnar wrote:
> > > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > > In a past thread, Andrew pointed out that a bare page tracer isn't useful.
> > > 
> > > (do you have a link to that mail?)
> > > 
> > > > Can you make a good consumer?
> > 
> > I will work up some good examples of what these are useful for.  I
> > frequently use the mm tracepoint data in the debugfs trace buffer to
> > locate customer performance problems associated with memory allocation,
> > deallocation, paging and swapping, especially on large systems.
> > 
> > Larry
> 
> Attached is an example of what the mm tracepoints can be used for:

I have some comments.

1. Yes, the current zone_reclaim has strange behavior. I plan to fix
   some of that bug-like behavior.
2. Your scenario only uses the information that zone_reclaim was called;
   the function tracer already provides that.
3. But yes, you are headed in the proper direction. We definitely need
   some fine-grained tracepoints in this area, and you are welcome.
   In my personal opinion, though, your tracepoints carry a lot of
   low-value arguments; we need better information.
   I think I can help you in this area. I hope we can work together.









* Re: [Patch] mm tracepoints update - use case.
  2009-04-23  0:48           ` KOSAKI Motohiro
@ 2009-04-23  4:50             ` Andrew Morton
  -1 siblings, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2009-04-23  4:50 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Larry Woodman, Ingo Molnar, Frédéric Weisbecker,
	Li Zefan, Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm,
	riel, rostedt

On Thu, 23 Apr 2009 09:48:04 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:
> > > On Wed, 2009-04-22 at 11:57 +0200, Ingo Molnar wrote:
> > > > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > > > In a past thread, Andrew pointed out that a bare page tracer isn't useful.
> > > > 
> > > > (do you have a link to that mail?)

http://lkml.indiana.edu/hypermail/linux/kernel/0903.0/02674.html

And Larry's example use case here tends to reinforce what I said then.  Look:

: In addition I could see that the priority was decremented to zero and
: that 12342 pages had been reclaimed rather than just enough to satisfy
: the page allocation request.
: 
: -----------------------------------------------------------------------------
: # tracer: nop
: #
: #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
: #              | |       |          |         |
: <mem>-10723 [005]  6976.285610: mm_directreclaim_reclaimzone: reclaimed=12342, priority=0

and

: -----------------------------------------------------------------------------
: # tracer: nop
: #
: #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
: #              | |       |          |         |
:            <mem>-10723 [005]   282.776271: mm_pagereclaim_shrinkzone: reclaimed=12342
:            <mem>-10723 [005]   282.781209: mm_pagereclaim_shrinkzone: reclaimed=3540
:            <mem>-10723 [005]   282.801194: mm_pagereclaim_shrinkzone: reclaimed=7528
: -----------------------------------------------------------------------------

This diagnosis was successful because the "reclaimed" number was weird.
By sheer happy coincidence, page-reclaim is already generating the
aggregated numbers for us, and the tracer just prints them out.

If some other problem is being worked on and if there _isn't_ some
convenient already-present aggregated result for the tracer to print,
the problem won't be solved.  Unless a vast number of trace events are
emitted and problem-specific userspace code is written to aggregate
them into something which the developer can use.
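
For illustration, a minimal sketch of the kind of problem-specific
userspace aggregation described above (assumptions: debugfs mounted at
/debug, and the mm_pagereclaim_shrinkzone output format quoted earlier):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		static const char key[] = "mm_pagereclaim_shrinkzone: reclaimed=";
		FILE *trace = fopen("/debug/tracing/trace", "r");
		char line[4096];
		unsigned long events = 0, total = 0;

		if (!trace)
			return 1;
		/* Fold the per-event "reclaimed=" counts into one aggregate. */
		while (fgets(line, sizeof(line), trace)) {
			char *p = strstr(line, key);

			if (!p)
				continue;
			total += strtoul(p + sizeof(key) - 1, NULL, 10);
			events++;
		}
		fclose(trace);
		printf("%lu shrink_zone events, %lu pages reclaimed in total\n",
		       events, total);
		return 0;
	}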




* Re: [Patch] mm tracepoints update - use case.
  2009-04-23  4:50             ` Andrew Morton
@ 2009-04-23  8:42               ` Ingo Molnar
  -1 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2009-04-23  8:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, Larry Woodman, Frédéric Weisbecker,
	Li Zefan, Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm,
	riel, rostedt


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 23 Apr 2009 09:48:04 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:
> > > > On Wed, 2009-04-22 at 11:57 +0200, Ingo Molnar wrote:
> > > > > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > 
> > > > > > In a past thread, Andrew pointed out that a bare page tracer isn't useful.
> > > > > 
> > > > > (do you have a link to that mail?)
> 
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.0/02674.html
> 
> And Larry's example use case here tends to reinforce what I said then.  Look:
> 
> : In addition I could see that the priority was decremented to zero and
> : that 12342 pages had been reclaimed rather than just enough to satisfy
> : the page allocation request.
> : 
> : -----------------------------------------------------------------------------
> : # tracer: nop
> : #
> : #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> : #              | |       |          |         |
> : <mem>-10723 [005]  6976.285610: mm_directreclaim_reclaimzone: reclaimed=12342, priority=0
> 
> and
> 
> : -----------------------------------------------------------------------------
> : # tracer: nop
> : #
> : #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> : #              | |       |          |         |
> :            <mem>-10723 [005]   282.776271: mm_pagereclaim_shrinkzone: reclaimed=12342
> :            <mem>-10723 [005]   282.781209: mm_pagereclaim_shrinkzone: reclaimed=3540
> :            <mem>-10723 [005]   282.801194: mm_pagereclaim_shrinkzone: reclaimed=7528
> : -----------------------------------------------------------------------------
> 
> This diagnosis was successful because the "reclaimed" number was 
> weird. By sheer happy coincidence, page-reclaim is already 
> generating the aggregated numbers for us, and the tracer just 
> prints them out.
> 
> If some other problem is being worked on and if there _isn't_ some 
> convenient already-present aggregated result for the tracer to 
> print, the problem won't be solved.  Unless a vast number of trace 
> events are emitted and problem-specific userspace code is written 
> to aggregate them into something which the developer can use.

Not so in the use cases where I made use of tracers. The key is not to 
trace everything, but to have a few key _concepts_ traced 
pervasively. Having a dynamic notion of per-event changes is also 
obviously good. In a fast-changing workload you cannot just tell 
based on summary statistics whether rapid changes are the product of 
the inherent entropy of the workload, or the result of the MM being 
confused.

/proc/ statistics versus good tracing is like the difference 
between a magnifying glass and an electron microscope. Both have 
their strengths, and they are best if used together.

One such conceptual thing in the scheduler is the lifetime of a 
task, its schedule, deschedule and wakeup events. It can already 
show a massive amount of badness in practice, and it only takes a 
few tracepoints to do.

Same goes for the MM IMHO. The number of pages reclaimed is obviously a 
key metric to follow. Larry is an expert who fixed a _lot_ of MM 
crap in the last 5-10 years at Red Hat, so if he says that these 
tracepoints are useful to him, we shouldn't just dismiss that 
experience out of hand. I wish Larry spent some of his energies on 
fixing the upstream MM too ;-)

A balanced number of MM tracepoints, showing the concepts and the 
inner dynamics of the MM, would be useful. We don't need every little 
detail traced (we have the function tracer for that), but a few key 
aspects would be nice to capture ...

pagefaults, allocations, cache-misses, cache flushes and how pages 
shift between various queues in the MM would be a good start IMHO.

Anyway, I suspect your answer means a NAK :-( It would be nice if you 
would suggest a path out of that NAK.

	Ingo


* Re: [Patch] mm tracepoints update - use case.
  2009-04-23  8:42               ` Ingo Molnar
@ 2009-04-23 11:47                 ` Larry Woodman
  -1 siblings, 0 replies; 37+ messages in thread
From: Larry Woodman @ 2009-04-23 11:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, KOSAKI Motohiro, Frédéric Weisbecker,
	Li Zefan, Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm,
	riel, rostedt

On Thu, 2009-04-23 at 10:42 +0200, Ingo Molnar wrote:

> 
> Not so in the use cases where I made use of tracers. The key is not to 
> trace everything, but to have a few key _concepts_ traced 
> pervasively. Having a dynamic notion of per-event changes is also 
> obviously good. In a fast-changing workload you cannot just tell 
> based on summary statistics whether rapid changes are the product of 
> the inherent entropy of the workload, or the result of the MM being 
> confused.
> 
> /proc/ statistics versus good tracing is like the difference 
> between a magnifying glass and an electron microscope. Both have 
> their strengths, and they are best if used together.
> 
> One such conceptual thing in the scheduler is the lifetime of a 
> task, its schedule, deschedule and wakeup events. It can already 
> show a massive amount of badness in practice, and it only takes a 
> few tracepoints to do.
> 
> Same goes for the MM IMHO. The number of pages reclaimed is obviously a 
> key metric to follow. Larry is an expert who fixed a _lot_ of MM 
> crap in the last 5-10 years at Red Hat, so if he says that these 
> tracepoints are useful to him, we shouldn't just dismiss that 
> experience out of hand. I wish Larry spent some of his energies on 
> fixing the upstream MM too ;-)
> 
> A balanced number of MM tracepoints, showing the concepts and the 
> inner dynamics of the MM, would be useful. We don't need every little 
> detail traced (we have the function tracer for that), but a few key 
> aspects would be nice to capture ...

I hear you; there is a lot of data coming out of these mm tracepoints,
as well as most of the other tracepoints I've played around with, and we
have to filter them.  I added them in locations that would allow us to
debug a variety of real running systems, such as a Wall St. trading
server during the heaviest period of the day, without rebooting to a
debug kernel.  We can collect whatever is needed to figure out what's
happening, then turn it all off when we've collected enough.  We've seen
systems experiencing performance problems caused by the "innards" of the
page reclaim code, memory leak problems caused by applications,
excessive COW faults caused by applications that mmap() gigs of files
and then fork, and applications that rely on the kernel to flush out
every modified page of those gigs of mmap()'d file data every 30 seconds
via kupdate because other kernels do.  The list goes on and on...  These
tracepoints are in the same locations where we've placed debug code in
debug kernels in the past.

Larry


 
> 
> pagefaults, allocations, cache-misses, cache flushes and how pages 
> shift between various queues in the MM would be a good start IMHO.
> 
> Anyway, I suspect your answer means a NAK :-( It would be nice if you 
> would suggest a path out of that NAK.
> 
> 	Ingo



* Re: [Patch] mm tracepoints update - use case.
  2009-04-23 11:47                 ` Larry Woodman
  (?)
@ 2009-04-24 20:48                 ` Larry Woodman
  -1 siblings, 0 replies; 37+ messages in thread
From: Larry Woodman @ 2009-04-24 20:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, KOSAKI Motohiro, Frédéric Weisbecker,
	Li Zefan, Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm,
	riel, rostedt

[-- Attachment #1: Type: text/plain, Size: 2479 bytes --]

On Thu, 2009-04-23 at 07:47 -0400, Larry Woodman wrote:
> On Thu, 2009-04-23 at 10:42 +0200, Ingo Molnar wrote:

> > 
> > A balanced number of MM tracepoints, showing the concepts and the 
> > inner dynamics of the MM, would be useful. We don't need every little 
> > detail traced (we have the function tracer for that), but a few key 
> > aspects would be nice to capture ...
> 
> I hear you; there is a lot of data coming out of these mm tracepoints,
> as well as most of the other tracepoints I've played around with, and we
> have to filter them.  I added them in locations that would allow us to
> debug a variety of real running systems, such as a Wall St. trading
> server during the heaviest period of the day, without rebooting to a
> debug kernel.  We can collect whatever is needed to figure out what's
> happening, then turn it all off when we've collected enough.  We've seen
> systems experiencing performance problems caused by the "innards" of the
> page reclaim code, memory leak problems caused by applications,
> excessive COW faults caused by applications that mmap() gigs of files
> and then fork, and applications that rely on the kernel to flush out
> every modified page of those gigs of mmap()'d file data every 30 seconds
> via kupdate because other kernels do.  The list goes on and on...  These
> tracepoints are in the same locations where we've placed debug code in
> debug kernels in the past.
> 
> Larry
> 
 
> > 
> > pagefaults, allocations, cache-misses, cache flushes and how pages 
> > shift between various queues in the MM would be a good start IMHO.
> > 
> > Anyway, I suspect your answer means a NAK :-( It would be nice if you 
> > would suggest a path out of that NAK.
> > 
> > 	Ingo
> 


I've overhauled the patch so that all page-level tracing has been
removed unless it directly causes page reclamation.  At this point we
trace individual pagefaults, unmaps and pageouts.  However, for all page
reclaim and writeback paths we now trace quantities of pages activated,
deactivated, written, reclaimed, etc.  Also, we now only trace
individual page allocations that cause further page reclamation to
occur.  This still provides the necessary microscopic level of detail
without tracing the movement of all the pageframes.  The updated patch
is attached below.
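
First, a minimal call-site sketch (an illustration only; the placement
and the helper name are assumptions, not taken from the attached patch)
of confining the allocation event to allocations that fall into page
reclaim:

	#include <linux/vmstat.h>
	#include <trace/events/mm.h>

	/*
	 * Sketch: emit mm_page_allocation from the allocator slow path,
	 * once the fast path has failed and reclaim is about to run, so
	 * that ordinary allocations generate no events.  The single
	 * argument matches TP_PROTO(unsigned long free) in the header
	 * attached below.
	 */
	static inline void report_alloc_entering_reclaim(struct zone *zone)
	{
		trace_mm_page_allocation(zone_page_state(zone, NR_FREE_PAGES));
	}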



[-- Attachment #2: mm-tracepoints.patch --]
[-- Type: text/x-patch, Size: 17759 bytes --]

diff --git a/include/trace/events/mm.h b/include/trace/events/mm.h
new file mode 100644
index 0000000..6b1c114
--- /dev/null
+++ b/include/trace/events/mm.h
@@ -0,0 +1,436 @@
+#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MM_H
+
+#include <linux/mm.h>
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mm
+
+TRACE_EVENT(mm_anon_fault,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+);
+
+TRACE_EVENT(mm_anon_pgin,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_anon_cow,
+
+	TP_PROTO(struct mm_struct *mm,
+			unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_anon_userfree,
+
+	TP_PROTO(struct mm_struct *mm,
+			unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_anon_unmap,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_filemap_fault,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address, int flag),
+	TP_ARGS(mm, address, flag),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+		__field(int, flag)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+		__entry->flag = flag;
+	),
+
+	TP_printk("%s: mm=%lx address=%lx",
+		__entry->flag ? "pagein" : "primary fault",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_filemap_cow,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_filemap_unmap,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_filemap_userunmap,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_pagereclaim_pgout,
+
+	TP_PROTO(struct address_space *mapping, unsigned long offset, int anon),
+
+	TP_ARGS(mapping, offset, anon),
+
+	TP_STRUCT__entry(
+		__field(struct address_space *, mapping)
+		__field(unsigned long, offset)
+		__field(int, anon)
+	),
+
+	TP_fast_assign(
+		__entry->mapping = mapping;
+		__entry->offset = offset;
+		__entry->anon = anon;
+	),
+
+	TP_printk("mapping=%lx, offset=%lx %s",
+		(unsigned long)__entry->mapping, __entry->offset, 
+			__entry->anon ? "anonymous" : "pagecache")
+	);
+
+TRACE_EVENT(mm_pagereclaim_free,
+
+	TP_PROTO(unsigned long nr_reclaimed),
+
+	TP_ARGS(nr_reclaimed),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, nr_reclaimed)
+	),
+
+	TP_fast_assign(
+		__entry->nr_reclaimed = nr_reclaimed;
+	),
+
+	TP_printk("freed=%ld", __entry->nr_reclaimed)
+	);
+
+TRACE_EVENT(mm_pdflush_bgwriteout,
+
+	TP_PROTO(unsigned long written),
+
+	TP_ARGS(written),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, written)
+	),
+
+	TP_fast_assign(
+		__entry->written = written;
+	),
+
+	TP_printk("written=%ld", __entry->written)
+	);
+
+TRACE_EVENT(mm_pdflush_kupdate,
+
+	TP_PROTO(unsigned long writes),
+
+	TP_ARGS(writes),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, writes)
+	),
+
+	TP_fast_assign(
+		__entry->writes = writes;
+	),
+
+	TP_printk("writes=%ld", __entry->writes)
+	);
+
+TRACE_EVENT(mm_balance_dirty,
+
+	TP_PROTO(unsigned long written),
+
+	TP_ARGS(written),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, written)
+	),
+
+	TP_fast_assign(
+		__entry->written = written;
+	),
+
+	TP_printk("written=%ld", __entry->written)
+	);
+
+TRACE_EVENT(mm_page_allocation,
+
+	TP_PROTO(unsigned long free),
+
+	TP_ARGS(free),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, free)
+	),
+
+	TP_fast_assign(
+		__entry->free = free;
+	),
+
+	TP_printk("zone_free=%ld", __entry->free)
+	);
+
+TRACE_EVENT(mm_kswapd_ran,
+
+	TP_PROTO(struct pglist_data *pgdat, unsigned long reclaimed),
+
+	TP_ARGS(pgdat, reclaimed),
+
+	TP_STRUCT__entry(
+		__field(struct pglist_data *, pgdat)
+		__field(int, node_id)
+		__field(unsigned long, reclaimed)
+	),
+
+	TP_fast_assign(
+		__entry->pgdat = pgdat;
+		__entry->node_id = pgdat->node_id;
+		__entry->reclaimed = reclaimed;
+	),
+
+	TP_printk("node=%d reclaimed=%ld", __entry->node_id, __entry->reclaimed)
+	);
+
+TRACE_EVENT(mm_directreclaim_reclaimall,
+
+	TP_PROTO(int node, unsigned long reclaimed, unsigned long priority),
+
+	TP_ARGS(node, reclaimed, priority),
+
+	TP_STRUCT__entry(
+		__field(int, node)
+		__field(unsigned long, reclaimed)
+		__field(unsigned long, priority)
+	),
+
+	TP_fast_assign(
+		__entry->node = node;
+		__entry->reclaimed = reclaimed;
+		__entry->priority = priority;
+	),
+
+	TP_printk("node=%d reclaimed=%ld priority=%ld", __entry->node, __entry->reclaimed, 
+					__entry->priority)
+	);
+
+TRACE_EVENT(mm_directreclaim_reclaimzone,
+
+	TP_PROTO(int node, unsigned long reclaimed, unsigned long priority),
+
+	TP_ARGS(node, reclaimed, priority),
+
+	TP_STRUCT__entry(
+		__field(int, node)
+		__field(unsigned long, reclaimed)
+		__field(unsigned long, priority)
+	),
+
+	TP_fast_assign(
+		__entry->node = node;
+		__entry->reclaimed = reclaimed;
+		__entry->priority = priority;
+	),
+
+	TP_printk("node = %d reclaimed=%ld, priority=%ld",
+			__entry->node, __entry->reclaimed, __entry->priority)
+	);
+TRACE_EVENT(mm_pagereclaim_shrinkzone,
+
+	TP_PROTO(unsigned long reclaimed),
+
+	TP_ARGS(reclaimed),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, reclaimed)
+	),
+
+	TP_fast_assign(
+		__entry->reclaimed = reclaimed;
+	),
+
+	TP_printk("reclaimed=%ld", __entry->reclaimed)
+	);
+
+TRACE_EVENT(mm_pagereclaim_shrinkactive,
+
+	TP_PROTO(unsigned long scanned, int file, int priority),
+
+	TP_ARGS(scanned, file, priority),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, scanned)
+		__field(int, file)
+		__field(int, priority)
+	),
+
+	TP_fast_assign(
+		__entry->scanned = scanned;
+		__entry->file = file;
+		__entry->priority = priority;
+	),
+
+	TP_printk("scanned=%ld, %s, priority=%d",
+		__entry->scanned, __entry->file ? "pagecache" : "anonymous",
+		__entry->priority)
+	);
+
+TRACE_EVENT(mm_pagereclaim_shrinkinactive,
+
+	TP_PROTO(unsigned long scanned, unsigned long reclaimed,
+			int file, int priority),
+
+	TP_ARGS(scanned, reclaimed, file, priority),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, scanned)
+		__field(unsigned long, reclaimed)
+		__field(int, file)
+		__field(int, priority)
+	),
+
+	TP_fast_assign(
+		__entry->scanned = scanned;
+		__entry->reclaimed = reclaimed;
+		__entry->file = file;
+		__entry->priority = priority;
+	),
+
+	TP_printk("scanned=%ld, reclaimed=%ld %s, priority=%d",
+		__entry->scanned, __entry->reclaimed, 
+		__entry->file ? "pagecache" : "anonymous",
+		__entry->priority)
+	);
+
+#endif /* _TRACE_MM_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..c4424ed 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,8 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <linux/ftrace.h>
+#include <trace/events/mm.h>
 #include "internal.h"
 
 /*
@@ -1568,6 +1570,8 @@ retry_find:
 	 */
 	ra->prev_pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
 	vmf->page = page;
+	trace_mm_filemap_fault(vma->vm_mm, (unsigned long)vmf->virtual_address,
+			vmf->flags&FAULT_FLAG_NONLINEAR);
 	return ret | VM_FAULT_LOCKED;
 
 no_cached_page:
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..27f5e0b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -55,6 +55,7 @@
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
+#include <linux/ftrace.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -64,6 +65,8 @@
 
 #include "internal.h"
 
+#include <trace/events/mm.h>
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 /* use the per-pgdat data instead for discontigmem - mbligh */
 unsigned long max_mapnr;
@@ -812,15 +815,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 						addr) != page->index)
 				set_pte_at(mm, addr, pte,
 					   pgoff_to_pte(page->index));
-			if (PageAnon(page))
+			if (PageAnon(page)) {
 				anon_rss--;
-			else {
+				trace_mm_anon_userfree(mm, addr);
+			} else {
 				if (pte_dirty(ptent))
 					set_page_dirty(page);
 				if (pte_young(ptent) &&
 				    likely(!VM_SequentialReadHint(vma)))
 					mark_page_accessed(page);
 				file_rss--;
+				trace_mm_filemap_userunmap(mm, addr);
 			}
 			page_remove_rmap(page);
 			if (unlikely(page_mapcount(page) < 0))
@@ -1896,7 +1901,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
 		spinlock_t *ptl, pte_t orig_pte)
 {
-	struct page *old_page, *new_page;
+	struct page *old_page, *new_page = NULL;
 	pte_t entry;
 	int reuse = 0, ret = 0;
 	int page_mkwrite = 0;
@@ -2039,9 +2044,12 @@ gotten:
 			if (!PageAnon(old_page)) {
 				dec_mm_counter(mm, file_rss);
 				inc_mm_counter(mm, anon_rss);
+				trace_mm_filemap_cow(mm, address);
 			}
-		} else
+		} else {
 			inc_mm_counter(mm, anon_rss);
+			trace_mm_anon_cow(mm, address);
+		}
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2416,7 +2424,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		int write_access, pte_t orig_pte)
 {
 	spinlock_t *ptl;
-	struct page *page;
+	struct page *page = NULL;
 	swp_entry_t entry;
 	pte_t pte;
 	struct mem_cgroup *ptr = NULL;
@@ -2517,6 +2525,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 out:
+	trace_mm_anon_pgin(mm, address);
 	return ret;
 out_nomap:
 	mem_cgroup_cancel_charge_swapin(ptr);
@@ -2549,6 +2558,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto oom;
 	__SetPageUptodate(page);
 
+	trace_mm_anon_fault(mm, address);
 	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
 		goto oom_free_page;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 30351f0..a3d469c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -34,6 +34,8 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/ftrace.h>
+#include <trace/events/mm.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -574,6 +576,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		congestion_wait(WRITE, HZ/10);
 	}
 
+	trace_mm_balance_dirty(pages_written);
 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
 			bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
@@ -716,6 +719,7 @@ static void background_writeout(unsigned long _min_pages)
 				break;
 		}
 	}
+	trace_mm_pdflush_bgwriteout(_min_pages);
 }
 
 /*
@@ -776,6 +780,7 @@ static void wb_kupdate(unsigned long arg)
 	nr_to_write = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
 			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+	trace_mm_pdflush_kupdate(nr_to_write);
 	while (nr_to_write > 0) {
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a3df888..73576cf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -47,6 +47,8 @@
 #include <linux/page-isolation.h>
 #include <linux/page_cgroup.h>
 #include <linux/debugobjects.h>
+#include <linux/ftrace.h>
+#include <trace/events/mm.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1443,6 +1445,7 @@ zonelist_scan:
 				mark = zone->pages_high;
 			if (!zone_watermark_ok(zone, order, mark,
 				    classzone_idx, alloc_flags)) {
+				trace_mm_page_allocation(zone_page_state(zone, NR_FREE_PAGES));
 				if (!zone_reclaim_mode ||
 				    !zone_reclaim(zone, gfp_mask, order))
 					goto this_zone_full;
diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..8f2b43f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -50,6 +50,8 @@
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
+#include <linux/ftrace.h>
+#include <trace/events/mm.h>
 
 #include <asm/tlbflush.h>
 
@@ -1025,6 +1027,7 @@ static int try_to_unmap_anon(struct page *page, int unlock, int migration)
 			if (mlocked)
 				break;	/* stop if actually mlocked page */
 		}
+		trace_mm_anon_unmap(vma->vm_mm, vma->vm_start+page->index);
 	}
 
 	page_unlock_anon_vma(anon_vma);
@@ -1152,6 +1155,7 @@ static int try_to_unmap_file(struct page *page, int unlock, int migration)
 					goto out;
 			}
 			vma->vm_private_data = (void *) max_nl_cursor;
+			trace_mm_filemap_unmap(vma->vm_mm, vma->vm_start+page->index);
 		}
 		cond_resched_lock(&mapping->i_mmap_lock);
 		max_nl_cursor += CLUSTER_SIZE;
@@ -1170,6 +1174,7 @@ out:
 		ret = SWAP_MLOCK;	/* actually mlocked the page */
 	else if (ret == SWAP_MLOCK)
 		ret = SWAP_AGAIN;	/* saw VM_LOCKED vma */
+
 	return ret;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eac9577..6f3a543 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,9 @@
 #include <linux/memcontrol.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
+#include <linux/ftrace.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/mm.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -417,6 +420,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
+		trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT,
+						PageAnon(page));
 		return PAGE_SUCCESS;
 	}
 
@@ -794,6 +799,7 @@ keep:
 	if (pagevec_count(&freed_pvec))
 		__pagevec_free(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
+	trace_mm_pagereclaim_free(nr_reclaimed);
 	return nr_reclaimed;
 }
 
@@ -1180,6 +1186,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 done:
 	local_irq_enable();
 	pagevec_release(&pvec);
+	trace_mm_pagereclaim_shrinkinactive(nr_scanned, nr_reclaimed,
+				file, priority);
 	return nr_reclaimed;
 }
 
@@ -1314,6 +1322,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	if (buffer_heads_over_limit)
 		pagevec_strip(&pvec);
 	pagevec_release(&pvec);
+	trace_mm_pagereclaim_shrinkactive(pgscanned, file, priority);
 }
 
 static int inactive_anon_is_low_global(struct zone *zone)
@@ -1514,6 +1523,7 @@ static void shrink_zone(int priority, struct zone *zone,
 	}
 
 	sc->nr_reclaimed = nr_reclaimed;
+	trace_mm_pagereclaim_shrinkzone(nr_reclaimed);
 
 	/*
 	 * Even if we did not try to evict anon pages at all, we want to
@@ -1676,6 +1686,8 @@ out:
 	if (priority < 0)
 		priority = 0;
 
+	trace_mm_directreclaim_reclaimall(zonelist[0]._zonerefs->zone->node,
+						sc->nr_reclaimed, priority);
 	if (scanning_global_lru(sc)) {
 		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 
@@ -1945,6 +1957,7 @@ out:
 		goto loop_again;
 	}
 
+	trace_mm_kswapd_ran(pgdat, sc.nr_reclaimed);
 	return sc.nr_reclaimed;
 }
 
@@ -2297,7 +2310,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
+	int priority = ZONE_RECLAIM_PRIORITY;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2364,6 +2377,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 
 	p->reclaim_state = NULL;
 	current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
+	trace_mm_directreclaim_reclaimzone(zone->node,
+				sc.nr_reclaimed, priority);
 	return sc.nr_reclaimed >= nr_pages;
 }
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [Patch] mm tracepoints update - use case.
  2009-04-23  0:48           ` KOSAKI Motohiro
@ 2009-06-15 18:26             ` Rik van Riel
  -1 siblings, 0 replies; 37+ messages in thread
From: Rik van Riel @ 2009-06-15 18:26 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Larry Woodman, Ingo Molnar, Frédéric Weisbecker,
	Li Zefan, Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm,
	rostedt

KOSAKI Motohiro wrote:
>> On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:

>> Attached is an example of what the mm tracepoints can be used for:
> 
> I have some comments.
> 
> 1. Yes, current zone_reclaim has strange behavior. I plan to fix
>    some of its bug-like behavior.
> 2. Your scenario only uses the information that "zone_reclaim was called".
>    The function tracer already provides that.
> 3. But yes, you are going in the proper direction. We definitely need
>    some fine-grained tracepoints in this area; you are welcome here.
>    But in my personal feeling, your tracepoints have a lot of worthless
>    arguments. We need better information.
>    I think I can help you in this area. I hope we can work together.

Sorry I am replying to a really old email, but exactly
what information do you believe would be more useful to
extract from vmscan.c with tracepoints?

What are the kinds of problems that customer systems
(which cannot be rebooted into experimental kernels)
run into, that can be tracked down with tracepoints?

I can think of a few:
- excessive CPU use in page reclaim code
- excessive reclaim latency in page reclaim code
- unbalanced memory allocation between zones/nodes
- strange balance problems between reclaiming of page
   cache and swapping out process pages

I suspect we would need fairly fine grained tracepoints
to track down these kinds of problems, with filtering
and/or interpretation in userspace, but I am always
interested in easier ways of tracking down these kinds
of problems :)

What kinds of tracepoints do you believe we would need?

Or, using Larry's patch as a starting point, what do you
believe should be changed?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Patch] mm tracepoints update - use case.
  2009-06-15 18:26             ` Rik van Riel
  (?)
@ 2009-06-17 14:07             ` Larry Woodman
  -1 siblings, 0 replies; 37+ messages in thread
From: Larry Woodman @ 2009-06-17 14:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Ingo Molnar, Frédéric Weisbecker,
	Li Zefan, Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm,
	rostedt, lwoodman, Linda Wang

[-- Attachment #1: Type: text/plain, Size: 3462 bytes --]

Rik van Riel wrote:
>
> Sorry I am replying to a really old email, but exactly
> what information do you believe would be more useful to
> extract from vmscan.c with tracepoints?
>
> What are the kinds of problems that customer systems
> (which cannot be rebooted into experimental kernels)
> run into, that can be tracked down with tracepoints?
>
> I can think of a few:
> - excessive CPU use in page reclaim code
> - excessive reclaim latency in page reclaim code
> - unbalanced memory allocation between zones/nodes
> - strange balance problems between reclaiming of page
>   cache and swapping out process pages
>
> I suspect we would need fairly fine grained tracepoints
> to track down these kinds of problems, with filtering
> and/or interpretation in userspace, but I am always
> interested in easier ways of tracking down these kinds
> of problems :)
>
> What kinds of tracepoints do you believe we would need?
>
> Or, using Larry's patch as a starting point, what do you
> believe should be changed?
>

Rik, I know these mm tracepoint patches produce a lot of output in the
trace buffer.  In a nutshell, what I have done is add them in critical
locations: places that allocate memory, map that memory into user
space, unmap it from user space, and free it.  In addition, I have
added tracepoints to important places in the memory allocation and
reclaim paths so we can see failures, stalls, and high latencies as
well as normal behavior.  Finally, I added them to the pdflush
operations so we can determine the amount of memory written back to
disk there versus via the swapout paths.  Perhaps if this is too many
tracepoints all at once we could focus mainly on those specific to the
page reclaim code path, since that is where most contention occurs?

Anonymous memory tracepoints:
1.) mm_anon_fault - initial anonymous pagefault.
2.) mm_anon_unmap - anonymous unmap triggered by page reclaim.
3.) mm_anon_userfree - anonymous memory unmap by user.
4.) mm_anon_cow - anonymous COW fault
5.) mm_anon_pgin - anonymous pagein from swap.

Filemap memory tracepoints:
1.) mm_filemap_fault - initial filemap fault.
2.) mm_filemap_cow - filemap COW fault.
3.) mm_filemap_userunmap - filemap unmap by user.
4.) mm_filemap_unmap - filemap unmap triggered by page reclaim.

Page allocation failure tracepoints:
1.) mm_page_allocation - page allocation that fails and causes page reclaim.

Page kswapd and direct reclaim tracepoints:
1.) mm_kswapd_ran - kswapd ran and tells us how many pages it reclaimed.
2.) mm_directreclaim_reclaimall - direct reclaim because free lists were 
below min.
3.) mm_directreclaim_reclaimzone - direct reclaim of a specific numa node.

Inner workings of the page reclaim tracepoints:
1.) mm_pagereclaim_shrinkzone - shrink zone, tells us how many pages
were reclaimed.
2.) mm_pagereclaim_shrinkinactive - shrink inactive list, tells us how
many pages were scanned and reclaimed.
3.) mm_pagereclaim_shrinkactive - shrink active list, tells us how
many pages were scanned.
4.) mm_pagereclaim_pgout - pageout, tells us which pages were paged out.
5.) mm_pagereclaim_free - tells us how many pages were freed in each 
page reclaim invocation.

Pagecache flushing tracepoints:
1.) mm_balance_dirty - tells us how many pages were written when dirty 
was above dirty_ratio.
2.) mm_pdflush_bgwriteout - tells us how many pages written when dirty 
was above dirty_background_ratio.
3.) mm_pdflush_kupdate - tells us how many pages kupdate wrote.



[-- Attachment #2: mmtracepoints-617.diff --]
[-- Type: text/plain, Size: 17240 bytes --]

diff --git a/include/trace/events/mm.h b/include/trace/events/mm.h
new file mode 100644
index 0000000..1d888a4
--- /dev/null
+++ b/include/trace/events/mm.h
@@ -0,0 +1,436 @@
+#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MM_H
+
+#include <linux/mm.h>
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mm
+
+TRACE_EVENT(mm_anon_fault,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+);
+
+TRACE_EVENT(mm_anon_pgin,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_anon_cow,
+
+	TP_PROTO(struct mm_struct *mm,
+			unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_anon_userfree,
+
+	TP_PROTO(struct mm_struct *mm,
+			unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_anon_unmap,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_filemap_fault,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address, int flag),
+	TP_ARGS(mm, address, flag),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+		__field(int, flag)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+		__entry->flag = flag;
+	),
+
+	TP_printk("%s: mm=%lx address=%lx",
+		__entry->flag ? "pagein" : "primary fault",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_filemap_cow,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_filemap_unmap,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_filemap_userunmap,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+	TP_ARGS(mm, address),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+	),
+
+	TP_printk("mm=%lx address=%lx",
+		(unsigned long)__entry->mm, __entry->address)
+	);
+
+TRACE_EVENT(mm_pagereclaim_pgout,
+
+	TP_PROTO(struct address_space *mapping, unsigned long offset, int anon),
+
+	TP_ARGS(mapping, offset, anon),
+
+	TP_STRUCT__entry(
+		__field(struct address_space *, mapping)
+		__field(unsigned long, offset)
+		__field(int, anon)
+	),
+
+	TP_fast_assign(
+		__entry->mapping = mapping;
+		__entry->offset = offset;
+		__entry->anon = anon;
+	),
+
+	TP_printk("mapping=%lx, offset=%lx %s",
+		(unsigned long)__entry->mapping, __entry->offset, 
+			__entry->anon ? "anonymous" : "pagecache")
+	);
+
+TRACE_EVENT(mm_pagereclaim_free,
+
+	TP_PROTO(unsigned long nr_reclaimed),
+
+	TP_ARGS(nr_reclaimed),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, nr_reclaimed)
+	),
+
+	TP_fast_assign(
+		__entry->nr_reclaimed = nr_reclaimed;
+	),
+
+	TP_printk("freed=%ld", __entry->nr_reclaimed)
+	);
+
+TRACE_EVENT(mm_pdflush_bgwriteout,
+
+	TP_PROTO(unsigned long written),
+
+	TP_ARGS(written),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, written)
+	),
+
+	TP_fast_assign(
+		__entry->written = written;
+	),
+
+	TP_printk("written=%ld", __entry->written)
+	);
+
+TRACE_EVENT(mm_pdflush_kupdate,
+
+	TP_PROTO(unsigned long writes),
+
+	TP_ARGS(writes),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, writes)
+	),
+
+	TP_fast_assign(
+		__entry->writes = writes;
+	),
+
+	TP_printk("writes=%ld", __entry->writes)
+	);
+
+TRACE_EVENT(mm_balance_dirty,
+
+	TP_PROTO(unsigned long written),
+
+	TP_ARGS(written),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, written)
+	),
+
+	TP_fast_assign(
+		__entry->written = written;
+	),
+
+	TP_printk("written=%ld", __entry->written)
+	);
+
+TRACE_EVENT(mm_page_allocation,
+
+	TP_PROTO(unsigned long free),
+
+	TP_ARGS(free),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, free)
+	),
+
+	TP_fast_assign(
+		__entry->free = free;
+	),
+
+	TP_printk("zone_free=%ld", __entry->free)
+	);
+
+TRACE_EVENT(mm_kswapd_ran,
+
+	TP_PROTO(struct pglist_data *pgdat, unsigned long reclaimed),
+
+	TP_ARGS(pgdat, reclaimed),
+
+	TP_STRUCT__entry(
+		__field(struct pglist_data *, pgdat)
+		__field(int, node_id)
+		__field(unsigned long, reclaimed)
+	),
+
+	TP_fast_assign(
+		__entry->pgdat = pgdat;
+		__entry->node_id = pgdat->node_id;
+		__entry->reclaimed = reclaimed;
+	),
+
+	TP_printk("node=%d reclaimed=%ld", __entry->node_id, __entry->reclaimed)
+	);
+
+TRACE_EVENT(mm_directreclaim_reclaimall,
+
+	TP_PROTO(int node, unsigned long reclaimed, unsigned long priority),
+
+	TP_ARGS(node, reclaimed, priority),
+
+	TP_STRUCT__entry(
+		__field(int, node)
+		__field(unsigned long, reclaimed)
+		__field(unsigned long, priority)
+	),
+
+	TP_fast_assign(
+		__entry->node = node;
+		__entry->reclaimed = reclaimed;
+		__entry->priority = priority;
+	),
+
+	TP_printk("node=%d reclaimed=%ld priority=%ld", __entry->node, __entry->reclaimed, 
+					__entry->priority)
+	);
+
+TRACE_EVENT(mm_directreclaim_reclaimzone,
+
+	TP_PROTO(int node, unsigned long reclaimed, unsigned long priority),
+
+	TP_ARGS(node, reclaimed, priority),
+
+	TP_STRUCT__entry(
+		__field(int, node)
+		__field(unsigned long, reclaimed)
+		__field(unsigned long, priority)
+	),
+
+	TP_fast_assign(
+		__entry->node = node;
+		__entry->reclaimed = reclaimed;
+		__entry->priority = priority;
+	),
+
+	TP_printk("node = %d reclaimed=%ld, priority=%ld",
+			__entry->node, __entry->reclaimed, __entry->priority)
+	);
+TRACE_EVENT(mm_pagereclaim_shrinkzone,
+
+	TP_PROTO(unsigned long reclaimed, unsigned long priority),
+
+	TP_ARGS(reclaimed, priority),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, reclaimed)
+		__field(unsigned long, priority)
+	),
+
+	TP_fast_assign(
+		__entry->reclaimed = reclaimed;
+		__entry->priority = priority;
+	),
+
+	TP_printk("reclaimed=%ld priority=%ld",
+			__entry->reclaimed, __entry->priority)
+	);
+
+TRACE_EVENT(mm_pagereclaim_shrinkactive,
+
+	TP_PROTO(unsigned long scanned, int file, int priority),
+
+	TP_ARGS(scanned, file, priority),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, scanned)
+		__field(int, file)
+		__field(int, priority)
+	),
+
+	TP_fast_assign(
+		__entry->scanned = scanned;
+		__entry->file = file;
+		__entry->priority = priority;
+	),
+
+	TP_printk("scanned=%ld, %s, priority=%d",
+		__entry->scanned, __entry->file ? "pagecache" : "anonymous",
+		__entry->priority)
+	);
+
+TRACE_EVENT(mm_pagereclaim_shrinkinactive,
+
+	TP_PROTO(unsigned long scanned, unsigned long reclaimed,
+			int priority),
+
+	TP_ARGS(scanned, reclaimed, priority),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, scanned)
+		__field(unsigned long, reclaimed)
+		__field(int, priority)
+	),
+
+	TP_fast_assign(
+		__entry->scanned = scanned;
+		__entry->reclaimed = reclaimed;
+		__entry->priority = priority;
+	),
+
+	TP_printk("scanned=%ld, reclaimed=%ld, priority=%d",
+		__entry->scanned, __entry->reclaimed, 
+		__entry->priority)
+	);
+
+#endif /* _TRACE_MM_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/filemap.c b/mm/filemap.c
index 1b60f30..af4a964 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <trace/events/mm.h>
 #include "internal.h"
 
 /*
@@ -1568,6 +1569,8 @@ retry_find:
 	 */
 	ra->prev_pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
 	vmf->page = page;
+	trace_mm_filemap_fault(vma->vm_mm, (unsigned long)vmf->virtual_address,
+			vmf->flags&FAULT_FLAG_NONLINEAR);
 	return ret | VM_FAULT_LOCKED;
 
 no_cached_page:
diff --git a/mm/memory.c b/mm/memory.c
index 4126dd1..a4a580c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/pgtable.h>
+#include <trace/events/mm.h>
 
 #include "internal.h"
 
@@ -812,15 +813,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 						addr) != page->index)
 				set_pte_at(mm, addr, pte,
 					   pgoff_to_pte(page->index));
-			if (PageAnon(page))
+			if (PageAnon(page)) {
 				anon_rss--;
-			else {
+				trace_mm_anon_userfree(mm, addr);
+			} else {
 				if (pte_dirty(ptent))
 					set_page_dirty(page);
 				if (pte_young(ptent) &&
 				    likely(!VM_SequentialReadHint(vma)))
 					mark_page_accessed(page);
 				file_rss--;
+				trace_mm_filemap_userunmap(mm, addr);
 			}
 			page_remove_rmap(page);
 			if (unlikely(page_mapcount(page) < 0))
@@ -1896,7 +1899,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
 		spinlock_t *ptl, pte_t orig_pte)
 {
-	struct page *old_page, *new_page;
+	struct page *old_page, *new_page = NULL;
 	pte_t entry;
 	int reuse = 0, ret = 0;
 	int page_mkwrite = 0;
@@ -2050,9 +2053,12 @@ gotten:
 			if (!PageAnon(old_page)) {
 				dec_mm_counter(mm, file_rss);
 				inc_mm_counter(mm, anon_rss);
+				trace_mm_filemap_cow(mm, address);
 			}
-		} else
+		} else {
 			inc_mm_counter(mm, anon_rss);
+			trace_mm_anon_cow(mm, address);
+		}
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2449,7 +2455,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		int write_access, pte_t orig_pte)
 {
 	spinlock_t *ptl;
-	struct page *page;
+	struct page *page = NULL;
 	swp_entry_t entry;
 	pte_t pte;
 	struct mem_cgroup *ptr = NULL;
@@ -2549,6 +2555,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 out:
+	trace_mm_anon_pgin(mm, address);
 	return ret;
 out_nomap:
 	mem_cgroup_cancel_charge_swapin(ptr);
@@ -2582,6 +2589,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto oom;
 	__SetPageUptodate(page);
 
+	trace_mm_anon_fault(mm, address);
 	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
 		goto oom_free_page;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bb553c3..ef92a97 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -34,6 +34,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <trace/events/mm.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -574,6 +575,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		congestion_wait(WRITE, HZ/10);
 	}
 
+	trace_mm_balance_dirty(pages_written);
 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
 			bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
@@ -716,6 +718,7 @@ static void background_writeout(unsigned long _min_pages)
 				break;
 		}
 	}
+	trace_mm_pdflush_bgwriteout(_min_pages);
 }
 
 /*
@@ -776,6 +779,7 @@ static void wb_kupdate(unsigned long arg)
 	nr_to_write = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
 			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+	trace_mm_pdflush_kupdate(nr_to_write);
 	while (nr_to_write > 0) {
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0727896..ca9355e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -48,6 +48,7 @@
 #include <linux/page_cgroup.h>
 #include <linux/debugobjects.h>
 #include <linux/kmemleak.h>
+#include <trace/events/mm.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1440,6 +1441,7 @@ zonelist_scan:
 				mark = zone->pages_high;
 			if (!zone_watermark_ok(zone, order, mark,
 				    classzone_idx, alloc_flags)) {
+				trace_mm_page_allocation(zone_page_state(zone, NR_FREE_PAGES));
 				if (!zone_reclaim_mode ||
 				    !zone_reclaim(zone, gfp_mask, order))
 					goto this_zone_full;
diff --git a/mm/rmap.c b/mm/rmap.c
index 23122af..f2156ca 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -50,6 +50,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
+#include <trace/events/mm.h>
 
 #include <asm/tlbflush.h>
 
@@ -1025,6 +1026,7 @@ static int try_to_unmap_anon(struct page *page, int unlock, int migration)
 			if (mlocked)
 				break;	/* stop if actually mlocked page */
 		}
+		trace_mm_anon_unmap(vma->vm_mm, vma->vm_start+page->index);
 	}
 
 	page_unlock_anon_vma(anon_vma);
@@ -1152,6 +1154,7 @@ static int try_to_unmap_file(struct page *page, int unlock, int migration)
 					goto out;
 			}
 			vma->vm_private_data = (void *) max_nl_cursor;
+			trace_mm_filemap_unmap(vma->vm_mm, vma->vm_start+page->index);
 		}
 		cond_resched_lock(&mapping->i_mmap_lock);
 		max_nl_cursor += CLUSTER_SIZE;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 95c08a8..bed7125 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,8 @@
 #include <linux/memcontrol.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/mm.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -417,6 +419,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
+		trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT,
+						PageAnon(page));
 		return PAGE_SUCCESS;
 	}
 
@@ -796,6 +800,7 @@ keep:
 	if (pagevec_count(&freed_pvec))
 		__pagevec_free(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
+	trace_mm_pagereclaim_free(nr_reclaimed);
 	return nr_reclaimed;
 }
 
@@ -1182,6 +1187,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 done:
 	local_irq_enable();
 	pagevec_release(&pvec);
+	trace_mm_pagereclaim_shrinkinactive(nr_scanned, nr_reclaimed,
+				priority);
 	return nr_reclaimed;
 }
 
@@ -1316,6 +1323,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	if (buffer_heads_over_limit)
 		pagevec_strip(&pvec);
 	pagevec_release(&pvec);
+	trace_mm_pagereclaim_shrinkactive(pgscanned, file, priority);
 }
 
 static int inactive_anon_is_low_global(struct zone *zone)
@@ -1516,6 +1524,7 @@ static void shrink_zone(int priority, struct zone *zone,
 	}
 
 	sc->nr_reclaimed = nr_reclaimed;
+	trace_mm_pagereclaim_shrinkzone(nr_reclaimed, priority);
 
 	/*
 	 * Even if we did not try to evict anon pages at all, we want to
@@ -1678,6 +1687,8 @@ out:
 	if (priority < 0)
 		priority = 0;
 
+	trace_mm_directreclaim_reclaimall(zonelist[0]._zonerefs->zone->node,
+						sc->nr_reclaimed, priority);
 	if (scanning_global_lru(sc)) {
 		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 
@@ -1947,6 +1958,7 @@ out:
 		goto loop_again;
 	}
 
+	trace_mm_kswapd_ran(pgdat, sc.nr_reclaimed);
 	return sc.nr_reclaimed;
 }
 
@@ -2299,7 +2311,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
+	int priority = ZONE_RECLAIM_PRIORITY;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2366,6 +2378,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 
 	p->reclaim_state = NULL;
 	current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
+	trace_mm_directreclaim_reclaimzone(zone->node,
+				sc.nr_reclaimed, priority);
 	return sc.nr_reclaimed >= nr_pages;
 }
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [Patch] mm tracepoints update - use case.
  2009-06-15 18:26             ` Rik van Riel
@ 2009-06-18  7:57               ` KOSAKI Motohiro
  -1 siblings, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2009-06-18  7:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Larry Woodman, Ingo Molnar,
	Frédéric Weisbecker, Li Zefan, Pekka Enberg,
	eduard.munteanu, linux-kernel, linux-mm, rostedt

Hi

sorry for the delay in replying.
your questions are always difficult...


> KOSAKI Motohiro wrote:
> >> On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:
> 
> >> Attached is an example of what the mm tracepoints can be used for:
> > 
> > I have some comments.
> > 
> > 1. Yes, current zone_reclaim has strange behavior. I plan to fix
> >    some of its bug-like behavior.
> > 2. Your scenario only uses the information that "zone_reclaim was called".
> >    The function tracer already provides that.
> > 3. But yes, you are going in the proper direction. We definitely need
> >    some fine-grained tracepoints in this area; you are welcome here.
> >    But in my personal feeling, your tracepoints have a lot of worthless
> >    arguments. We need better information.
> >    I think I can help you in this area. I hope we can work together.
> 
> Sorry I am replying to a really old email, but exactly
> what information do you believe would be more useful to
> extract from vmscan.c with tracepoints?
> 
> What are the kinds of problems that customer systems
> (which cannot be rebooted into experimental kernels)
> run into, that can be tracked down with tracepoints?
> 
> I can think of a few:
> - excessive CPU use in page reclaim code
> - excessive reclaim latency in page reclaim code
> - unbalanced memory allocation between zones/nodes
> - strange balance problems between reclaiming of page
>    cache and swapping out process pages
> 
> I suspect we would need fairly fine grained tracepoints
> to track down these kinds of problems, with filtering
> and/or interpretation in userspace, but I am always
> interested in easier ways of tracking down these kinds
> of problems :)
> 
> What kinds of tracepoints do you believe we would need?
> 
> Or, using Larry's patch as a starting point, what do you
> believe should be changed?

OK, I recognize we need more use-case discussion.
The following scenarios are the issues I receive most frequently.
(perhaps there are unwritten ones too, but I don't recall them now)

Scenario 1. The OOM killer ran. Why, and what triggered it?
Scenario 2. Page allocation failure caused by memory fragmentation.
Scenario 3. try_to_free_pages() shows very long latency. Why?
Scenario 4. sar output shows that free memory dropped dramatically 10 minutes ago, and
            it has already recovered now. What happened?

  - suspects
    - kernel memory leak
    - userland memory leak
    - a stupid driver using too much memory
    - a userland application suddenly starting to use a lot of memory

  - what information is valuable?
    - slab usage information (kmemtrace already does this)
    - page allocator usage information
    - rss of all processes when the OOM happened
    - why couldn't recent try_to_free_pages() calls reclaim any pages?
    - recent syscall history
    - buddy fragmentation info


Plus, two more requirements here:
1. trace page refault distance (like Rik's past /proc/refault patch)

2. file cache visualizer - which files use a lot of page cache?
   - afaik, Wu Fengguang is working on this issue.


--------------------------------------------
And here are my review comments on his patch.
btw, I haven't fully reviewed it yet, so I might be overlooking something.


First, some general review comments.

- Please don't display mm and/or other raw kernel pointers.
  If we assume a non-stop system, we can't use a kernel dump, so kernel
  pointer logging is not very useful.
  No userland tool can parse it. (/proc/kcore doesn't help this
  situation; the pointer might be freed before parsing.)
- Please make this a patch series. One big patch is harder to review.
- Please write a patch description and use case.
- Please consider how this feature works with mem-cgroup.
  (IOW, please don't ignore the many "if (scanning_global_lru())" checks.)
- A tracepoint caller shouldn't make any assumptions about the display
  representation.
  e.g.
    wrong)  trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
    good)   trace_mm_pagereclaim_pgout(mapping, page)
  that's the general, proper manner for callbacks and hooks.
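
For illustration, a minimal sketch of that suggestion applied to
mm_pagereclaim_pgout (hypothetical, not part of the posted patch): the
call site passes the raw struct page and the event derives the display
values itself in TP_fast_assign:

TRACE_EVENT(mm_pagereclaim_pgout,

	TP_PROTO(struct address_space *mapping, struct page *page),

	TP_ARGS(mapping, page),

	TP_STRUCT__entry(
		__field(struct address_space *, mapping)
		__field(unsigned long, offset)
		__field(int, anon)
	),

	TP_fast_assign(
		__entry->mapping = mapping;
		/* derive display values here, not at the call site */
		__entry->offset = page->index << PAGE_SHIFT;
		__entry->anon = PageAnon(page);
	),

	TP_printk("mapping=%lx, offset=%lx %s",
		(unsigned long)__entry->mapping, __entry->offset,
		__entry->anon ? "anonymous" : "pagecache")
);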




> diff --git a/include/trace/events/mm.h b/include/trace/events/mm.h
> new file mode 100644
> index 0000000..1d888a4
> --- /dev/null
> +++ b/include/trace/events/mm.h
> @@ -0,0 +1,436 @@
> +#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_MM_H
> +
> +#include <linux/mm.h>
> +#include <linux/tracepoint.h>
> +
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM mm
> +
> +TRACE_EVENT(mm_anon_fault,
> +
> +	TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> +	TP_ARGS(mm, address),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(unsigned long, address)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->address = address;
> +	),
> +
> +	TP_printk("mm=%lx address=%lx",
> +		(unsigned long)__entry->mm, __entry->address)
> +);
> +
> +TRACE_EVENT(mm_anon_pgin,
> +
> +	TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> +	TP_ARGS(mm, address),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(unsigned long, address)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->address = address;
> +	),
> +
> +	TP_printk("mm=%lx address=%lx",
> +		(unsigned long)__entry->mm, __entry->address)
> +	);
> +
> +TRACE_EVENT(mm_anon_cow,
> +
> +	TP_PROTO(struct mm_struct *mm,
> +			unsigned long address),
> +
> +	TP_ARGS(mm, address),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(unsigned long, address)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->address = address;
> +	),
> +
> +	TP_printk("mm=%lx address=%lx",
> +		(unsigned long)__entry->mm, __entry->address)
> +	);
> +
> +TRACE_EVENT(mm_anon_userfree,
> +
> +	TP_PROTO(struct mm_struct *mm,
> +			unsigned long address),
> +
> +	TP_ARGS(mm, address),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(unsigned long, address)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->address = address;
> +	),
> +
> +	TP_printk("mm=%lx address=%lx",
> +		(unsigned long)__entry->mm, __entry->address)
> +	);
> +
> +TRACE_EVENT(mm_anon_unmap,
> +
> +	TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> +	TP_ARGS(mm, address),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(unsigned long, address)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->address = address;
> +	),
> +
> +	TP_printk("mm=%lx address=%lx",
> +		(unsigned long)__entry->mm, __entry->address)
> +	);
> +
> +TRACE_EVENT(mm_filemap_fault,
> +
> +	TP_PROTO(struct mm_struct *mm, unsigned long address, int flag),
> +	TP_ARGS(mm, address, flag),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(unsigned long, address)
> +		__field(int, flag)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->address = address;
> +		__entry->flag = flag;
> +	),
> +
> +	TP_printk("%s: mm=%lx address=%lx",
> +		__entry->flag ? "pagein" : "primary fault",
> +		(unsigned long)__entry->mm, __entry->address)
> +	);
> +
> +TRACE_EVENT(mm_filemap_cow,
> +
> +	TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> +	TP_ARGS(mm, address),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(unsigned long, address)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->address = address;
> +	),
> +
> +	TP_printk("mm=%lx address=%lx",
> +		(unsigned long)__entry->mm, __entry->address)
> +	);
> +
> +TRACE_EVENT(mm_filemap_unmap,
> +
> +	TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> +	TP_ARGS(mm, address),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(unsigned long, address)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->address = address;
> +	),
> +
> +	TP_printk("mm=%lx address=%lx",
> +		(unsigned long)__entry->mm, __entry->address)
> +	);
> +
> +TRACE_EVENT(mm_filemap_userunmap,
> +
> +	TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> +	TP_ARGS(mm, address),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(unsigned long, address)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->address = address;
> +	),
> +
> +	TP_printk("mm=%lx address=%lx",
> +		(unsigned long)__entry->mm, __entry->address)
> +	);
> +
> +TRACE_EVENT(mm_pagereclaim_pgout,
> +
> +	TP_PROTO(struct address_space *mapping, unsigned long offset, int anon),
> +
> +	TP_ARGS(mapping, offset, anon),
> +
> +	TP_STRUCT__entry(
> +		__field(struct address_space *, mapping)
> +		__field(unsigned long, offset)
> +		__field(int, anon)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mapping = mapping;
> +		__entry->offset = offset;
> +		__entry->anon = anon;
> +	),
> +
> +	TP_printk("mapping=%lx, offset=%lx %s",
> +		(unsigned long)__entry->mapping, __entry->offset, 
> +			__entry->anon ? "anonymous" : "pagecache")
> +	);
> +
> +TRACE_EVENT(mm_pagereclaim_free,
> +
> +	TP_PROTO(unsigned long nr_reclaimed),
> +
> +	TP_ARGS(nr_reclaimed),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, nr_reclaimed)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->nr_reclaimed = nr_reclaimed;
> +	),
> +
> +	TP_printk("freed=%ld", __entry->nr_reclaimed)
> +	);
> +
> +TRACE_EVENT(mm_pdflush_bgwriteout,
> +
> +	TP_PROTO(unsigned long written),
> +
> +	TP_ARGS(written),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, written)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->written = written;
> +	),
> +
> +	TP_printk("written=%ld", __entry->written)
> +	);
> +
> +TRACE_EVENT(mm_pdflush_kupdate,
> +
> +	TP_PROTO(unsigned long writes),
> +
> +	TP_ARGS(writes),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, writes)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->writes = writes;
> +	),
> +
> +	TP_printk("writes=%ld", __entry->writes)
> +	);
> +
> +TRACE_EVENT(mm_balance_dirty,
> +
> +	TP_PROTO(unsigned long written),
> +
> +	TP_ARGS(written),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, written)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->written = written;
> +	),
> +
> +	TP_printk("written=%ld", __entry->written)
> +	);
> +
> +TRACE_EVENT(mm_page_allocation,
> +
> +	TP_PROTO(unsigned long free),
> +
> +	TP_ARGS(free),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, free)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->free = free;
> +	),
> +
> +	TP_printk("zone_free=%ld", __entry->free)
> +	);
> +
> +TRACE_EVENT(mm_kswapd_ran,
> +
> +	TP_PROTO(struct pglist_data *pgdat, unsigned long reclaimed),
> +
> +	TP_ARGS(pgdat, reclaimed),
> +
> +	TP_STRUCT__entry(
> +		__field(struct pglist_data *, pgdat)
> +		__field(int, node_id)
> +		__field(unsigned long, reclaimed)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->pgdat = pgdat;
> +		__entry->node_id = pgdat->node_id;
> +		__entry->reclaimed = reclaimed;
> +	),
> +
> +	TP_printk("node=%d reclaimed=%ld", __entry->node_id, __entry->reclaimed)
> +	);
> +
> +TRACE_EVENT(mm_directreclaim_reclaimall,
> +
> +	TP_PROTO(int node, unsigned long reclaimed, unsigned long priority),
> +
> +	TP_ARGS(node, reclaimed, priority),
> +
> +	TP_STRUCT__entry(
> +		__field(int, node)
> +		__field(unsigned long, reclaimed)
> +		__field(unsigned long, priority)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->node = node;
> +		__entry->reclaimed = reclaimed;
> +		__entry->priority = priority;
> +	),
> +
> +	TP_printk("node=%d reclaimed=%ld priority=%ld", __entry->node, __entry->reclaimed, 
> +					__entry->priority)
> +	);
> +
> +TRACE_EVENT(mm_directreclaim_reclaimzone,
> +
> +	TP_PROTO(int node, unsigned long reclaimed, unsigned long priority),
> +
> +	TP_ARGS(node, reclaimed, priority),
> +
> +	TP_STRUCT__entry(
> +		__field(int, node)
> +		__field(unsigned long, reclaimed)
> +		__field(unsigned long, priority)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->node = node;
> +		__entry->reclaimed = reclaimed;
> +		__entry->priority = priority;
> +	),
> +
> +	TP_printk("node = %d reclaimed=%ld, priority=%ld",
> +			__entry->node, __entry->reclaimed, __entry->priority)
> +	);
> +TRACE_EVENT(mm_pagereclaim_shrinkzone,
> +
> +	TP_PROTO(unsigned long reclaimed, unsigned long priority),
> +
> +	TP_ARGS(reclaimed, priority),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, reclaimed)
> +		__field(unsigned long, priority)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->reclaimed = reclaimed;
> +		__entry->priority = priority;
> +	),
> +
> +	TP_printk("reclaimed=%ld priority=%ld",
> +			__entry->reclaimed, __entry->priority)
> +	);
> +
> +TRACE_EVENT(mm_pagereclaim_shrinkactive,
> +
> +	TP_PROTO(unsigned long scanned, int file, int priority),
> +
> +	TP_ARGS(scanned, file, priority),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, scanned)
> +		__field(int, file)
> +		__field(int, priority)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->scanned = scanned;
> +		__entry->file = file;
> +		__entry->priority = priority;
> +	),
> +
> +	TP_printk("scanned=%ld, %s, priority=%d",
> +		__entry->scanned, __entry->file ? "pagecache" : "anonymous",
> +		__entry->priority)
> +	);
> +
> +TRACE_EVENT(mm_pagereclaim_shrinkinactive,
> +
> +	TP_PROTO(unsigned long scanned, unsigned long reclaimed,
> +			int priority),
> +
> +	TP_ARGS(scanned, reclaimed, priority),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, scanned)
> +		__field(unsigned long, reclaimed)
> +		__field(int, priority)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->scanned = scanned;
> +		__entry->reclaimed = reclaimed;
> +		__entry->priority = priority;
> +	),
> +
> +	TP_printk("scanned=%ld, reclaimed=%ld, priority=%d",
> +		__entry->scanned, __entry->reclaimed, 
> +		__entry->priority)
> +	);
> +
> +#endif /* _TRACE_MM_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1b60f30..af4a964 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -34,6 +34,7 @@
>  #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
>  #include <linux/memcontrol.h>
>  #include <linux/mm_inline.h> /* for page_is_file_cache() */
> +#include <trace/events/mm.h>
>  #include "internal.h"
>  
>  /*
> @@ -1568,6 +1569,8 @@ retry_find:
>  	 */
>  	ra->prev_pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
>  	vmf->page = page;
> +	trace_mm_filemap_fault(vma->vm_mm, (unsigned long)vmf->virtual_address,
> +			vmf->flags&FAULT_FLAG_NONLINEAR);
>  	return ret | VM_FAULT_LOCKED;
>
>  no_cached_page:
> diff --git a/mm/memory.c b/mm/memory.c
> index 4126dd1..a4a580c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -61,6 +61,7 @@
>  #include <asm/tlb.h>
>  #include <asm/tlbflush.h>
>  #include <asm/pgtable.h>
> +#include <trace/events/mm.h>
>  
>  #include "internal.h"
>  
> @@ -812,15 +813,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  						addr) != page->index)
>  				set_pte_at(mm, addr, pte,
>  					   pgoff_to_pte(page->index));
> -			if (PageAnon(page))
> +			if (PageAnon(page)) {
>  				anon_rss--;
> -			else {
> +				trace_mm_anon_userfree(mm, addr);
> +			} else {
>  				if (pte_dirty(ptent))
>  					set_page_dirty(page);
>  				if (pte_young(ptent) &&
>  				    likely(!VM_SequentialReadHint(vma)))
>  					mark_page_accessed(page);
>  				file_rss--;
> +				trace_mm_filemap_userunmap(mm, addr);
>  			}
>  			page_remove_rmap(page);
>  			if (unlikely(page_mapcount(page) < 0))
> @@ -1896,7 +1899,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		unsigned long address, pte_t *page_table, pmd_t *pmd,
>  		spinlock_t *ptl, pte_t orig_pte)
>  {
> -	struct page *old_page, *new_page;
> +	struct page *old_page, *new_page = NULL;
>  	pte_t entry;
>  	int reuse = 0, ret = 0;
>  	int page_mkwrite = 0;
> @@ -2050,9 +2053,12 @@ gotten:
>  			if (!PageAnon(old_page)) {
>  				dec_mm_counter(mm, file_rss);
>  				inc_mm_counter(mm, anon_rss);
> +				trace_mm_filemap_cow(mm, address);
>  			}
> -		} else
> +		} else {
>  			inc_mm_counter(mm, anon_rss);
> +			trace_mm_anon_cow(mm, address);
> +		}
>  		flush_cache_page(vma, address, pte_pfn(orig_pte));
>  		entry = mk_pte(new_page, vma->vm_page_prot);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> @@ -2449,7 +2455,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		int write_access, pte_t orig_pte)
>  {
>  	spinlock_t *ptl;
> -	struct page *page;
> +	struct page *page = NULL;
>  	swp_entry_t entry;
>  	pte_t pte;
>  	struct mem_cgroup *ptr = NULL;
> @@ -2549,6 +2555,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
>  out:
> +	trace_mm_anon_pgin(mm, address);
>  	return ret;
>  out_nomap:
>  	mem_cgroup_cancel_charge_swapin(ptr);

In swap-in, you trace "mm" and "virtual address", but in swap-out, you
trace "mapping" and "offset".

Oh well, we can't correlate the swap-in and swap-out logs. Please
consider making the input and output sides symmetric.


> @@ -2582,6 +2589,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		goto oom;
>  	__SetPageUptodate(page);
>  
> +	trace_mm_anon_fault(mm, address);
>  	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
>  		goto oom_free_page;
>  
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index bb553c3..ef92a97 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -34,6 +34,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/buffer_head.h>
>  #include <linux/pagevec.h>
> +#include <trace/events/mm.h>
>  
>  /*
>   * The maximum number of pages to writeout in a single bdflush/kupdate
> @@ -574,6 +575,7 @@ static void balance_dirty_pages(struct address_space *mapping)
>  		congestion_wait(WRITE, HZ/10);
>  	}
>  
> +	trace_mm_balance_dirty(pages_written);

Perhaps you need to explain why this tracepoint is useful;
I haven't used this kind of log in my past debugging.

Perhaps, if you only need the number of written pages, a new vmstat field is
more useful?
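
For example, an event counter would only take a couple of lines. This is a
sketch; PGBALANCE_WRITTEN is a made-up name, not an existing vm_event_item:

-----------------------------------------------------
/* include/linux/vmstat.h: add a new item to enum vm_event_item */
	PGBALANCE_WRITTEN,

/* mm/page-writeback.c, balance_dirty_pages(): account instead of tracing */
	count_vm_events(PGBALANCE_WRITTEN, pages_written);
-----------------------------------------------------

It would then show up in /proc/vmstat like any other event counter.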


>  	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
>  			bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
> @@ -716,6 +718,7 @@ static void background_writeout(unsigned long _min_pages)
>  				break;
>  		}
>  	}
> +	trace_mm_pdflush_bgwriteout(_min_pages);
>  }

ditto.


>  
>  /*
> @@ -776,6 +779,7 @@ static void wb_kupdate(unsigned long arg)
>  	nr_to_write = global_page_state(NR_FILE_DIRTY) +
>  			global_page_state(NR_UNSTABLE_NFS) +
>  			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
> +	trace_mm_pdflush_kupdate(nr_to_write);
>  	while (nr_to_write > 0) {
>  		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;

ditto.


> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0727896..ca9355e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -48,6 +48,7 @@
>  #include <linux/page_cgroup.h>
>  #include <linux/debugobjects.h>
>  #include <linux/kmemleak.h>
> +#include <trace/events/mm.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -1440,6 +1441,7 @@ zonelist_scan:
>  				mark = zone->pages_high;
>  			if (!zone_watermark_ok(zone, order, mark,
>  				    classzone_idx, alloc_flags)) {
> +				trace_mm_page_allocation(zone_page_state(zone, NR_FREE_PAGES));
>  				if (!zone_reclaim_mode ||
>  				    !zone_reclaim(zone, gfp_mask, order))
>  					goto this_zone_full;

Bad name:
it is not a notification of an allocation.

Plus, this is the wrong place too. It doesn't mean the allocation failed;

it only means one zone does not have sufficient memory.
However, this tracepoint doesn't take a zone argument, so it is hardly useful.

Plus, NR_FREE_PAGES is not sufficient information. The most common cause
of allocation failure is not a low NR_FREE_PAGES; it is buddy fragmentation.
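
A hypothetical event that carries the zone and the requested order would be
far more useful. The name and fields below are illustrative, not from the
patch:

-----------------------------------------------------
/* sketch: record which zone failed the watermark check, and for what order */
TRACE_EVENT(mm_zone_watermark_fail,

	TP_PROTO(struct zone *zone, int order, unsigned long free),

	TP_ARGS(zone, order, free),

	TP_STRUCT__entry(
		__field(int, nid)
		__field(int, zid)
		__field(int, order)
		__field(unsigned long, free)
	),

	TP_fast_assign(
		__entry->nid = zone->zone_pgdat->node_id;
		__entry->zid = zone_idx(zone);
		__entry->order = order;
		__entry->free = free;
	),

	TP_printk("node=%d zone=%d order=%d zone_free=%lu",
		__entry->nid, __entry->zid, __entry->order, __entry->free)
	);

/* call site in get_page_from_freelist() */
	trace_mm_zone_watermark_fail(zone, order,
			zone_page_state(zone, NR_FREE_PAGES));
-----------------------------------------------------

A histogram over the order field would also expose the buddy fragmentation
case.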




> diff --git a/mm/rmap.c b/mm/rmap.c
> index 23122af..f2156ca 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -50,6 +50,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/migrate.h>
> +#include <trace/events/mm.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -1025,6 +1026,7 @@ static int try_to_unmap_anon(struct page *page, int unlock, int migration)
>  			if (mlocked)
>  				break;	/* stop if actually mlocked page */
>  		}
> +		trace_mm_anon_unmap(vma->vm_mm, vma->vm_start+page->index);
>  	}
>  
>  	page_unlock_anon_vma(anon_vma);
> @@ -1152,6 +1154,7 @@ static int try_to_unmap_file(struct page *page, int unlock, int migration)
>  					goto out;
>  			}
>  			vma->vm_private_data = (void *) max_nl_cursor;
> +			trace_mm_filemap_unmap(vma->vm_mm, vma->vm_start+page->index);
>  		}
>  		cond_resched_lock(&mapping->i_mmap_lock);
>  		max_nl_cursor += CLUSTER_SIZE;

try_to_unmap() and try_to_unlock() are pretty different.
Maybe we only need the try_to_unmap() case?




> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 95c08a8..bed7125 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -40,6 +40,8 @@
>  #include <linux/memcontrol.h>
>  #include <linux/delayacct.h>
>  #include <linux/sysctl.h>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/mm.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -417,6 +419,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
>  			ClearPageReclaim(page);
>  		}
>  		inc_zone_page_state(page, NR_VMSCAN_WRITE);
> +		trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT,
> +						PageAnon(page));

I don't think this is useful information.

for file-mapped)
  the [mapping, offset] pair represents which portion of the file this cache page holds.
for swap-backed)
  [process, virtual_address] does.


Plus, I have one question: how do we combine this information with blktrace?
If we can't see the relationship to I/O activity, this is really not useful.


>  		return PAGE_SUCCESS;
>  	}
>  
> @@ -796,6 +800,7 @@ keep:
>  	if (pagevec_count(&freed_pvec))
>  		__pagevec_free(&freed_pvec);
>  	count_vm_events(PGACTIVATE, pgactivate);
> +	trace_mm_pagereclaim_free(nr_reclaimed);
>  	return nr_reclaimed;
>  }

No.
If an administrator only needs the number of freed pages,
/proc/meminfo and /proc/vmstat already provide it.

But I don't think it is sufficient information.
May I ask when you would use this tracepoint, and why?



>  
> @@ -1182,6 +1187,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
>  done:
>  	local_irq_enable();
>  	pagevec_release(&pvec);
> +	trace_mm_pagereclaim_shrinkinactive(nr_scanned, nr_reclaimed,
> +				priority);
>  	return nr_reclaimed;
>  }
>  
> @@ -1316,6 +1323,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>  	if (buffer_heads_over_limit)
>  		pagevec_strip(&pvec);
>  	pagevec_release(&pvec);
> +	trace_mm_pagereclaim_shrinkactive(pgscanned, file, priority);
>  }
>  
>  static int inactive_anon_is_low_global(struct zone *zone)
> @@ -1516,6 +1524,7 @@ static void shrink_zone(int priority, struct zone *zone,
>  	}
>  
>  	sc->nr_reclaimed = nr_reclaimed;
> +	trace_mm_pagereclaim_shrinkzone(nr_reclaimed, priority);
>  
>  	/*
>  	 * Even if we did not try to evict anon pages at all, we want to
> @@ -1678,6 +1687,8 @@ out:
>  	if (priority < 0)
>  		priority = 0;
>  
> +	trace_mm_directreclaim_reclaimall(zonelist[0]._zonerefs->zone->node,
> +						sc->nr_reclaimed, priority);
>  	if (scanning_global_lru(sc)) {
>  		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>  

Why do you want to log the node? Why not the zone itself?

Plus, why do you ignore try_to_free_pages() latency?
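
Latency could be captured with a begin/end pair around the reclaim itself, so
userspace can compute the delta from the ring-buffer timestamps. The event
names here are made up, not from the patch:

-----------------------------------------------------
/* mm/vmscan.c, try_to_free_pages(): bracket the actual work */
	trace_mm_directreclaim_begin(order, gfp_mask);
	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
	trace_mm_directreclaim_end(nr_reclaimed);
-----------------------------------------------------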



> @@ -1947,6 +1958,7 @@ out:
>  		goto loop_again;
>  	}
>  
> +	trace_mm_kswapd_ran(pgdat, sc.nr_reclaimed);
>  	return sc.nr_reclaimed;
>  }
>  

Isn't this equal to the kswapd_steal field in /proc/vmstat?


> @@ -2299,7 +2311,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	const unsigned long nr_pages = 1 << order;
>  	struct task_struct *p = current;
>  	struct reclaim_state reclaim_state;
> -	int priority;
> +	int priority = ZONE_RECLAIM_PRIORITY;
>  	struct scan_control sc = {
>  		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
>  		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> @@ -2366,6 +2378,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  
>  	p->reclaim_state = NULL;
>  	current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
> +	trace_mm_directreclaim_reclaimzone(zone->node,
> +				sc.nr_reclaimed, priority);
>  	return sc.nr_reclaimed >= nr_pages;
>  }

This is _zone_ reclaim, but the code passes the node.
Plus, if we intend to log page allocation and reclaim, we shouldn't ignore the
gfp_mask.

It changes much of the allocation/reclaim behavior.
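
E.g. a sketch that keeps the existing event name but widens its arguments
(the TRACE_EVENT definition would have to change to match):

-----------------------------------------------------
	trace_mm_directreclaim_reclaimzone(zone, gfp_mask,
				sc.nr_reclaimed, priority);
-----------------------------------------------------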


----
My current conclusion is that nobody would use this patch on his own system;
the patch has many tracepoints whose usefulness is unclear.

At least, splitting the patch is needed for a productive discussion,
  e.g.:
   - reclaim IO activity tracing
   - memory fragmentation visualizer
   - per-inode page cache visualizer (like Wu's filecache patch)
   - reclaim failure reason tracing and aggregation ftrace plugin
   - reclaim latency tracing


I'd be glad if Larry resubmitted this effort.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Patch] mm tracepoints update - use case.
  2009-06-18  7:57               ` KOSAKI Motohiro
@ 2009-06-18 19:22                 ` Larry Woodman
  -1 siblings, 0 replies; 37+ messages in thread
From: Larry Woodman @ 2009-06-18 19:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Ingo Molnar, Frédéric Weisbecker,
	Li Zefan, Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm,
	rostedt

On Thu, 2009-06-18 at 16:57 +0900, KOSAKI Motohiro wrote:

Thanks for the feedback Kosaki!


> Scenario 1. OOM killer happend. why? and who bring it?

Doesn't the showmem() and stack trace to the console when the OOM kill
occurred show enough in the majority of cases?  I realize that direct
alloc_pages() calls are not accounted for here but that can be really
invasive.

> Scenario 2. page allocation failure by memory fragmentation

Are you talking about order>0 allocation failures here?  Most of the
slabs are single page allocations now.

> Scenario 3. try_to_free_pages() makes very long latency. why?

This is available in the mm tracepoints, they all include timestamps.

> Scenario 4. sar output that free memory dramatically reduced at 10 minute ago, and
>             it already recover now. What's happen?

Is this really important?  It would take buffering lots of data to
figure out what happened in the past.

> 
>   - suspects
>     - kernel memory leak

Other than direct callers to the page allocator, isn't that covered with
the kmemtrace stuff?

>     - userland memory leak

The mm tracepoints track all user space allocations and frees (perhaps
too many?).

>     - stupid driver use too much memory

hopefully kmemtrace will catch this?

>     - userland application suddenly start to use much memory

The mm tracepoints track all user space allocations and frees.

> 
>   - what information are valuable?
>     - slab usage information (kmemtrace already does)
>     - page allocator usage information
>     - rss of all processes at oom happend
>     - why recent try_to_free_pages() can't reclaim any page?

The counters in the mm tracepoints do give counts but not the reasons
that the pagereclaim code fails.

>     - recent sycall history
>     - buddy fragmentation info
> 
> 
> Plus, another requirement here
> 1. trace page refault distance (likes past Rik's /proc/refault patch)
> 
> 2. file cache visualizer - Which file use many page-cache?
>    - afaik, Wu Fengguang is working on this issue.
> 
> 
> --------------------------------------------
> And, here is my reviewing comment to his patch.
> btw, I haven't full review it yet. perhaps I might be overlooking something.
> 
> 
> First, this is general review comment.
> 
> - Please don't display mm and/or another kernel raw pointer.
>   if we assume non stop system, we can't use kernel-dump. Thus kernel pointer
>   logging is not so useful.

OK, I just don't know how valuable the trace output is without some raw
data like the mm_struct.

>   Any userland tools can't parse it. (/proc/kcore don't help this situation,
>   the pointer might be freed before parsing)
> - Please makes patch series. one big patch is harder review.

OK.

> - Please write patch description and use-case.

OK.

> - Please consider how do this feature works on mem-cgroup.
>   (IOW, please don't ignore many "if (scanning_global_lru())")
> - tracepoint caller shouldn't have any assumption of displaying representation.
>   e.g.
>     wrong)  trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
>     good)   trace_mm_pagereclaim_pgout(mapping, page)

OK.

>   that's general and good callback and/or hook manner.
> 
> 
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Patch] mm tracepoints update - use case.
  2009-06-18 19:22                 ` Larry Woodman
@ 2009-06-18 19:40                   ` Rik van Riel
  -1 siblings, 0 replies; 37+ messages in thread
From: Rik van Riel @ 2009-06-18 19:40 UTC (permalink / raw)
  To: Larry Woodman
  Cc: KOSAKI Motohiro, Ingo Molnar, Frédéric Weisbecker,
	Li Zefan, Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm,
	rostedt

Larry Woodman wrote:

>> - Please don't display mm and/or another kernel raw pointer.
>>   if we assume non stop system, we can't use kernel-dump. Thus kernel pointer
>>   logging is not so useful.
> 
> OK, I just don't know how valuable the trace output is without some raw
> data like the mm_struct.

I believe that we do want something like the mm_struct in
the trace info, so we can figure out which process was
allocating pages, etc...

>> - Please consider how do this feature works on mem-cgroup.
>>   (IOW, please don't ignore many "if (scanning_global_lru())")

Good point, we want to trace cgroup vs non-cgroup reclaims,
too.

>> - tracepoint caller shouldn't have any assumption of displaying representation.
>>   e.g.
>>     wrong)  trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
>>     good)   trace_mm_pagereclaim_pgout(mapping, page)
> 
> OK.
> 
>>   that's general and good callback and/or hook manner.

How do we figure those out from the page pointer at the time
the tracepoint triggers?

I believe that it would be useful to export that info in the
trace point, since we cannot expect the userspace trace tool
to figure out these things from the struct page address.

Or did I overlook something here?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Patch] mm tracepoints update - use case.
  2009-06-18 19:22                 ` Larry Woodman
@ 2009-06-22  3:37                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2009-06-22  3:37 UTC (permalink / raw)
  To: Larry Woodman
  Cc: kosaki.motohiro, Rik van Riel, Ingo Molnar,
	Frédéric Weisbecker, Li Zefan, Pekka Enberg,
	eduard.munteanu, linux-kernel, linux-mm, rostedt

Hi

> Thanks for the feedback Kosaki!
> 
> 
> > Scenario 1. OOM killer happend. why? and who bring it?
> 
> Doesn't the showmem() and stack trace to the console when the OOM kill
> occurred show enough in the majority of cases?  I realize that direct
> alloc_pages() calls are not accounted for here but that can be really
> invasive.

showmem() displays the _result_ of memory usage and fragmentation,
but an administrator often needs to know the _reason_.

Plus, kmemtrace already traces slab allocate/free activity.
Do you mean you think this is really invasive?


> > Scenario 2. page allocation failure by memory fragmentation
> 
> Are you talking about order>0 allocation failures here?  Most of the
> slabs are single page allocations now.

Yes, order>0.
But I'm confused. Why do you talk about slab, not the page allocator?

Note, non-x86 architectures frequently use order-1 allocations for
making stacks.



> > Scenario 3. try_to_free_pages() makes very long latency. why?
> 
> This is available in the mm tracepoints, they all include timestamps.

Perhaps not.
An administrator needs to know the reason, not the accumulated time; the time is only the result.

We can guess at some reasons:
  - IO congestion
  - memory being consumed faster than reclaim can keep up
  - memory fragmentation

But it's only a guess. We often need to get real data.


> > Scenario 4. sar output that free memory dramatically reduced at 10 minute ago, and
> >             it already recover now. What's happen?
> 
> Is this really important?  It would take buffering lots of data to
> figure out what happened in the past.

OK, my scenario description is a bit wrong.

If a userland process explicitly consumes memory or explicitly writes
a lot of data, that is true.

Is this more appropriate?

"A userland process takes the same action periodically, but free memory
dropped only 10 minutes ago. Why?"



> >   - suspects
> >     - kernel memory leak
> 
> Other than direct callers to the page allocator, isn't that covered with
> the kmemtrace stuff?

Yeah.
Perhaps enhancing kmemtrace to cover the page allocator is a good approach.


> >     - userland memory leak
> 
> The mm tracepoints track all user space allocations and frees (perhaps
> too many?).

hmhm.


> 
> >     - stupid driver use too much memory
> 
> hopefully kmemtrace will catch this?

ditto.
I agree that a kmemtrace enhancement is a good idea.

> 
> >     - userland application suddenly start to use much memory
> 
> The mm tracepoints track all user space allocations and frees.

ok.


> >   - what information are valuable?
> >     - slab usage information (kmemtrace already does)
> >     - page allocator usage information
> >     - rss of all processes at oom happend
> >     - why recent try_to_free_pages() can't reclaim any page?
> 
> The counters in the mm tracepoints do give counts but not the reasons
> that the pagereclaim code fails.

That's a very important key point. Please don't ignore it.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Patch] mm tracepoints update - use case.
  2009-06-18 19:40                   ` Rik van Riel
@ 2009-06-22  3:37                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2009-06-22  3:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Larry Woodman, Ingo Molnar,
	Fr馘駻ic Weisbecker, Li Zefan, Pekka Enberg,
	eduard.munteanu, linux-kernel, linux-mm, rostedt

> Larry Woodman wrote:
> 
> >> - Please don't display mm and/or another kernel raw pointer.
> >>   if we assume non stop system, we can't use kernel-dump. Thus kernel pointer
> >>   logging is not so useful.
> > 
> > OK, I just don't know how valuable the trace output is without some raw
> > data like the mm_struct.
> 
> I believe that we do want something like the mm_struct in
> the trace info, so we can figure out which process was
> allocating pages, etc...

Yes.
I think we need to print the tgid; that also requires improving CONFIG_MM_OWNER,
since the current CONFIG_MM_OWNER back-pointer points to a semi-random task_struct.
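
E.g. the anon-fault event could record the owner's tgid instead of the raw
pointer. This fragment is only a sketch and assumes CONFIG_MM_OWNER=y so that
mm->owner is available; TP_printk would print the tgid accordingly:

-----------------------------------------------------
	TP_STRUCT__entry(
		__field(pid_t, tgid)
		__field(unsigned long, address)
	),

	TP_fast_assign(
		/* -1 if the owner task has already gone away */
		__entry->tgid = mm->owner ? mm->owner->tgid : -1;
		__entry->address = address;
	),
-----------------------------------------------------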


> >> - Please consider how do this feature works on mem-cgroup.
> >>   (IOW, please don't ignore many "if (scanning_global_lru())")
> 
> Good point, we want to trace cgroup vs non-cgroup reclaims,
> too.

thank you.

> 
> >> - tracepoint caller shouldn't have any assumption of displaying representation.
> >>   e.g.
> >>     wrong)  trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
> >>     good)   trace_mm_pagereclaim_pgout(mapping, page)
> > 
> > OK.
> > 
> >>   that's general and good callback and/or hook manner.
> 
> How do we figure those out from the page pointer at the time
> the tracepoint triggers?
> 
> I believe that it would be useful to export that info in the
> trace point, since we cannot expect the userspace trace tool
> to figure out these things from the struct page address.
> 
> Or did I overlook something here?

Currently, TRACE_EVENT has a two-step information transformation:

 - step 1 - TP_fast_assign()
   called directly from the tracepoint; it builds the ring-buffer representation.
 - step 2 - TP_printk
   called when reading the debug/tracing/trace file; it builds the printable
   representation from the ring-buffer data.

example:

trace_sched_switch() has three arguments: rq, prev, next.

--------------------------------------------------
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next)
{
(snip)
        trace_sched_switch(rq, prev, next);

-------------------------------------------------

TP_fast_assign extracts the data from the argument pointers.
-----------------------------------------------------
        TP_fast_assign(
                memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
                __entry->prev_pid       = prev->pid;
                __entry->prev_prio      = prev->prio;
                __entry->prev_state     = prev->state;
                memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
                __entry->next_pid       = next->pid;
                __entry->next_prio      = next->prio;
        ),
-----------------------------------------------------


I think the mm tracepoints can work the same way.
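
Applied to the pgout example above, the caller would pass (mapping, page) and
the event itself would derive the displayed fields. This is a sketch of the
suggested convention, not code from the patch:

-----------------------------------------------------
TRACE_EVENT(mm_pagereclaim_pgout,

	TP_PROTO(struct address_space *mapping, struct page *page),

	TP_ARGS(mapping, page),

	TP_STRUCT__entry(
		__field(struct address_space *, mapping)
		__field(unsigned long, offset)
		__field(int, anon)
	),

	TP_fast_assign(
		__entry->mapping = mapping;
		__entry->offset = page->index << PAGE_SHIFT;
		__entry->anon = PageAnon(page);
	),

	TP_printk("mapping=%lx offset=%lx %s",
		(unsigned long)__entry->mapping, __entry->offset,
		__entry->anon ? "anonymous" : "pagecache")
	);
-----------------------------------------------------

The call site then shrinks to trace_mm_pagereclaim_pgout(mapping, page).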





^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Patch] mm tracepoints update - use case.
  2009-06-22  3:37                     ` KOSAKI Motohiro
@ 2009-06-22 15:04                       ` Larry Woodman
  0 siblings, 0 replies; 37+ messages in thread
From: Larry Woodman @ 2009-06-22 15:04 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Ingo Molnar, Frédéric Weisbecker,
	Li Zefan, Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm,
	rostedt

On Mon, 2009-06-22 at 12:37 +0900, KOSAKI Motohiro wrote:

Thanks for the feedback KOSAKI!


> > Larry Woodman wrote:
> > 
> > >> - Please don't display the mm and/or other raw kernel pointers.
> > >>   If we assume a non-stop system, we can't use a kernel dump, so logging
> > >>   kernel pointers is not very useful.
> > > 
> > > OK, I just don't know how valuable the trace output is without some raw
> > > data like the mm_struct.
> > 
> > I believe that we do want something like the mm_struct in
> > the trace info, so we can figure out which process was
> > allocating pages, etc...
> 
> Yes.
> I think we need to print the tgid; it is needed to improve CONFIG_MM_OWNER.
> The current CONFIG_MM_OWNER back-pointer points to a semi-random task_struct.

All of the tracepoints contain command, pid, CPU, timestamp, and
tracepoint-name information.  Are you saying I should capture more
information, such as the tgid, in specific mm tracepoints, and if the
answer is yes, what would we need it for?


cat-10962 [005]  1877.984589: mm_anon_fault:
cat-10962 [005]  1877.984638: mm_anon_fault:
cat-10962 [005]  1877.984658: sched_switch:
cat-10962 [005]  1877.988359: sched_switch:
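
(For reference, capturing the tgid would only mean one more field in the
event definition.  A minimal sketch with an assumed prototype -- the
fields here are illustrative, not what the posted patch does:)
-----------------------------------------------------
TRACE_EVENT(mm_anon_fault,

        TP_PROTO(struct mm_struct *mm, unsigned long address),

        TP_ARGS(mm, address),

        TP_STRUCT__entry(
                __field(unsigned long, address)
                __field(pid_t, tgid)
        ),

        TP_fast_assign(
                __entry->address = address;
                /* thread group id of the faulting task */
                __entry->tgid    = current->tgid;
        ),

        TP_printk("address=%lx tgid=%d", __entry->address, __entry->tgid)
);
-----------------------------------------------------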

> 
> 
> > >> - Please consider how this feature works with the memory cgroup.
> > >>   (IOW, please don't ignore the many "if (scanning_global_lru())" checks.)
> > 
> > Good point, we want to trace cgroup vs non-cgroup reclaims,
> > too.
> 
> Thank you.

All of the mm tracepoints are located above the cgroup-specific calls.
This means that they will capture exactly the same data regardless of
whether cgroups are used or not.  Are you saying I should capture
whether the data was specific to a cgroup or came from the global
LRUs?
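
(If that distinction is wanted, the reclaim events could carry a flag
computed at the call site, e.g.
trace_mm_pagereclaim_shrinkzone(nr_reclaimed, scanning_global_lru(sc)).
A rough sketch with illustrative names, not the posted patch:)
-----------------------------------------------------
TRACE_EVENT(mm_pagereclaim_shrinkzone,

        TP_PROTO(unsigned long reclaimed, int global_lru),

        TP_ARGS(reclaimed, global_lru),

        TP_STRUCT__entry(
                __field(unsigned long, reclaimed)
                __field(int, global_lru)
        ),

        TP_fast_assign(
                __entry->reclaimed  = reclaimed;
                /* 0 means this pass was mem-cgroup reclaim */
                __entry->global_lru = global_lru;
        ),

        TP_printk("reclaimed=%lu %s", __entry->reclaimed,
                __entry->global_lru ? "global" : "cgroup")
);
-----------------------------------------------------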

  
> 
> > 
> > >> - The tracepoint caller shouldn't make any assumptions about the
> > >>   display representation.
> > >>   e.g.
> > >>     wrong)  trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
> > >>     good)   trace_mm_pagereclaim_pgout(mapping, page)
> > > 
> > > OK.
> > > 
> > >>   that's the general, good manner for callbacks and/or hooks.
> > 
> > How do we figure those out from the page pointer at the time
> > the tracepoint triggers?
> > 
> > I believe that it would be useful to export that info in the
> > trace point, since we cannot expect the userspace trace tool
> > to figure out these things from the struct page address.
> > 
> > Or did I overlook something here?
> 
> Currently, TRACE_EVENT has a two-step information transformation:
> 
>  - step 1 - TP_fast_assign()
>    It is called directly from the tracepoint. It builds the ring-buffer representation.
>  - step 2 - TP_printk
>    It is called when the debug/tracing/trace file is read. It builds the printable
>    representation from the ring-buffer data.
> 
> Example:
> 
> trace_sched_switch() has three arguments: rq, prev, and next.
> 
> --------------------------------------------------
> static inline void
> context_switch(struct rq *rq, struct task_struct *prev,
>                struct task_struct *next)
> {
> (snip)
>         trace_sched_switch(rq, prev, next);
> 
> -------------------------------------------------
> 
> TP_fast_assign() extracts the data from the argument pointers:
> -----------------------------------------------------
>         TP_fast_assign(
>                 memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
>                 __entry->prev_pid       = prev->pid;
>                 __entry->prev_prio      = prev->prio;
>                 __entry->prev_state     = prev->state;
>                 memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
>                 __entry->next_pid       = next->pid;
>                 __entry->next_prio      = next->prio;
>         ),
> -----------------------------------------------------
> 
> 
> I think the mm tracepoints can work the same way.

The sched_switch tracepoint tells us the names of the outgoing and
incoming processes during a context switch, so this information is very
significant to that tracepoint.  Which mm tracepoint would I need to add
such information to without it being redundant?

Thanks, Larry Woodman

> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Patch] mm tracepoints update - use case.
  2009-06-22  3:37                   ` KOSAKI Motohiro
@ 2009-06-22 15:28                     ` Larry Woodman
  0 siblings, 0 replies; 37+ messages in thread
From: Larry Woodman @ 2009-06-22 15:28 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Ingo Molnar, Frédéric Weisbecker,
	Li Zefan, Pekka Enberg, eduard.munteanu, linux-kernel, linux-mm,
	rostedt

On Mon, 2009-06-22 at 12:37 +0900, KOSAKI Motohiro wrote:

Thanks for the feedback Kosaki! 

> Hi
> 
> > Thanks for the feedback Kosaki!
> > 
> > 
> > > Scenario 1. The OOM killer happened. Why, and who caused it?
> > 
> > Doesn't the showmem() output and the stack trace sent to the console when
> > the OOM kill occurred show enough in the majority of cases?  I realize
> > that direct alloc_pages() calls are not accounted for here, but covering
> > them can be really invasive.
> 
> showmem() displays the _result_ of memory usage and fragmentation,
> but the administrator often needs to know the _reason_.

Right, that's why I have mm tracepoints in locations like shrink_zone,
shrink_active and shrink_inactive, so we can drill down into exactly what
happened when either kswapd ran or a direct reclaim occurred out of the
page allocator.  Since we will know the timestamps and the numbers of
pages scanned and reclaimed, we can tell why page reclamation did not
supply enough pages and therefore why the OOM occurred.

Do you think this is enough information, or do you think we need more?
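
(As a concrete illustration, those reclaim events mainly need the scan
and reclaim counts.  A minimal sketch with an assumed prototype -- the
posted patch may carry different fields:)
-----------------------------------------------------
TRACE_EVENT(mm_pagereclaim_shrinkinactive,

        TP_PROTO(unsigned long scanned, unsigned long reclaimed, int priority),

        TP_ARGS(scanned, reclaimed, priority),

        TP_STRUCT__entry(
                __field(unsigned long, scanned)
                __field(unsigned long, reclaimed)
                __field(int, priority)
        ),

        TP_fast_assign(
                __entry->scanned   = scanned;
                __entry->reclaimed = reclaimed;
                __entry->priority  = priority;
        ),

        /* scanned far above reclaimed at low priority is the signature
           of reclaim stalling before an OOM */
        TP_printk("scanned=%lu reclaimed=%lu priority=%d",
                __entry->scanned, __entry->reclaimed, __entry->priority)
);
-----------------------------------------------------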

> 
> Plus, kmemtrace already traces slab allocate/free activity.
> Do you mean you think this is really invasive?
> 
> 
> > > Scenario 2. page allocation failure by memory fragmentation
> > 
> > Are you talking about order>0 allocation failures here?  Most of the
> > slabs are single page allocations now.
> 
> Yes, order>0.
> But I'm confused: why do you talk about slab, not the page allocator?
> 
> Note, non-x86 architectures frequently use order-1 allocations for
> making stacks.

OK, I can add a tracepoint in the lumpy reclaim logic for when it fails to
get enough contiguous memory to satisfy a high-order allocation.
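
(A rough sketch of such a failure-path event; the name and the hook
placement are hypothetical:)
-----------------------------------------------------
/* fired only when lumpy reclaim gives up on a high-order request,
   e.g. trace_mm_lumpy_reclaim_failed(sc->order) in the failure path */
TRACE_EVENT(mm_lumpy_reclaim_failed,

        TP_PROTO(int order),

        TP_ARGS(order),

        TP_STRUCT__entry(
                __field(int, order)
        ),

        TP_fast_assign(
                __entry->order = order;
        ),

        TP_printk("order=%d", __entry->order)
);
-----------------------------------------------------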

> 
> 
> 
> > > Scenario 3. try_to_free_pages() causes very long latency. Why?
> > 
> > This is available in the mm tracepoints; they all include timestamps.
> 
> Perhaps not.
> The administrator needs to know the reason, not the accumulated time;
> that is just the result.
> 
> We can guess at some reasons:
>   - IO congestion

This can be seen when the number of page scans is significantly greater
than the number of page frees and pageouts.  Do you think we need to
combine these tracepoints, or add one to throttle_vm_writeout() for when
it needs to stall?
 
>   - memory is consumed faster than it is reclaimed

The anonymous and filemapped tracepoints combined with the reclaim
tracepoints will tell us this.  Do you think we need more tracepoints to
pinpoint when allocations outpace reclamations?

>   - memory fragmentation

Would adding the order to the page_allocation tracepoint satisfy this?
Currently this tracepoint only triggers when the allocation fails and we
need to reclaim memory.  Another option would be to include the order
information in the direct reclaim tracepoint so we can tell whether it was
triggered by memory fragmentation.  Sorry, but I haven't seen many
cases in which fragmented memory caused failures.

> 
> But that's only a guess; we often need to get the data.
> 
> 
> > > Scenario 4. sar output shows that free memory dropped dramatically 10
> > >             minutes ago and has already recovered. What happened?
> > 
> > Is this really important?  It would take buffering a lot of data to
> > figure out what happened in the past.
> 
> OK, my scenario description was a bit wrong.
> 
> If a userland process explicitly consumes memory or explicitly writes
> a lot of data, that is true.
> 
> Is this more appropriate?
> 
> "A userland process takes the same action periodically, but free memory
> dropped only 10 minutes ago. Why?"
> 
We could have a user-space script that enabled specific tracepoints when
it noticed free pages falling below some threshold and disabled them when
free pages climbed back above some other threshold.  Would this help?
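
(A minimal sketch of such a watcher, written in C, assuming debugfs is
mounted at /sys/kernel/debug and using tracing_on as the master switch;
the thresholds are made up:)
-----------------------------------------------------
#include <stdio.h>
#include <unistd.h>

#define TRACING_ON "/sys/kernel/debug/tracing/tracing_on"

/* read MemFree (in kB) from /proc/meminfo */
static long memfree_kb(void)
{
        char line[128];
        long kb = -1;
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "MemFree: %ld kB", &kb) == 1)
                        break;
        fclose(f);
        return kb;
}

static void set_tracing(int on)
{
        FILE *f = fopen(TRACING_ON, "w");

        if (f) {
                fputc(on ? '1' : '0', f);
                fclose(f);
        }
}

int main(void)
{
        const long low = 50 * 1024, high = 100 * 1024;  /* kB, illustrative */
        int tracing = 0;

        for (;;) {
                long free = memfree_kb();

                if (!tracing && free >= 0 && free < low) {
                        set_tracing(1);         /* pressure: start capturing */
                        tracing = 1;
                } else if (tracing && free > high) {
                        set_tracing(0);         /* recovered: stop capturing */
                        tracing = 0;
                }
                sleep(1);
        }
        return 0;
}
-----------------------------------------------------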

> 
> 
> > >   - suspects
> > >     - kernel memory leak
> > 
> > Other than direct callers to the page allocator, isn't that covered by
> > the kmemtrace stuff?
> 
> Yeah.
> Perhaps enhancing kmemtrace to cover the page allocator is a good approach.
> 
> 
> > >     - userland memory leak
> > 
> > The mm tracepoints track all user-space allocations and frees (perhaps
> > too many?).
> 
> hmhm.

Is this a yes?  Would the user-space script described above help?

> 
> 
> > 
> > >     - a stupid driver uses too much memory
> > 
> > Hopefully kmemtrace will catch this?
> 
> Ditto.
> I agree that enhancing kmemtrace is a good idea.
> 
> > 
> > >     - a userland application suddenly starts to use a lot of memory
> > 
> > The mm tracepoints track all user-space allocations and frees.
> 
> OK.
> 
> 
> > >   - what information is valuable?
> > >     - slab usage information (kmemtrace already does this)
> > >     - page allocator usage information
> > >     - rss of all processes when the OOM happened
> > >     - why the recent try_to_free_pages() couldn't reclaim any pages
> > 
> > The counters in the mm tracepoints do give counts, but not the reasons
> > that the page reclaim code fails.
> 
> That's a very important key point; please don't ignore it.

OK, would you suggest changing the code to count failures, or simply
adding a tracepoint to the failure path, which would potentially capture
lots more data?

> 
> 
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Patch] mm tracepoints update - use case.
  2009-06-22 15:04                       ` Larry Woodman
@ 2009-06-23  5:52                         ` KOSAKI Motohiro
  0 siblings, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2009-06-23  5:52 UTC (permalink / raw)
  To: Larry Woodman
  Cc: kosaki.motohiro, Rik van Riel, Ingo Molnar,
	Frédéric Weisbecker, Li Zefan, Pekka Enberg,
	eduard.munteanu, linux-kernel, linux-mm, rostedt

> On Mon, 2009-06-22 at 12:37 +0900, KOSAKI Motohiro wrote:
> 
> Thanks for the feedback KOSAKI!
> 
> 
> > > Larry Woodman wrote:
> > > 
> > > >> - Please don't display the mm and/or other raw kernel pointers.
> > > >>   If we assume a non-stop system, we can't use a kernel dump, so logging
> > > >>   kernel pointers is not very useful.
> > > > 
> > > > OK, I just don't know how valuable the trace output is without some raw
> > > > data like the mm_struct.
> > > 
> > > I believe that we do want something like the mm_struct in
> > > the trace info, so we can figure out which process was
> > > allocating pages, etc...
> > 
> > Yes.
> > I think we need to print the tgid; it is needed to improve CONFIG_MM_OWNER.
> > The current CONFIG_MM_OWNER back-pointer points to a semi-random task_struct.
> 
> All of the tracepoints contain command, pid, CPU, timestamp, and
> tracepoint-name information.  Are you saying I should capture more
> information, such as the tgid, in specific mm tracepoints, and if the
> answer is yes, what would we need it for?
> 
> 
> cat-10962 [005]  1877.984589: mm_anon_fault:
> cat-10962 [005]  1877.984638: mm_anon_fault:
> cat-10962 [005]  1877.984658: sched_switch:
> cat-10962 [005]  1877.988359: sched_switch:

This is sufficient in almost all cases, but there are a few exceptions.

The ftrace common header logs current->pid, but kswapd steals pages
from other processes; we are interested in the victim process, not the
kswapd pid.  (e.g. please see your trace_mm_anon_unmap())
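
(Under CONFIG_MM_OWNER the unmap event could log the victim through the
mm instead of current.  A rough sketch of just the relevant
TP_fast_assign() step, with illustrative field names -- and, as noted
above, the owner back-pointer itself still needs improving:)
-----------------------------------------------------
        TP_fast_assign(
                __entry->address     = address;
                /* the victim's thread group, not current (which may be
                   kswapd); assumes mm->owner is valid, i.e. CONFIG_MM_OWNER */
                __entry->victim_tgid = mm->owner ? mm->owner->tgid : -1;
        ),
-----------------------------------------------------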


> > > >> - Please consider how this feature works with the memory cgroup.
> > > >>   (IOW, please don't ignore the many "if (scanning_global_lru())" checks.)
> > > 
> > > Good point, we want to trace cgroup vs non-cgroup reclaims,
> > > too.
> > 
> > Thank you.
> 
> All of the mm tracepoints are located above the cgroup-specific calls.
> This means that they will capture exactly the same data regardless of
> whether cgroups are used or not.  Are you saying I should capture
> whether the data was specific to a cgroup or came from the global
> LRUs?

Yes and no.

For example, if cgroup reclaim occurs frequently, it means the
administrator set the memory limit wrongly; but if global reclaim occurs
frequently, it means we need to add physical memory.

I mean, cgroup-or-not is a major piece of information for the analysis,
and perhaps the cgroup path name is also useful.



> > > >> - The tracepoint caller shouldn't make any assumptions about the
> > > >>   display representation.
> > > >>   e.g.
> > > >>     wrong)  trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
> > > >>     good)   trace_mm_pagereclaim_pgout(mapping, page)
> > > > 
> > > > OK.
> > > > 
> > > >>   that's the general, good manner for callbacks and/or hooks.
> > > 
> > > How do we figure those out from the page pointer at the time
> > > the tracepoint triggers?
> > > 
> > > I believe that it would be useful to export that info in the
> > > trace point, since we cannot expect the userspace trace tool
> > > to figure out these things from the struct page address.
> > > 
> > > Or did I overlook something here?
> > 
> > Currently, TRACE_EVENT has a two-step information transformation:
> > 
> >  - step 1 - TP_fast_assign()
> >    It is called directly from the tracepoint. It builds the ring-buffer representation.
> >  - step 2 - TP_printk
> >    It is called when the debug/tracing/trace file is read. It builds the printable
> >    representation from the ring-buffer data.
> > 
> > Example:
> > 
> > trace_sched_switch() has three arguments: rq, prev, and next.
> > 
> > --------------------------------------------------
> > static inline void
> > context_switch(struct rq *rq, struct task_struct *prev,
> >                struct task_struct *next)
> > {
> > (snip)
> >         trace_sched_switch(rq, prev, next);
> > 
> > -------------------------------------------------
> > 
> > TP_fast_assign() extracts the data from the argument pointers:
> > -----------------------------------------------------
> >         TP_fast_assign(
> >                 memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> >                 __entry->prev_pid       = prev->pid;
> >                 __entry->prev_prio      = prev->prio;
> >                 __entry->prev_state     = prev->state;
> >                 memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> >                 __entry->next_pid       = next->pid;
> >                 __entry->next_prio      = next->prio;
> >         ),
> > -----------------------------------------------------
> > 
> > 
> > I think the mm tracepoints can work the same way.
> 
> The sched_switch tracepoint tells us the names of the outgoing and
> incoming processes during a context switch, so this information is very
> significant to that tracepoint.  Which mm tracepoint would I need to add
> such information to without it being redundant?

Perhaps I missed your meaning.
I only pointed out that the mm tracepoints can reduce their number of
arguments.

I am not saying to increase or decrease the displayed information.

Maybe my explanation was wrong; my English is very poor, sorry.





^ permalink raw reply	[flat|nested] 37+ messages in thread

Thread overview: 20 messages
2009-04-21 22:45 [Patch] mm tracepoints update Larry Woodman
2009-04-22  1:00 ` KOSAKI Motohiro
2009-04-22  9:57   ` Ingo Molnar
2009-04-22 12:07     ` Larry Woodman
2009-04-22 19:22       ` [Patch] mm tracepoints update - use case Larry Woodman
2009-04-23  0:48         ` KOSAKI Motohiro
2009-04-23  4:50           ` Andrew Morton
2009-04-23  8:42             ` Ingo Molnar
2009-04-23 11:47               ` Larry Woodman
2009-04-24 20:48                 ` Larry Woodman
2009-06-15 18:26           ` Rik van Riel
2009-06-17 14:07             ` Larry Woodman
2009-06-18  7:57             ` KOSAKI Motohiro
2009-06-18 19:22               ` Larry Woodman
2009-06-18 19:40                 ` Rik van Riel
2009-06-22  3:37                   ` KOSAKI Motohiro
2009-06-22 15:04                     ` Larry Woodman
2009-06-23  5:52                       ` KOSAKI Motohiro
2009-06-22  3:37                 ` KOSAKI Motohiro
2009-06-22 15:28                   ` Larry Woodman
