* [PATCH 0/8] export more page flags in /proc/kpageflags (take 6)
@ 2009-05-08 10:53 ` Wu Fengguang
  0 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Matt Mackall, KOSAKI Motohiro, Wu Fengguang, Andi Kleen, linux-mm

Andrew,

Can you merge this patchset? There should be no more concerns :-)
The last patch may be deferred until the hwpoison merge.

take 6:

- show more help text in page-types
- separate out PG_hwpoison
- comment on KPF_MMAP

take 5:

- add page-types tool for querying the exported page flags.
- export all page flags unconditionally and faithfully, and offload the
  complicated filtering work to the user-space tool.

This patchset:

Export 10 more flags to end users (and more for kernel developers):

        11. KPF_MMAP            (pseudo flag) memory mapped page
        12. KPF_ANON            (pseudo flag) memory mapped page (anonymous)
        13. KPF_SWAPCACHE       page is in swap cache
        14. KPF_SWAPBACKED      page is swap/RAM backed
        15. KPF_COMPOUND_HEAD   (*)
        16. KPF_COMPOUND_TAIL   (*)
        17. KPF_HUGE            hugeTLB pages
        18. KPF_UNEVICTABLE     page is in the unevictable LRU list
        19. KPF_HWPOISON        hardware detected corruption
        20. KPF_NOPAGE          (pseudo flag) no page frame at the address

        (*) For compound pages, exporting _both_ head and tail info enables
            users to tell where a compound page starts and ends, and hence
            its order.
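        For example, given the kpageflags words of consecutive PFNs, the
        head/tail bits can be folded into a span (and hence an order).  A
        minimal user-space sketch, assuming only the bit positions listed
        above (compound_head = 15, compound_tail = 16); compound_span() is
        an illustrative helper, not part of this patchset:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KPF_COMPOUND_HEAD	15
#define KPF_COMPOUND_TAIL	16

/*
 * Given kpageflags words for consecutive PFNs starting at a compound
 * head, return the number of pages in the compound page (head plus
 * the run of following tails); 0 if flags[0] is not a head.
 * The order is log2 of the returned span.
 */
static size_t compound_span(const uint64_t *flags, size_t n)
{
	size_t i;

	if (n == 0 || !(flags[0] & (1ULL << KPF_COMPOUND_HEAD)))
		return 0;
	for (i = 1; i < n; i++)
		if (!(flags[i] & (1ULL << KPF_COMPOUND_TAIL)))
			break;
	return i;
}
```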

Patches:

[PATCH 1/8] mm: introduce PageHuge() for testing huge/gigantic pages
[PATCH 2/8] slob: use PG_slab for identifying SLOB pages
[PATCH 3/8] proc: kpagecount/kpageflags code cleanup
[PATCH 4/8] proc: export more page flags in /proc/kpageflags
[PATCH 5/8] pagemap: document clarifications
[PATCH 6/8] pagemap: document 9 more exported page flags
[PATCH 7/8] pagemap: add page-types tool
[PATCH 8/8] pagemap: export PG_hwpoison

 Documentation/vm/Makefile     |    2 
 Documentation/vm/page-types.c |  700 ++++++++++++++++++++++++++++++++
 Documentation/vm/pagemap.txt  |   72 +++
 fs/proc/page.c                |  166 ++++++-
 include/linux/mm.h            |   24 +
 include/linux/page-flags.h    |    2 
 mm/hugetlb.c                  |    2 
 mm/page_alloc.c               |   11 
 mm/slob.c                     |    6 
 9 files changed, 940 insertions(+), 45 deletions(-)

Thanks,
Fengguang
--

a simple demo of the page-types tool

# ./page-types -h
page-types [options]
            -r|--raw                  Raw mode, for kernel developers
            -a|--addr    addr-spec    Walk a range of pages
            -b|--bits    bits-spec    Walk pages with specified bits
            -l|--list                 Show page details in ranges
            -L|--list-each            Show page details one by one
            -N|--no-summary           Don't show summary info
            -h|--help                 Show this usage message
addr-spec:
            N                         one page at offset N (unit: pages)
            N+M                       pages range from N to N+M-1
            N,M                       pages range from N to M-1
            N,                        pages range from N to end
            ,M                        pages range from 0 to M-1
bits-spec:
            bit1,bit2                 (flags & (bit1|bit2)) != 0
            bit1,bit2=bit1            (flags & (bit1|bit2)) == bit1
            bit1,~bit2                (flags & (bit1|bit2)) == bit1
            =bit1,bit2                flags == (bit1|bit2)
bit-names:
          locked              error         referenced           uptodate   
           dirty                lru             active               slab   
       writeback            reclaim              buddy               mmap   
       anonymous          swapcache         swapbacked      compound_head   
   compound_tail               huge        unevictable           hwpoison   
          nopage           reserved(r)         mlocked(r)    mappedtodisk(r)
         private(r)       private_2(r)   owner_private(r)            arch(r)
        uncached(r)       readahead(o)       slob_free(o)     slub_frozen(o)
      slub_debug(o)
                                   (r) raw mode bits  (o) overloaded bits
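Each bits-spec form above reduces to a simple mask test: the first form is
an any-of match, the other three are an exact comparison under a mask.  A
hedged sketch of that matching logic (illustrative, not the tool's actual
parser):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* "bit1,bit2": match if any of the named bits is set. */
static bool match_any(uint64_t flags, uint64_t mask)
{
	return (flags & mask) != 0;
}

/*
 * "bit1,bit2=bit1" and "bit1,~bit2": the masked bits must equal value.
 * "=bit1,bit2" is the same test with mask = ~0 (all bits significant).
 */
static bool match_exact(uint64_t flags, uint64_t mask, uint64_t value)
{
	return (flags & mask) == value;
}
```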


# ./page-types
             flags      page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000000000          487369     1903  _________________________________
0x0000000000000014               5        0  __R_D____________________________  referenced,dirty
0x0000000000000020               1        0  _____l___________________________  lru
0x0000000000000024              34        0  __R__l___________________________  referenced,lru
0x0000000000000028            3838       14  ___U_l___________________________  uptodate,lru
0x0001000000000028              48        0  ___U_l_______________________I___  uptodate,lru,readahead
0x000000000000002c            6478       25  __RU_l___________________________  referenced,uptodate,lru
0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
0x0000000000000040            8344       32  ______A__________________________  active
0x0000000000000060               1        0  _____lA__________________________  lru,active
0x0000000000000068             348        1  ___U_lA__________________________  uptodate,lru,active
0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
0x000000000000006c             988        3  __RU_lA__________________________  referenced,uptodate,lru,active
0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
0x0000000000000400             503        1  __________B______________________  buddy
0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000001000             492        1  ____________a____________________  anonymous
0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c              30        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total          513968     2007
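Reading the flags column: each hex word is simply the KPF bits ORed
together, so e.g. 0x5868 decodes to
uptodate,lru,active,mmap,anonymous,swapbacked.  A small illustrative
decoder, assuming the bit positions implied by the table above
(locked = 0 through nopage = 20); not taken from the tool itself:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* KPF bit positions 0..20 as implied by the table above. */
static const char * const kpf_names[] = {
	"locked", "error", "referenced", "uptodate", "dirty", "lru",
	"active", "slab", "writeback", "reclaim", "buddy", "mmap",
	"anonymous", "swapcache", "swapbacked", "compound_head",
	"compound_tail", "huge", "unevictable", "hwpoison", "nopage",
};

/* Render a kpageflags word as a comma-separated name list. */
static char *decode_flags(uint64_t flags, char *buf, size_t len)
{
	size_t i;

	buf[0] = '\0';
	for (i = 0; i < sizeof(kpf_names) / sizeof(kpf_names[0]); i++) {
		if (!(flags & (1ULL << i)))
			continue;
		if (buf[0])
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, kpf_names[i], len - strlen(buf) - 1);
	}
	return buf;
}
```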


# ./page-types -r
             flags      page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000000000          468002     1828  _________________________________
0x0000000100000000           19102       74  _____________________r___________  reserved
0x0000000000008000              41        0  _______________H_________________  compound_head
0x0000000000010000             188        0  ________________T________________  compound_tail
0x0000000000008014               1        0  __R_D__________H_________________  referenced,dirty,compound_head
0x0000000000010014               4        0  __R_D___________T________________  referenced,dirty,compound_tail
0x0000000000000020               1        0  _____l___________________________  lru
0x0000000800000024              34        0  __R__l__________________P________  referenced,lru,private
0x0000000000000028            3794       14  ___U_l___________________________  uptodate,lru
0x0001000000000028              46        0  ___U_l_______________________I___  uptodate,lru,readahead
0x0000000400000028              44        0  ___U_l_________________d_________  uptodate,lru,mappedtodisk
0x0001000400000028               2        0  ___U_l_________________d_____I___  uptodate,lru,mappedtodisk,readahead
0x000000000000002c            6434       25  __RU_l___________________________  referenced,uptodate,lru
0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
0x000000040000002c              14        0  __RU_l_________________d_________  referenced,uptodate,lru,mappedtodisk
0x000000080000002c              30        0  __RU_l__________________P________  referenced,uptodate,lru,private
0x0000000800000040            8124       31  ______A_________________P________  active,private
0x0000000000000040             219        0  ______A__________________________  active
0x0000000800000060               1        0  _____lA_________________P________  lru,active,private
0x0000000000000068             322        1  ___U_lA__________________________  uptodate,lru,active
0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
0x0000000400000068              13        0  ___U_lA________________d_________  uptodate,lru,active,mappedtodisk
0x0000000800000068              12        0  ___U_lA_________________P________  uptodate,lru,active,private
0x000000000000006c             977        3  __RU_lA__________________________  referenced,uptodate,lru,active
0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
0x000000040000006c               5        0  __RU_lA________________d_________  referenced,uptodate,lru,active,mappedtodisk
0x000000080000006c               3        0  __RU_lA_________________P________  referenced,uptodate,lru,active,private
0x0000000c0000006c               3        0  __RU_lA________________dP________  referenced,uptodate,lru,active,mappedtodisk,private
0x0000000c00000068               1        0  ___U_lA________________dP________  uptodate,lru,active,mappedtodisk,private
0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
0x0000000000000400             538        2  __________B______________________  buddy
0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000001000             492        1  ____________a____________________  anonymous
0x0000000000005008               2        0  ___U________a_b__________________  uptodate,anonymous,swapbacked
0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
0x000000000000580c               1        0  __RU_______Ma_b__________________  referenced,uptodate,mmap,anonymous,swapbacked
0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c              29        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total          513968     2007


# ./page-types --raw --list --no-summary --bits reserved
offset  count   flags
0       15      _____________________r___________
31      4       _____________________r___________
159     97      _____________________r___________
4096    2067    _____________________r___________
6752    2390    _____________________r___________
9355    3       _____________________r___________
9728    14526   _____________________r___________




* [PATCH 1/8] mm: introduce PageHuge() for testing huge/gigantic pages
  2009-05-08 10:53 ` Wu Fengguang
@ 2009-05-08 10:53   ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Matt Mackall, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: giga-page.patch --]
[-- Type: text/plain, Size: 2130 bytes --]

Introduce PageHuge(), which identifies huge/gigantic pages
by their dedicated compound destructor functions.
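
The underlying trick is generic: tag an object with a distinct destructor,
then identify its type later by comparing function pointers.  A standalone
analogue (names here are illustrative, not kernel API):

```c
#include <assert.h>

typedef void (*dtor_t)(void *);

/* Two identical-looking destructors; only their addresses differ. */
static void free_normal(void *p)	{ (void)p; }
static void free_huge_like(void *p)	{ (void)p; }

struct obj {
	dtor_t dtor;
};

/* Type test by destructor comparison, as PageHuge() does. */
static int obj_is_huge(const struct obj *o)
{
	return o->dtor == free_huge_like;
}
```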

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/mm.h |   24 ++++++++++++++++++++++++
 mm/hugetlb.c       |    2 +-
 mm/page_alloc.c    |   11 ++++++++++-
 3 files changed, 35 insertions(+), 2 deletions(-)

--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -299,13 +299,22 @@ void prep_compound_page(struct page *pag
 }
 
 #ifdef CONFIG_HUGETLBFS
+/*
+ * This (duplicated) destructor function distinguishes gigantic pages from
+ * normal compound pages.
+ */
+void free_gigantic_page(struct page *page)
+{
+	__free_pages_ok(page, compound_order(page));
+}
+
 void prep_compound_gigantic_page(struct page *page, unsigned long order)
 {
 	int i;
 	int nr_pages = 1 << order;
 	struct page *p = page + 1;
 
-	set_compound_page_dtor(page, free_compound_page);
+	set_compound_page_dtor(page, free_gigantic_page);
 	set_compound_order(page, order);
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -550,7 +550,7 @@ struct hstate *size_to_hstate(unsigned l
 	return NULL;
 }
 
-static void free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -355,6 +355,30 @@ static inline void set_compound_order(st
 	page[1].lru.prev = (void *)order;
 }
 
+#ifdef CONFIG_HUGETLBFS
+void free_huge_page(struct page *page);
+void free_gigantic_page(struct page *page);
+
+static inline int PageHuge(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	if (!PageCompound(page))
+		return 0;
+
+	page = compound_head(page);
+	dtor = get_compound_page_dtor(page);
+
+	return  dtor == free_huge_page ||
+		dtor == free_gigantic_page;
+}
+#else
+static inline int PageHuge(struct page *page)
+{
+	return 0;
+}
+#endif
+
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of

-- 



* [PATCH 2/8] slob: use PG_slab for identifying SLOB pages
  2009-05-08 10:53 ` Wu Fengguang
@ 2009-05-08 10:53   ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Matt Mackall, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: mm-slob-page-flag.patch --]
[-- Type: text/plain, Size: 1394 bytes --]

Use the standard PG_slab flag to identify SLOB pages, for consistency with
the other slab allocators.

Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/page-flags.h |    2 --
 mm/slob.c                  |    6 +++---
 2 files changed, 3 insertions(+), 5 deletions(-)

--- linux.orig/include/linux/page-flags.h
+++ linux/include/linux/page-flags.h
@@ -120,7 +120,6 @@ enum pageflags {
 	PG_savepinned = PG_dirty,
 
 	/* SLOB */
-	PG_slob_page = PG_active,
 	PG_slob_free = PG_private,
 
 	/* SLUB */
@@ -203,7 +202,6 @@ PAGEFLAG(SavePinned, savepinned);			/* X
 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 
-__PAGEFLAG(SlobPage, slob_page)
 __PAGEFLAG(SlobFree, slob_free)
 
 __PAGEFLAG(SlubFrozen, slub_frozen)
--- linux.orig/mm/slob.c
+++ linux/mm/slob.c
@@ -132,17 +132,17 @@ static LIST_HEAD(free_slob_large);
  */
 static inline int is_slob_page(struct slob_page *sp)
 {
-	return PageSlobPage((struct page *)sp);
+	return PageSlab((struct page *)sp);
 }
 
 static inline void set_slob_page(struct slob_page *sp)
 {
-	__SetPageSlobPage((struct page *)sp);
+	__SetPageSlab((struct page *)sp);
 }
 
 static inline void clear_slob_page(struct slob_page *sp)
 {
-	__ClearPageSlobPage((struct page *)sp);
+	__ClearPageSlab((struct page *)sp);
 }
 
 static inline struct slob_page *slob_page(const void *addr)

-- 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 3/8] proc: kpagecount/kpageflags code cleanup
  2009-05-08 10:53 ` Wu Fengguang
@ 2009-05-08 10:53   ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Matt Mackall, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: kpageflags-fix-out.patch --]
[-- Type: text/plain, Size: 1455 bytes --]

Move the increments of pfn and out to the bottom of the loop.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/proc/page.c |   17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

--- linux.orig/fs/proc/page.c
+++ linux/fs/proc/page.c
@@ -11,6 +11,7 @@
 
 #define KPMSIZE sizeof(u64)
 #define KPMMASK (KPMSIZE - 1)
+
 /* /proc/kpagecount - an array exposing page counts
  *
  * Each entry is a u64 representing the corresponding
@@ -32,20 +33,22 @@ static ssize_t kpagecount_read(struct fi
 		return -EINVAL;
 
 	while (count > 0) {
-		ppage = NULL;
 		if (pfn_valid(pfn))
 			ppage = pfn_to_page(pfn);
-		pfn++;
+		else
+			ppage = NULL;
 		if (!ppage)
 			pcount = 0;
 		else
 			pcount = page_mapcount(ppage);
 
-		if (put_user(pcount, out++)) {
+		if (put_user(pcount, out)) {
 			ret = -EFAULT;
 			break;
 		}
 
+		pfn++;
+		out++;
 		count -= KPMSIZE;
 	}
 
@@ -98,10 +101,10 @@ static ssize_t kpageflags_read(struct fi
 		return -EINVAL;
 
 	while (count > 0) {
-		ppage = NULL;
 		if (pfn_valid(pfn))
 			ppage = pfn_to_page(pfn);
-		pfn++;
+		else
+			ppage = NULL;
 		if (!ppage)
 			kflags = 0;
 		else
@@ -119,11 +122,13 @@ static ssize_t kpageflags_read(struct fi
 			kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
 			kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
 
-		if (put_user(uflags, out++)) {
+		if (put_user(uflags, out)) {
 			ret = -EFAULT;
 			break;
 		}
 
+		pfn++;
+		out++;
 		count -= KPMSIZE;
 	}
 

-- 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-08 10:53 ` Wu Fengguang
@ 2009-05-08 10:53   ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall, Alexey Dobriyan,
	Wu Fengguang, linux-mm

[-- Attachment #1: kpageflags-extending.patch --]
[-- Type: text/plain, Size: 6204 bytes --]

Export all page flags faithfully in /proc/kpageflags.

	11. KPF_MMAP		(pseudo flag) memory mapped page
	12. KPF_ANON		(pseudo flag) memory mapped page (anonymous)
	13. KPF_SWAPCACHE	page is in swap cache
	14. KPF_SWAPBACKED	page is swap/RAM backed
	15. KPF_COMPOUND_HEAD	(*)
	16. KPF_COMPOUND_TAIL	(*)
	17. KPF_HUGE		hugeTLB pages
	18. KPF_UNEVICTABLE	page is in the unevictable LRU list
	19. KPF_HWPOISON(TBD)	hardware detected corruption
	20. KPF_NOPAGE		(pseudo flag) no page frame at the address
	32-39.			more obscure flags for kernel developers

	(*) For compound pages, exporting _both_ head/tail info enables
	    users to tell where a compound page starts/ends, and its order.

The accompanying page-types tool will handle details like decoupling
overloaded flags and hiding obscure flags from normal users.

Thanks to KOSAKI and Andi for their valuable recommendations!

Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/proc/page.c |  148 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 118 insertions(+), 30 deletions(-)

--- linux.orig/fs/proc/page.c
+++ linux/fs/proc/page.c
@@ -71,19 +71,124 @@ static const struct file_operations proc
 
 /* These macros are used to decouple internal flags from exported ones */
 
-#define KPF_LOCKED     0
-#define KPF_ERROR      1
-#define KPF_REFERENCED 2
-#define KPF_UPTODATE   3
-#define KPF_DIRTY      4
-#define KPF_LRU        5
-#define KPF_ACTIVE     6
-#define KPF_SLAB       7
-#define KPF_WRITEBACK  8
-#define KPF_RECLAIM    9
-#define KPF_BUDDY     10
+#define KPF_LOCKED		0
+#define KPF_ERROR		1
+#define KPF_REFERENCED		2
+#define KPF_UPTODATE		3
+#define KPF_DIRTY		4
+#define KPF_LRU			5
+#define KPF_ACTIVE		6
+#define KPF_SLAB		7
+#define KPF_WRITEBACK		8
+#define KPF_RECLAIM		9
+#define KPF_BUDDY		10
+
+/* 11-20: new additions in 2.6.31 */
+#define KPF_MMAP		11
+#define KPF_ANON		12
+#define KPF_SWAPCACHE		13
+#define KPF_SWAPBACKED		14
+#define KPF_COMPOUND_HEAD	15
+#define KPF_COMPOUND_TAIL	16
+#define KPF_HUGE		17
+#define KPF_UNEVICTABLE		18
+#define KPF_NOPAGE		20
+
+/* flags for kernel hacking
+ * WARNING: subject to change, never rely on them!
+ */
+#define KPF_RESERVED		32
+#define KPF_MLOCKED		33
+#define KPF_MAPPEDTODISK	34
+#define KPF_PRIVATE		35
+#define KPF_PRIVATE_2		36
+#define KPF_OWNER_PRIVATE	37
+#define KPF_ARCH		38
+#define KPF_UNCACHED		39
 
-#define kpf_copy_bit(flags, dstpos, srcpos) (((flags >> srcpos) & 1) << dstpos)
+static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
+{
+	return ((kflags >> kbit) & 1) << ubit;
+}
+
+static u64 get_uflags(struct page *page)
+{
+	u64 k;
+	u64 u;
+
+	/*
+	 * pseudo flag: KPF_NOPAGE
+	 * it differentiates a memory hole from a page with no flags
+	 */
+	if (!page)
+		return 1 << KPF_NOPAGE;
+
+	k = page->flags;
+	u = 0;
+
+	/*
+	 * pseudo flags for the well known (anonymous) memory mapped pages
+	 *
+	 * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
+	 * simple test in page_mapped() is not enough.
+	 */
+	if (!PageSlab(page) && page_mapped(page))
+		u |= 1 << KPF_MMAP;
+	if (PageAnon(page))
+		u |= 1 << KPF_ANON;
+
+	/*
+	 * compound pages: export both head/tail info
+	 * they together define a compound page's start/end pos and order
+	 */
+	if (PageHead(page))
+		u |= 1 << KPF_COMPOUND_HEAD;
+	if (PageTail(page))
+		u |= 1 << KPF_COMPOUND_TAIL;
+	if (PageHuge(page))
+		u |= 1 << KPF_HUGE;
+
+	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
+
+	/*
+	 * Caveats on high order pages:
+	 * PG_buddy will only be set on the head page; SLUB/SLQB do the same
+	 * for PG_slab; SLOB won't set PG_slab at all on compound pages.
+	 */
+	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
+	u |= kpf_copy_bit(k, KPF_BUDDY,		PG_buddy);
+
+	u |= kpf_copy_bit(k, KPF_ERROR,		PG_error);
+	u |= kpf_copy_bit(k, KPF_DIRTY,		PG_dirty);
+	u |= kpf_copy_bit(k, KPF_UPTODATE,	PG_uptodate);
+	u |= kpf_copy_bit(k, KPF_WRITEBACK,	PG_writeback);
+
+	u |= kpf_copy_bit(k, KPF_LRU,		PG_lru);
+	u |= kpf_copy_bit(k, KPF_REFERENCED,	PG_referenced);
+	u |= kpf_copy_bit(k, KPF_ACTIVE,	PG_active);
+	u |= kpf_copy_bit(k, KPF_RECLAIM,	PG_reclaim);
+
+	u |= kpf_copy_bit(k, KPF_SWAPCACHE,	PG_swapcache);
+	u |= kpf_copy_bit(k, KPF_SWAPBACKED,	PG_swapbacked);
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+	u |= kpf_copy_bit(k, KPF_UNEVICTABLE,	PG_unevictable);
+	u |= kpf_copy_bit(k, KPF_MLOCKED,	PG_mlocked);
+#endif
+
+#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
+	u |= kpf_copy_bit(k, KPF_UNCACHED,	PG_uncached);
+#endif
+
+	u |= kpf_copy_bit(k, KPF_RESERVED,	PG_reserved);
+	u |= kpf_copy_bit(k, KPF_MAPPEDTODISK,	PG_mappedtodisk);
+	u |= kpf_copy_bit(k, KPF_PRIVATE,	PG_private);
+	u |= kpf_copy_bit(k, KPF_PRIVATE_2,	PG_private_2);
+	u |= kpf_copy_bit(k, KPF_OWNER_PRIVATE,	PG_owner_priv_1);
+	u |= kpf_copy_bit(k, KPF_ARCH,		PG_arch_1);
+
+	return u;
+}
 
 static ssize_t kpageflags_read(struct file *file, char __user *buf,
 			     size_t count, loff_t *ppos)
@@ -93,7 +198,6 @@ static ssize_t kpageflags_read(struct fi
 	unsigned long src = *ppos;
 	unsigned long pfn;
 	ssize_t ret = 0;
-	u64 kflags, uflags;
 
 	pfn = src / KPMSIZE;
 	count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
@@ -105,24 +209,8 @@ static ssize_t kpageflags_read(struct fi
 			ppage = pfn_to_page(pfn);
 		else
 			ppage = NULL;
-		if (!ppage)
-			kflags = 0;
-		else
-			kflags = ppage->flags;
-
-		uflags = kpf_copy_bit(kflags, KPF_LOCKED, PG_locked) |
-			kpf_copy_bit(kflags, KPF_ERROR, PG_error) |
-			kpf_copy_bit(kflags, KPF_REFERENCED, PG_referenced) |
-			kpf_copy_bit(kflags, KPF_UPTODATE, PG_uptodate) |
-			kpf_copy_bit(kflags, KPF_DIRTY, PG_dirty) |
-			kpf_copy_bit(kflags, KPF_LRU, PG_lru) |
-			kpf_copy_bit(kflags, KPF_ACTIVE, PG_active) |
-			kpf_copy_bit(kflags, KPF_SLAB, PG_slab) |
-			kpf_copy_bit(kflags, KPF_WRITEBACK, PG_writeback) |
-			kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
-			kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
 
-		if (put_user(uflags, out)) {
+		if (put_user(get_uflags(ppage), out)) {
 			ret = -EFAULT;
 			break;
 		}

-- 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 5/8] pagemap: document clarifications
  2009-05-08 10:53 ` Wu Fengguang
@ 2009-05-08 10:53   ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Matt Mackall, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: kpageflags-doc-fix.patch --]
[-- Type: text/plain, Size: 1177 bytes --]

Some bit ranges were inclusive and some not.
Fix them to be consistently inclusive.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/pagemap.txt |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- linux.orig/Documentation/vm/pagemap.txt
+++ linux/Documentation/vm/pagemap.txt
@@ -12,9 +12,9 @@ There are three components to pagemap:
    value for each virtual page, containing the following data (from
    fs/proc/task_mmu.c, above pagemap_read):
 
-    * Bits 0-55  page frame number (PFN) if present
+    * Bits 0-54  page frame number (PFN) if present
     * Bits 0-4   swap type if swapped
-    * Bits 5-55  swap offset if swapped
+    * Bits 5-54  swap offset if swapped
     * Bits 55-60 page shift (page size = 1<<page shift)
     * Bit  61    reserved for future use
     * Bit  62    page swapped
@@ -36,7 +36,7 @@ There are three components to pagemap:
  * /proc/kpageflags.  This file contains a 64-bit set of flags for each
    page, indexed by PFN.
 
-   The flags are (from fs/proc/proc_misc, above kpageflags_read):
+   The flags are (from fs/proc/page.c, above kpageflags_read):
 
      0. LOCKED
      1. ERROR

-- 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 6/8] pagemap: document 9 more exported page flags
  2009-05-08 10:53 ` Wu Fengguang
@ 2009-05-08 10:53   ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Matt Mackall, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: kpageflags-doc.patch --]
[-- Type: text/plain, Size: 3034 bytes --]

Also add short descriptions for all of the 20 exported page flags.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/pagemap.txt |   62 +++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

--- linux.orig/Documentation/vm/pagemap.txt
+++ linux/Documentation/vm/pagemap.txt
@@ -49,6 +49,68 @@ There are three components to pagemap:
      8. WRITEBACK
      9. RECLAIM
     10. BUDDY
+    11. MMAP
+    12. ANON
+    13. SWAPCACHE
+    14. SWAPBACKED
+    15. COMPOUND_HEAD
+    16. COMPOUND_TAIL
+    17. HUGE
+    18. UNEVICTABLE
+    20. NOPAGE
+
+Short descriptions of the page flags:
+
+ 0. LOCKED
+    page is being locked for exclusive access, eg. by undergoing read/write IO
+
+ 7. SLAB
+    page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
+    When a compound page is used, SLUB/SLQB will only set this flag on the
+    head page; SLOB will not flag it at all.
+
+10. BUDDY
+    a free memory block managed by the buddy system allocator
+    The buddy system organizes free memory in blocks of various orders.
+    An order N block has 2^N physically contiguous pages, with the BUDDY flag
+    set for and _only_ for the first page.
+
+15. COMPOUND_HEAD
+16. COMPOUND_TAIL
+    A compound page with order N consists of 2^N physically contiguous pages.
+    A compound page with order 2 takes the form of "HTTT", where H denotes its
+    head page and T denotes its tail page(s).  The major consumers of compound
+    pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc.
+    memory allocators and various device drivers. However in this interface,
+    only huge/giga pages are made visible to end users.
+17. HUGE
+    this is an integral part of a HugeTLB page
+
+20. NOPAGE
+    no page frame exists at the requested address
+
+    [IO related page flags]
+ 1. ERROR     IO error occurred
+ 3. UPTODATE  page has up-to-date data
+              ie. for file backed page: (in-memory data revision >= on-disk one)
+ 4. DIRTY     page has been written to, hence contains new data
+              ie. for file backed page: (in-memory data revision >  on-disk one)
+ 8. WRITEBACK page is being synced to disk
+
+    [LRU related page flags]
+ 5. LRU         page is in one of the LRU lists
+ 6. ACTIVE      page is in the active LRU list
+18. UNEVICTABLE page is in the unevictable (non-)LRU list
+                It is pinned in some way and not a candidate for LRU reclaim,
+                eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments
+ 2. REFERENCED  page has been referenced since last LRU list enqueue/requeue
+ 9. RECLAIM     page will be reclaimed soon after its pageout IO completes
+11. MMAP        a memory mapped page
+12. ANON        a memory mapped page that is not part of a file
+13. SWAPCACHE   page is mapped to swap space, ie. has an associated swap entry
+14. SWAPBACKED  page is backed by swap/RAM
+
+The page-types tool in this directory can be used to query the above flags.
 
 Using pagemap to do something useful:
 

-- 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 6/8] pagemap: document 9 more exported page flags
@ 2009-05-08 10:53   ` Wu Fengguang
  0 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Matt Mackall, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: kpageflags-doc.patch --]
[-- Type: text/plain, Size: 3259 bytes --]

Also add short descriptions for all of the 20 exported page flags.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/pagemap.txt |   62 +++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

--- linux.orig/Documentation/vm/pagemap.txt
+++ linux/Documentation/vm/pagemap.txt
@@ -49,6 +49,68 @@ There are three components to pagemap:
      8. WRITEBACK
      9. RECLAIM
     10. BUDDY
+    11. MMAP
+    12. ANON
+    13. SWAPCACHE
+    14. SWAPBACKED
+    15. COMPOUND_HEAD
+    16. COMPOUND_TAIL
+    16. HUGE
+    18. UNEVICTABLE
+    20. NOPAGE
+
+Short descriptions to the page flags:
+
+ 0. LOCKED
+    page is being locked for exclusive access, eg. by undergoing read/write IO
+
+ 7. SLAB
+    page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
+    When compound page is used, SLUB/SLQB will only set this flag on the head
+    page; SLOB will not flag it at all.
+
+10. BUDDY
+    a free memory block managed by the buddy system allocator
+    The buddy system organizes free memory in blocks of various orders.
+    An order N block has 2^N physically contiguous pages, with the BUDDY flag
+    set for and _only_ for the first page.
+
+15. COMPOUND_HEAD
+16. COMPOUND_TAIL
+    A compound page with order N consists of 2^N physically contiguous pages.
+    A compound page with order 2 takes the form of "HTTT", where H donates its
+    head page and T donates its tail page(s).  The major consumers of compound
+    pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc.
+    memory allocators and various device drivers. However in this interface,
+    only huge/giga pages are made visible to end users.
+17. HUGE
+    this is an integral part of a HugeTLB page
+
+20. NOPAGE
+    no page frame exists at the requested address
+
+    [IO related page flags]
+ 1. ERROR     IO error occurred
+ 3. UPTODATE  page has up-to-date data
+              ie. for file backed page: (in-memory data revision >= on-disk one)
+ 4. DIRTY     page has been written to, hence contains new data
+              ie. for file backed page: (in-memory data revision >  on-disk one)
+ 8. WRITEBACK page is being synced to disk
+
+    [LRU related page flags]
+ 5. LRU         page is in one of the LRU lists
+ 6. ACTIVE      page is in the active LRU list
+18. UNEVICTABLE page is in the unevictable (non-)LRU list
+                It is somehow pinned and not a candidate for LRU page reclaims,
+		eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments
+ 2. REFERENCED  page has been referenced since last LRU list enqueue/requeue
+ 9. RECLAIM     page will be reclaimed soon after its pageout IO completes
+11. MMAP        a memory mapped page
+12. ANON        a memory mapped page that is not part of a file
+13. SWAPCACHE   page is mapped to swap space, ie. has an associated swap entry
+14. SWAPBACKED  page is backed by swap/RAM
+
+The page-types tool in this directory can be used to query the above flags.
 
 Using pagemap to do something useful:
 

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 7/8] pagemap: add page-types tool
  2009-05-08 10:53 ` Wu Fengguang
@ 2009-05-08 10:53   ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Wu Fengguang, Matt Mackall, KOSAKI Motohiro, linux-mm

[-- Attachment #1: page-types.patch --]
[-- Type: text/plain, Size: 16319 bytes --]

Add page-types, a handy tool for querying page flags.

It will expand some of the overloaded flags:
	PG_slob_free   = PG_private
	PG_slub_frozen = PG_active
	PG_slub_debug  = PG_error
	PG_readahead   = PG_reclaim

and mask out obscure flags except in -raw mode:
	PG_reserved
	PG_mlocked
	PG_mappedtodisk
	PG_private
	PG_private_2
	PG_owner_priv_1
	PG_arch_1
	PG_uncached
	PG_compound* for non hugeTLB pages

CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/Makefile     |    2 
 Documentation/vm/page-types.c |  698 ++++++++++++++++++++++++++++++++
 2 files changed, 699 insertions(+), 1 deletion(-)

--- /dev/null
+++ linux/Documentation/vm/page-types.c
@@ -0,0 +1,698 @@
+/*
+ * page-types: Tool for querying page flags
+ *
+ * Copyright (C) 2009 Intel Corporation
+ * Copyright (C) 2009 Wu Fengguang <fengguang.wu@intel.com>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <stdarg.h>
+#include <string.h>
+#include <getopt.h>
+#include <limits.h>
+#include <sys/types.h>
+#include <sys/errno.h>
+#include <sys/fcntl.h>
+
+
+/*
+ * kernel page flags
+ */
+
+#define KPF_BYTES		8
+#define PROC_KPAGEFLAGS		"/proc/kpageflags"
+
+/* copied from kpageflags_read() */
+#define KPF_LOCKED		0
+#define KPF_ERROR		1
+#define KPF_REFERENCED		2
+#define KPF_UPTODATE		3
+#define KPF_DIRTY		4
+#define KPF_LRU			5
+#define KPF_ACTIVE		6
+#define KPF_SLAB		7
+#define KPF_WRITEBACK		8
+#define KPF_RECLAIM		9
+#define KPF_BUDDY		10
+
+/* [11-20] new additions in 2.6.31 */
+#define KPF_MMAP		11
+#define KPF_ANON		12
+#define KPF_SWAPCACHE		13
+#define KPF_SWAPBACKED		14
+#define KPF_COMPOUND_HEAD	15
+#define KPF_COMPOUND_TAIL	16
+#define KPF_HUGE		17
+#define KPF_UNEVICTABLE		18
+#define KPF_NOPAGE		20
+
+/* [32-] kernel hacking assistance */
+#define KPF_RESERVED		32
+#define KPF_MLOCKED		33
+#define KPF_MAPPEDTODISK	34
+#define KPF_PRIVATE		35
+#define KPF_PRIVATE_2		36
+#define KPF_OWNER_PRIVATE	37
+#define KPF_ARCH		38
+#define KPF_UNCACHED		39
+
+/* [48-] take some arbitrary free slots for expanding overloaded flags
+ * not part of kernel API
+ */
+#define KPF_READAHEAD		48
+#define KPF_SLOB_FREE		49
+#define KPF_SLUB_FROZEN		50
+#define KPF_SLUB_DEBUG		51
+
+#define KPF_ALL_BITS		((uint64_t)~0ULL)
+#define KPF_HACKERS_BITS	(0xffffULL << 32)
+#define KPF_OVERLOADED_BITS	(0xffffULL << 48)
+#define BIT(name)		(1ULL << KPF_##name)
+#define BITS_COMPOUND		(BIT(COMPOUND_HEAD) | BIT(COMPOUND_TAIL))
+
+static char *page_flag_names[] = {
+	[KPF_LOCKED]		= "L:locked",
+	[KPF_ERROR]		= "E:error",
+	[KPF_REFERENCED]	= "R:referenced",
+	[KPF_UPTODATE]		= "U:uptodate",
+	[KPF_DIRTY]		= "D:dirty",
+	[KPF_LRU]		= "l:lru",
+	[KPF_ACTIVE]		= "A:active",
+	[KPF_SLAB]		= "S:slab",
+	[KPF_WRITEBACK]		= "W:writeback",
+	[KPF_RECLAIM]		= "I:reclaim",
+	[KPF_BUDDY]		= "B:buddy",
+
+	[KPF_MMAP]		= "M:mmap",
+	[KPF_ANON]		= "a:anonymous",
+	[KPF_SWAPCACHE]		= "s:swapcache",
+	[KPF_SWAPBACKED]	= "b:swapbacked",
+	[KPF_COMPOUND_HEAD]	= "H:compound_head",
+	[KPF_COMPOUND_TAIL]	= "T:compound_tail",
+	[KPF_HUGE]		= "G:huge",
+	[KPF_UNEVICTABLE]	= "u:unevictable",
+	[KPF_NOPAGE]		= "n:nopage",
+
+	[KPF_RESERVED]		= "r:reserved",
+	[KPF_MLOCKED]		= "m:mlocked",
+	[KPF_MAPPEDTODISK]	= "d:mappedtodisk",
+	[KPF_PRIVATE]		= "P:private",
+	[KPF_PRIVATE_2]		= "p:private_2",
+	[KPF_OWNER_PRIVATE]	= "O:owner_private",
+	[KPF_ARCH]		= "h:arch",
+	[KPF_UNCACHED]		= "c:uncached",
+
+	[KPF_READAHEAD]		= "I:readahead",
+	[KPF_SLOB_FREE]		= "P:slob_free",
+	[KPF_SLUB_FROZEN]	= "A:slub_frozen",
+	[KPF_SLUB_DEBUG]	= "E:slub_debug",
+};
+
+
+/*
+ * data structures
+ */
+
+static int		opt_raw;	/* for kernel developers */
+static int		opt_list;	/* list pages (in ranges) */
+static int		opt_no_summary;	/* don't show summary */
+static pid_t		opt_pid;	/* process to walk */
+
+#define MAX_ADDR_RANGES	1024
+static int		nr_addr_ranges;
+static unsigned long	opt_offset[MAX_ADDR_RANGES];
+static unsigned long	opt_size[MAX_ADDR_RANGES];
+
+#define MAX_BIT_FILTERS	64
+static int		nr_bit_filters;
+static uint64_t		opt_mask[MAX_BIT_FILTERS];
+static uint64_t		opt_bits[MAX_BIT_FILTERS];
+
+static int		page_size;
+
+#define PAGES_BATCH	(64 << 10)	/* 64k pages */
+static int		kpageflags_fd;
+static uint64_t		kpageflags_buf[KPF_BYTES * PAGES_BATCH];
+
+#define HASH_SHIFT	13
+#define HASH_SIZE	(1 << HASH_SHIFT)
+#define HASH_MASK	(HASH_SIZE - 1)
+#define HASH_KEY(flags)	(flags & HASH_MASK)
+
+static unsigned long	total_pages;
+static unsigned long	nr_pages[HASH_SIZE];
+static uint64_t 	page_flags[HASH_SIZE];
+
+
+/*
+ * helper functions
+ */
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+
+#define min_t(type, x, y) ({			\
+	type __min1 = (x);			\
+	type __min2 = (y);			\
+	__min1 < __min2 ? __min1 : __min2; })
+
+unsigned long pages2mb(unsigned long pages)
+{
+	return (pages * page_size) >> 20;
+}
+
+void fatal(const char *x, ...)
+{
+	va_list ap;
+
+	va_start(ap, x);
+	vfprintf(stderr, x, ap);
+	va_end(ap);
+	exit(EXIT_FAILURE);
+}
+
+
+/*
+ * page flag names
+ */
+
+char *page_flag_name(uint64_t flags)
+{
+	static char buf[65];
+	int present;
+	int i, j;
+
+	for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) {
+		present = (flags >> i) & 1;
+		if (!page_flag_names[i]) {
+			if (present)
+				fatal("unknown flag bit %d\n", i);
+			continue;
+		}
+		buf[j++] = present ? page_flag_names[i][0] : '_';
+	}
+
+	return buf;
+}
+
+char *page_flag_longname(uint64_t flags)
+{
+	static char buf[1024];
+	int i, n;
+
+	for (i = 0, n = 0; i < ARRAY_SIZE(page_flag_names); i++) {
+		if (!page_flag_names[i])
+			continue;
+		if ((flags >> i) & 1)
+			n += snprintf(buf + n, sizeof(buf) - n, "%s,",
+					page_flag_names[i] + 2);
+	}
+	if (n)
+		n--;
+	buf[n] = '\0';
+
+	return buf;
+}
+
+
+/*
+ * page list and summary
+ */
+
+void show_page_range(unsigned long offset, uint64_t flags)
+{
+	static uint64_t      flags0;
+	static unsigned long index;
+	static unsigned long count;
+
+	if (flags == flags0 && offset == index + count) {
+		count++;
+		return;
+	}
+
+	if (count)
+		printf("%lu\t%lu\t%s\n",
+				index, count, page_flag_name(flags0));
+
+	flags0 = flags;
+	index  = offset;
+	count  = 1;
+}
+
+void show_page(unsigned long offset, uint64_t flags)
+{
+	printf("%lu\t%s\n", offset, page_flag_name(flags));
+}
+
+void show_summary()
+{
+	int i;
+
+	printf("             flags\tpage-count       MB"
+		"  symbolic-flags\t\t\tlong-symbolic-flags\n");
+
+	for (i = 0; i < ARRAY_SIZE(nr_pages); i++) {
+		if (nr_pages[i])
+			printf("0x%016llx\t%10lu %8lu  %s\t%s\n",
+				(unsigned long long)page_flags[i],
+				nr_pages[i],
+				pages2mb(nr_pages[i]),
+				page_flag_name(page_flags[i]),
+				page_flag_longname(page_flags[i]));
+	}
+
+	printf("             total\t%10lu %8lu\n",
+			total_pages, pages2mb(total_pages));
+}
+
+
+/*
+ * page flag filters
+ */
+
+int bit_mask_ok(uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nr_bit_filters; i++) {
+		if (opt_bits[i] == KPF_ALL_BITS) {
+			if ((flags & opt_mask[i]) == 0)
+				return 0;
+		} else {
+			if ((flags & opt_mask[i]) != opt_bits[i])
+				return 0;
+		}
+	}
+
+	return 1;
+}
+
+uint64_t expand_overloaded_flags(uint64_t flags)
+{
+	/* SLOB/SLUB overload several page flags */
+	if (flags & BIT(SLAB)) {
+		if (flags & BIT(PRIVATE))
+			flags ^= BIT(PRIVATE) | BIT(SLOB_FREE);
+		if (flags & BIT(ACTIVE))
+			flags ^= BIT(ACTIVE) | BIT(SLUB_FROZEN);
+		if (flags & BIT(ERROR))
+			flags ^= BIT(ERROR) | BIT(SLUB_DEBUG);
+	}
+
+	/* PG_reclaim is overloaded as PG_readahead in the read path */
+	if ((flags & (BIT(RECLAIM) | BIT(WRITEBACK))) == BIT(RECLAIM))
+		flags ^= BIT(RECLAIM) | BIT(READAHEAD);
+
+	return flags;
+}
+
+uint64_t well_known_flags(uint64_t flags)
+{
+	/* hide flags intended only for kernel hackers */
+	flags &= ~KPF_HACKERS_BITS;
+
+	/* hide non-hugeTLB compound pages */
+	if ((flags & BITS_COMPOUND) && !(flags & BIT(HUGE)))
+		flags &= ~BITS_COMPOUND;
+
+	return flags;
+}
+
+
+/*
+ * page frame walker
+ */
+
+int hash_slot(uint64_t flags)
+{
+	int k = HASH_KEY(flags);
+	int i;
+
+	/* Explicitly reserve slot 0 for flags 0: the following logic
+	 * cannot distinguish an unoccupied slot from the slot for flags==0.
+	 */
+	if (flags == 0)
+		return 0;
+
+	/* search through the remaining (HASH_SIZE-1) slots */
+	for (i = 1; i < ARRAY_SIZE(page_flags); i++, k++) {
+		if (!k || k >= ARRAY_SIZE(page_flags))
+			k = 1;
+		if (page_flags[k] == 0) {
+			page_flags[k] = flags;
+			return k;
+		}
+		if (page_flags[k] == flags)
+			return k;
+	}
+
+	fatal("hash table full: bump up HASH_SHIFT?\n");
+	exit(EXIT_FAILURE);
+}
+
+void add_page(unsigned long offset, uint64_t flags)
+{
+	flags = expand_overloaded_flags(flags);
+
+	if (!opt_raw)
+		flags = well_known_flags(flags);
+
+	if (!bit_mask_ok(flags))
+		return;
+
+	if (opt_list == 1)
+		show_page_range(offset, flags);
+	else if (opt_list == 2)
+		show_page(offset, flags);
+
+	nr_pages[hash_slot(flags)]++;
+	total_pages++;
+}
+
+void walk_pfn(unsigned long index, unsigned long count)
+{
+	unsigned long batch;
+	long n;			/* signed: read() can return -1 */
+	unsigned long i;
+
+	if (index > ULONG_MAX / KPF_BYTES)
+		fatal("index overflow: %lu\n", index);
+
+	lseek(kpageflags_fd, index * KPF_BYTES, SEEK_SET);
+
+	while (count) {
+		batch = min_t(unsigned long, count, PAGES_BATCH);
+		n = read(kpageflags_fd, kpageflags_buf, batch * KPF_BYTES);
+		if (n == 0)
+			break;
+		if (n < 0) {
+			perror(PROC_KPAGEFLAGS);
+			exit(EXIT_FAILURE);
+		}
+
+		if (n % KPF_BYTES != 0)
+			fatal("partial read: %lu bytes\n", n);
+		n = n / KPF_BYTES;
+
+		for (i = 0; i < n; i++)
+			add_page(index + i, kpageflags_buf[i]);
+
+		index += batch;
+		count -= batch;
+	}
+}
+
+void walk_addr_ranges(void)
+{
+	int i;
+
+	kpageflags_fd = open(PROC_KPAGEFLAGS, O_RDONLY);
+	if (kpageflags_fd < 0) {
+		perror(PROC_KPAGEFLAGS);
+		exit(EXIT_FAILURE);
+	}
+
+	if (!nr_addr_ranges)
+		walk_pfn(0, ULONG_MAX);
+
+	for (i = 0; i < nr_addr_ranges; i++)
+		walk_pfn(opt_offset[i], opt_size[i]);
+
+	close(kpageflags_fd);
+}
+
+
+/*
+ * user interface
+ */
+
+const char *page_flag_type(uint64_t flag)
+{
+	if (flag & KPF_HACKERS_BITS)
+		return "(r)";
+	if (flag & KPF_OVERLOADED_BITS)
+		return "(o)";
+	return "   ";
+}
+
+void usage(void)
+{
+	int i, j;
+
+	printf(
+"page-types [options]\n"
+"            -r|--raw                  Raw mode, for kernel developers\n"
+"            -a|--addr    addr-spec    Walk a range of pages\n"
+"            -b|--bits    bits-spec    Walk pages with specified bits\n"
+#if 0 /* planned features */
+"            -p|--pid     pid          Walk process address space\n"
+"            -f|--file    filename     Walk file address space\n"
+#endif
+"            -l|--list                 Show page details in ranges\n"
+"            -L|--list-each            Show page details one by one\n"
+"            -N|--no-summary           Don't show summay info\n"
+"            -h|--help                 Show this usage message\n"
+"addr-spec:\n"
+"            N                         one page at offset N (unit: pages)\n"
+"            N+M                       pages range from N to N+M-1\n"
+"            N,M                       pages range from N to M-1\n"
+"            N,                        pages range from N to end\n"
+"            ,M                        pages range from 0 to M\n"
+"bits-spec:\n"
+"            bit1,bit2                 (flags & (bit1|bit2)) != 0\n"
+"            bit1,bit2=bit1            (flags & (bit1|bit2)) == bit1\n"
+"            bit1,~bit2                (flags & (bit1|bit2)) == bit1\n"
+"            =bit1,bit2                flags == (bit1|bit2)\n"
+"bit-names:\n"
+	);
+
+	for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) {
+		if (!page_flag_names[i])
+			continue;
+		printf("%16s%s", page_flag_names[i] + 2,
+				 page_flag_type(1ULL << i));
+		if (++j > 3) {
+			j = 0;
+			putchar('\n');
+		}
+	}
+	printf("\n                                   "
+		"(r) raw mode bits  (o) overloaded bits\n");
+}
+
+unsigned long long parse_number(const char *str)
+{
+	unsigned long long n;
+
+	n = strtoll(str, NULL, 0);
+
+	if (n == 0 && str[0] != '0')
+		fatal("invalid name or number: %s\n", str);
+
+	return n;
+}
+
+void parse_pid(const char *str)
+{
+	opt_pid = parse_number(str);
+}
+
+void parse_file(const char *name)
+{
+}
+
+void add_addr_range(unsigned long offset, unsigned long size)
+{
+	if (nr_addr_ranges >= MAX_ADDR_RANGES)
+		fatal("too many addr ranges\n");
+
+	opt_offset[nr_addr_ranges] = offset;
+	opt_size[nr_addr_ranges] = size;
+	nr_addr_ranges++;
+}
+
+void parse_addr_range(const char *optarg)
+{
+	unsigned long offset;
+	unsigned long size;
+	char *p;
+
+	p = strchr(optarg, ',');
+	if (!p)
+		p = strchr(optarg, '+');
+
+	if (p == optarg) {
+		offset = 0;
+		size   = parse_number(p + 1);
+	} else if (p) {
+		offset = parse_number(optarg);
+		if (p[1] == '\0')
+			size = ULONG_MAX;
+		else {
+			size = parse_number(p + 1);
+			if (*p == ',') {
+				if (size < offset)
+					fatal("invalid range: %lu,%lu\n",
+							offset, size);
+				size -= offset;
+			}
+		}
+	} else {
+		offset = parse_number(optarg);
+		size   = 1;
+	}
+
+	add_addr_range(offset, size);
+}
+
+void add_bits_filter(uint64_t mask, uint64_t bits)
+{
+	if (nr_bit_filters >= MAX_BIT_FILTERS)
+		fatal("too many bit filters\n");
+
+	opt_mask[nr_bit_filters] = mask;
+	opt_bits[nr_bit_filters] = bits;
+	nr_bit_filters++;
+}
+
+uint64_t parse_flag_name(const char *str, int len)
+{
+	int i;
+
+	if (!*str || !len)
+		return 0;
+
+	if (len <= 8 && !strncmp(str, "compound", len))
+		return BITS_COMPOUND;
+
+	for (i = 0; i < ARRAY_SIZE(page_flag_names); i++) {
+		if (!page_flag_names[i])
+			continue;
+		if (!strncmp(str, page_flag_names[i] + 2, len))
+			return 1ULL << i;
+	}
+
+	return parse_number(str);
+}
+
+uint64_t parse_flag_names(const char *str, int all)
+{
+	const char *p    = str;
+	uint64_t   flags = 0;
+
+	while (1) {
+		if (*p == ',' || *p == '=' || *p == '\0') {
+			if ((*str != '~') || (*str == '~' && all && *++str))
+				flags |= parse_flag_name(str, p - str);
+			if (*p != ',')
+				break;
+			str = p + 1;
+		}
+		p++;
+	}
+
+	return flags;
+}
+
+void parse_bits_mask(const char *optarg)
+{
+	uint64_t mask;
+	uint64_t bits;
+	const char *p;
+
+	p = strchr(optarg, '=');
+	if (p == optarg) {
+		mask = KPF_ALL_BITS;
+		bits = parse_flag_names(p + 1, 0);
+	} else if (p) {
+		mask = parse_flag_names(optarg, 0);
+		bits = parse_flag_names(p + 1, 0);
+	} else if (strchr(optarg, '~')) {
+		mask = parse_flag_names(optarg, 1);
+		bits = parse_flag_names(optarg, 0);
+	} else {
+		mask = parse_flag_names(optarg, 0);
+		bits = KPF_ALL_BITS;
+	}
+
+	add_bits_filter(mask, bits);
+}
+
+
+struct option opts[] = {
+	{ "raw"       , 0, NULL, 'r' },
+	{ "pid"       , 1, NULL, 'p' },
+	{ "file"      , 1, NULL, 'f' },
+	{ "addr"      , 1, NULL, 'a' },
+	{ "bits"      , 1, NULL, 'b' },
+	{ "list"      , 0, NULL, 'l' },
+	{ "list-each" , 0, NULL, 'L' },
+	{ "no-summary", 0, NULL, 'N' },
+	{ "help"      , 0, NULL, 'h' },
+	{ NULL        , 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+	int c;
+
+	page_size = getpagesize();
+
+	while ((c = getopt_long(argc, argv,
+				"rp:f:a:b:lLNh", opts, NULL)) != -1) {
+		switch (c) {
+		case 'r':
+			opt_raw = 1;
+			break;
+		case 'p':
+			parse_pid(optarg);
+			break;
+		case 'f':
+			parse_file(optarg);
+			break;
+		case 'a':
+			parse_addr_range(optarg);
+			break;
+		case 'b':
+			parse_bits_mask(optarg);
+			break;
+		case 'l':
+			opt_list = 1;
+			break;
+		case 'L':
+			opt_list = 2;
+			break;
+		case 'N':
+			opt_no_summary = 1;
+			break;
+		case 'h':
+			usage();
+			exit(0);
+		default:
+			usage();
+			exit(1);
+		}
+	}
+
+	if (opt_list == 1)
+		printf("offset\tcount\tflags\n");
+	if (opt_list == 2)
+		printf("offset\tflags\n");
+
+	walk_addr_ranges();
+
+	if (opt_list == 1)
+		show_page_range(0, 0);  /* drain the buffer */
+
+	if (opt_no_summary)
+		return 0;
+
+	if (opt_list)
+		printf("\n\n");
+
+	show_summary();
+
+	return 0;
+}
--- linux.orig/Documentation/vm/Makefile
+++ linux/Documentation/vm/Makefile
@@ -2,7 +2,7 @@
 obj- := dummy.o
 
 # List of programs to build
-hostprogs-y := slabinfo
+hostprogs-y := slabinfo slqbinfo page-types
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)

-- 


^ permalink raw reply	[flat|nested] 92+ messages in thread

+			break;
+		case 'L':
+			opt_list = 2;
+			break;
+		case 'N':
+			opt_no_summary = 1;
+			break;
+		case 'h':
+			usage();
+			exit(0);
+		default:
+			usage();
+			exit(1);
+		}
+	}
+
+	if (opt_list == 1)
+		printf("offset\tcount\tflags\n");
+	if (opt_list == 2)
+		printf("offset\tflags\n");
+
+	walk_addr_ranges();
+
+	if (opt_list == 1)
+		show_page_range(0, 0);  /* drain the buffer */
+
+	if (opt_no_summary)
+		return 0;
+
+	if (opt_list)
+		printf("\n\n");
+
+	show_summary();
+
+	return 0;
+}
--- linux.orig/Documentation/vm/Makefile
+++ linux/Documentation/vm/Makefile
@@ -2,7 +2,7 @@
 obj- := dummy.o
 
 # List of programs to build
-hostprogs-y := slabinfo
+hostprogs-y := slabinfo slqbinfo page-types
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 8/8] pagemap: export PG_hwpoison
  2009-05-08 10:53 ` Wu Fengguang
@ 2009-05-08 10:53   ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Wu Fengguang, Matt Mackall, KOSAKI Motohiro, linux-mm

[-- Attachment #1: kpageflags-hwpoison.patch --]
[-- Type: text/plain, Size: 2161 bytes --]

This flag indicates hardware-detected memory corruption on the page.
Any future access of the page data may bring down the machine.

CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/page-types.c |    2 ++
 Documentation/vm/pagemap.txt  |    4 ++++
 fs/proc/page.c                |    5 +++++
 3 files changed, 11 insertions(+)

--- linux.orig/fs/proc/page.c
+++ linux/fs/proc/page.c
@@ -92,6 +92,7 @@ static const struct file_operations proc
 #define KPF_COMPOUND_TAIL	16
 #define KPF_HUGE		17
 #define KPF_UNEVICTABLE		18
+#define KPF_HWPOISON		19
 #define KPF_NOPAGE		20
 
 /* kernel hacking assistances
@@ -171,6 +172,10 @@ static u64 get_uflags(struct page *page)
 	u |= kpf_copy_bit(k, KPF_SWAPCACHE,	PG_swapcache);
 	u |= kpf_copy_bit(k, KPF_SWAPBACKED,	PG_swapbacked);
 
+#ifdef CONFIG_MEMORY_FAILURE
+	u |= kpf_copy_bit(k, KPF_HWPOISON,	PG_hwpoison);
+#endif
+
 #ifdef CONFIG_UNEVICTABLE_LRU
 	u |= kpf_copy_bit(k, KPF_UNEVICTABLE,	PG_unevictable);
 	u |= kpf_copy_bit(k, KPF_MLOCKED,	PG_mlocked);
--- linux.orig/Documentation/vm/page-types.c
+++ linux/Documentation/vm/page-types.c
@@ -47,6 +47,7 @@
 #define KPF_COMPOUND_TAIL	16
 #define KPF_HUGE		17
 #define KPF_UNEVICTABLE		18
+#define KPF_HWPOISON		19
 #define KPF_NOPAGE		20
 
 /* [32-] kernel hacking assistances */
@@ -94,6 +95,7 @@ static char *page_flag_names[] = {
 	[KPF_COMPOUND_TAIL]	= "T:compound_tail",
 	[KPF_HUGE]		= "G:huge",
 	[KPF_UNEVICTABLE]	= "u:unevictable",
+	[KPF_HWPOISON]		= "X:hwpoison",
 	[KPF_NOPAGE]		= "n:nopage",
 
 	[KPF_RESERVED]		= "r:reserved",
--- linux.orig/Documentation/vm/pagemap.txt
+++ linux/Documentation/vm/pagemap.txt
@@ -57,6 +57,7 @@ There are three components to pagemap:
     16. COMPOUND_TAIL
     17. HUGE
     18. UNEVICTABLE
+    19. HWPOISON
     20. NOPAGE
 
 Short descriptions to the page flags:
@@ -86,6 +87,9 @@ Short descriptions to the page flags:
 17. HUGE
     this is an integral part of a HugeTLB page
 
+19. HWPOISON
+    hardware detected memory corruption on this page: don't touch the data!
+
 20. NOPAGE
     no page frame exists at the requested address
 

-- 



* Re: [PATCH 1/8] mm: introduce PageHuge() for testing huge/gigantic pages
  2009-05-08 10:53   ` Wu Fengguang
@ 2009-05-08 11:40     ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-08 11:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Matt Mackall, KOSAKI Motohiro, Andi Kleen, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> Introduce PageHuge(), which identifies huge/gigantic pages
> by their dedicated compound destructor functions.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/mm.h |   24 ++++++++++++++++++++++++
>  mm/hugetlb.c       |    2 +-
>  mm/page_alloc.c    |   11 ++++++++++-
>  3 files changed, 35 insertions(+), 2 deletions(-)
> 
> --- linux.orig/mm/page_alloc.c
> +++ linux/mm/page_alloc.c
> @@ -299,13 +299,22 @@ void prep_compound_page(struct page *pag
>  }
>  
>  #ifdef CONFIG_HUGETLBFS
> +/*
> + * This (duplicated) destructor function distinguishes gigantic pages from
> + * normal compound pages.
> + */
> +void free_gigantic_page(struct page *page)
> +{
> +	__free_pages_ok(page, compound_order(page));
> +}
> +
>  void prep_compound_gigantic_page(struct page *page, unsigned long order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
>  	struct page *p = page + 1;
>  
> -	set_compound_page_dtor(page, free_compound_page);
> +	set_compound_page_dtor(page, free_gigantic_page);
>  	set_compound_order(page, order);
>  	__SetPageHead(page);
>  	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -550,7 +550,7 @@ struct hstate *size_to_hstate(unsigned l
>  	return NULL;
>  }
>  
> -static void free_huge_page(struct page *page)
> +void free_huge_page(struct page *page)
>  {
>  	/*
>  	 * Can't pass hstate in here because it is called from the
> --- linux.orig/include/linux/mm.h
> +++ linux/include/linux/mm.h
> @@ -355,6 +355,30 @@ static inline void set_compound_order(st
>  	page[1].lru.prev = (void *)order;
>  }
>  
> +#ifdef CONFIG_HUGETLBFS
> +void free_huge_page(struct page *page);
> +void free_gigantic_page(struct page *page);
> +
> +static inline int PageHuge(struct page *page)
> +{
> +	compound_page_dtor *dtor;
> +
> +	if (!PageCompound(page))
> +		return 0;
> +
> +	page = compound_head(page);
> +	dtor = get_compound_page_dtor(page);
> +
> +	return  dtor == free_huge_page ||
> +		dtor == free_gigantic_page;
> +}

Hm, this function is _way_ too large to be inlined.

	Ingo


* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-08 10:53   ` Wu Fengguang
@ 2009-05-08 11:47     ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-08 11:47 UTC (permalink / raw)
  To: Wu Fengguang, Frédéric Weisbecker, Steven Rostedt,
	Peter Zijlstra, Li Zefan
  Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall,
	Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> Export all page flags faithfully in /proc/kpageflags.

Ongoing objection and NAK against extended haphazard exporting of 
kernel internals via an ad-hoc ABI via ad-hoc, privatized 
instrumentation that only helps the MM code and nothing else. It was 
a mistake to introduce the /proc/kpageflags hack a year ago, and it
is even more wrong today to expand on it.

/proc/kpageflags should be done via the proper methods outlined in 
the previous mails i wrote on this topic: for example by using the 
'object collections' abstraction i suggested. Clean enumeration of 
all pages (files, tasks, etc.) and the definition of histograms over 
it via free-form filter expressions is the right way to do this. It 
would not only help other subsystems, it would also be far more 
capable.

So this should be done in cooperation with instrumentation folks, 
while improving _all_ of Linux instrumentation in general. Or, if 
you dont have the time/interest to work with us on that, it should 
not be done at all. Not having the resources/interest to do 
something properly is not a license to introduce further 
instrumentation crap into Linux.

	Ingo


* Re: [PATCH 8/8] pagemap: export PG_hwpoison
  2009-05-08 10:53   ` Wu Fengguang
@ 2009-05-08 11:49     ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-08 11:49 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Andi Kleen, Matt Mackall, KOSAKI Motohiro, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> This flag indicates a hardware detected memory corruption on the 
> page. Any future access of the page data may bring down the 
> machine.

NAK on this whole idea, it's utterly harmful. At _minimum_ 
/proc/kpageflags should be moved to /debug/vm/ to not have
any ABI bindings.

	Ingo


* Re: [PATCH 1/8] mm: introduce PageHuge() for testing huge/gigantic pages
  2009-05-08 11:40     ` Ingo Molnar
@ 2009-05-08 12:21       ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 12:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, LKML, Matt Mackall, KOSAKI Motohiro, Andi Kleen, linux-mm

On Fri, May 08, 2009 at 07:40:18PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Introduce PageHuge(), which identifies huge/gigantic pages
> > by their dedicated compound destructor functions.
[snip]
> > +#ifdef CONFIG_HUGETLBFS
> > +void free_huge_page(struct page *page);
> > +void free_gigantic_page(struct page *page);
> > +
> > +static inline int PageHuge(struct page *page)
> > +{
> > +	compound_page_dtor *dtor;
> > +
> > +	if (!PageCompound(page))
> > +		return 0;
> > +
> > +	page = compound_head(page);
> > +	dtor = get_compound_page_dtor(page);
> > +
> > +	return  dtor == free_huge_page ||
> > +		dtor == free_gigantic_page;
> > +}
> 
> Hm, this function is _way_ too large to be inlined.

Thanks, updated patch as follows.

---
Subject: mm: introduce PageHuge() for testing huge/gigantic pages

Introduce PageHuge(), which identifies huge/gigantic pages
by their dedicated compound destructor functions.

Also move prep_compound_gigantic_page() to hugetlb.c and
make __free_pages_ok() non-static.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/mm.h |    9 +++
 mm/hugetlb.c       |   98 ++++++++++++++++++++++++++++---------------
 mm/internal.h      |    6 +-
 mm/page_alloc.c    |   21 ---------
 4 files changed, 79 insertions(+), 55 deletions(-)

--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -77,8 +77,6 @@ int percpu_pagelist_fraction;
 int pageblock_order __read_mostly;
 #endif
 
-static void __free_pages_ok(struct page *page, unsigned int order);
-
 /*
  * results with 256, 32 in the lowmem_reserve sysctl:
  *	1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
@@ -298,23 +296,6 @@ void prep_compound_page(struct page *pag
 	}
 }
 
-#ifdef CONFIG_HUGETLBFS
-void prep_compound_gigantic_page(struct page *page, unsigned long order)
-{
-	int i;
-	int nr_pages = 1 << order;
-	struct page *p = page + 1;
-
-	set_compound_page_dtor(page, free_compound_page);
-	set_compound_order(page, order);
-	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
-		__SetPageTail(p);
-		p->first_page = page;
-	}
-}
-#endif
-
 static int destroy_compound_page(struct page *page, unsigned long order)
 {
 	int i;
@@ -544,7 +525,7 @@ static void free_one_page(struct zone *z
 	spin_unlock(&zone->lock);
 }
 
-static void __free_pages_ok(struct page *page, unsigned int order)
+void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
 	int i;
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -578,39 +578,9 @@ static void free_huge_page(struct page *
 		hugetlb_put_quota(mapping, 1);
 }
 
-/*
- * Increment or decrement surplus_huge_pages.  Keep node-specific counters
- * balanced by operating on them in a round-robin fashion.
- * Returns 1 if an adjustment was made.
- */
-static int adjust_pool_surplus(struct hstate *h, int delta)
+static void free_gigantic_page(struct page *page)
 {
-	static int prev_nid;
-	int nid = prev_nid;
-	int ret = 0;
-
-	VM_BUG_ON(delta != -1 && delta != 1);
-	do {
-		nid = next_node(nid, node_online_map);
-		if (nid == MAX_NUMNODES)
-			nid = first_node(node_online_map);
-
-		/* To shrink on this node, there must be a surplus page */
-		if (delta < 0 && !h->surplus_huge_pages_node[nid])
-			continue;
-		/* Surplus cannot exceed the total number of pages */
-		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
-						h->nr_huge_pages_node[nid])
-			continue;
-
-		h->surplus_huge_pages += delta;
-		h->surplus_huge_pages_node[nid] += delta;
-		ret = 1;
-		break;
-	} while (nid != prev_nid);
-
-	prev_nid = nid;
-	return ret;
+	__free_pages_ok(page, compound_order(page));
 }
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -623,6 +593,35 @@ static void prep_new_huge_page(struct hs
 	put_page(page); /* free it into the hugepage allocator */
 }
 
+static void prep_compound_gigantic_page(struct page *page, unsigned long order)
+{
+	int i;
+	int nr_pages = 1 << order;
+	struct page *p = page + 1;
+
+	set_compound_page_dtor(page, free_gigantic_page);
+	set_compound_order(page, order);
+	__SetPageHead(page);
+	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		__SetPageTail(p);
+		p->first_page = page;
+	}
+}
+
+int PageHuge(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	if (!PageCompound(page))
+		return 0;
+
+	page = compound_head(page);
+	dtor = get_compound_page_dtor(page);
+
+	return  dtor == free_huge_page ||
+		dtor == free_gigantic_page;
+}
+
 static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
@@ -1140,6 +1139,41 @@ static inline void try_to_free_low(struc
 }
 #endif
 
+/*
+ * Increment or decrement surplus_huge_pages.  Keep node-specific counters
+ * balanced by operating on them in a round-robin fashion.
+ * Returns 1 if an adjustment was made.
+ */
+static int adjust_pool_surplus(struct hstate *h, int delta)
+{
+	static int prev_nid;
+	int nid = prev_nid;
+	int ret = 0;
+
+	VM_BUG_ON(delta != -1 && delta != 1);
+	do {
+		nid = next_node(nid, node_online_map);
+		if (nid == MAX_NUMNODES)
+			nid = first_node(node_online_map);
+
+		/* To shrink on this node, there must be a surplus page */
+		if (delta < 0 && !h->surplus_huge_pages_node[nid])
+			continue;
+		/* Surplus cannot exceed the total number of pages */
+		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
+						h->nr_huge_pages_node[nid])
+			continue;
+
+		h->surplus_huge_pages += delta;
+		h->surplus_huge_pages_node[nid] += delta;
+		ret = 1;
+		break;
+	} while (nid != prev_nid);
+
+	prev_nid = nid;
+	return ret;
+}
+
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
 {
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -355,6 +355,15 @@ static inline void set_compound_order(st
 	page[1].lru.prev = (void *)order;
 }
 
+#ifdef CONFIG_HUGETLBFS
+int PageHuge(struct page *page);
+#else
+static inline int PageHuge(struct page *page)
+{
+	return 0;
+}
+#endif
+
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -16,9 +16,6 @@
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
-extern void prep_compound_page(struct page *page, unsigned long order);
-extern void prep_compound_gigantic_page(struct page *page, unsigned long order);
-
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
@@ -51,6 +48,9 @@ extern void putback_lru_page(struct page
  */
 extern unsigned long highest_memmap_pfn;
 extern void __free_pages_bootmem(struct page *page, unsigned int order);
+extern void __free_pages_ok(struct page *page, unsigned int order);
+extern void prep_compound_page(struct page *page, unsigned long order);
+
 
 /*
  * function for dealing with page's order in buddy system.

^ permalink raw reply	[flat|nested] 92+ messages in thread

see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-08 11:47     ` Ingo Molnar
@ 2009-05-08 12:44       ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 12:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

Hi Ingo,

On Fri, May 08, 2009 at 07:47:42PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Export all page flags faithfully in /proc/kpageflags.
> 
> Ongoing objection and NAK against extended haphazard exporting of 
> kernel internals via an ad-hoc ABI via ad-hoc, privatized 
> instrumentation that only helps the MM code and nothing else. It was 
> a mistake to introduce the /proc/kpageflags hack a year ago, and it 
> is even more wrong today to expand on it.

If we cannot abandon it, embrace it. That's my attitude.

> /proc/kpageflags should be done via the proper methods outlined in 
> the previous mails i wrote on this topic: for example by using the 
> 'object collections' abstraction i suggested. Clean enumeration of 
> all pages (files, tasks, etc.) and the definition of histograms over 
> it via free-form filter expressions is the right way to do this. It 
> would not only help other subsystems, it would also be far more 
> capable.

For the new interfaces (files etc.) I'd very much like to use the ftrace
interface. For the existing pagemap interfaces, if they can fulfill
their targeted tasks, why bother making the shift?

When the pagemap interfaces cannot satisfy some new applications,
and ftrace can provide a superset of the pagemap interfaces with
clear advantages while meeting the new demands, then we can schedule
the tear-down of the old interface?

> So this should be done in cooperation with instrumentation folks, 
> while improving _all_ of Linux instrumentation in general. Or, if 
> you dont have the time/interest to work with us on that, it should 
> not be done at all. Not having the resources/interest to do 
> something properly is not a license to introduce further 
> instrumentation crap into Linux.

I'd be glad to work with you on the 'object collections' ftrace
interfaces.  Maybe next month. For now my time has been allocated
for the hwpoison work, sorry!

Thanks,
Fengguang


* ftrace: concurrent accesses possible?
  2009-05-08 11:47     ` Ingo Molnar
@ 2009-05-08 12:58       ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 12:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

Hello,

On Fri, May 08, 2009 at 07:47:42PM +0800, Ingo Molnar wrote:
> 
> So this should be done in cooperation with instrumentation folks, 
> while improving _all_ of Linux instrumentation in general. Or, if 
> you dont have the time/interest to work with us on that, it should 
> not be done at all. Not having the resources/interest to do 
> something properly is not a license to introduce further 
> instrumentation crap into Linux.

I have a dummy question on /debug/trace: is it possible to
- use 2+ tracers concurrently?
- run a system script that makes use of a tracer,
  without disturbing the sysadmin's tracer activities?
- access 1 tracer concurrently from many threads,
  with different filter etc. options?

If not currently, will private mounts be a viable solution?

Thanks,
Fengguang


* Re: ftrace: concurrent accesses possible?
  2009-05-08 12:58       ` Wu Fengguang
@ 2009-05-08 13:17         ` Steven Rostedt
  -1 siblings, 0 replies; 92+ messages in thread
From: Steven Rostedt @ 2009-05-08 13:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Ingo Molnar, Frédéric Weisbecker, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


On Fri, 8 May 2009, Wu Fengguang wrote:

> Hello,
> 
> On Fri, May 08, 2009 at 07:47:42PM +0800, Ingo Molnar wrote:
> > 
> > So this should be done in cooperation with instrumentation folks, 
> > while improving _all_ of Linux instrumentation in general. Or, if 
> > you dont have the time/interest to work with us on that, it should 
> > not be done at all. Not having the resources/interest to do 
> > something properly is not a license to introduce further 
> > instrumentation crap into Linux.
> 
> I have a dummy question on /debug/trace: is it possible to
> - use 2+ tracers concurrently?

Two plugins? no.

Two types of tracing? yes.

The "current_tracer" is for specific tracing purposes like latency 
tracing, function tracing and graph tracing. There are others, but they 
are more "themes" than tracers. The latency tracing only shows a "max 
latency" and does not show current traces unless they hit the max 
threshold. The function graph tracer has a different output format that 
has indentation based on the depth of the traced functions.

But with tracing events, we can pick and choose any event and trace them 
all together. You can filter them as well. For new events in the kernel, 
we only add them via trace events. These events show up in the plugin 
tracers too.

> - run a system script that makes use of a tracer,

Sure

>   without disturbing the sysadmin's tracer activities?

Hmm, you mean have individual tracers tracing different things. We sorta 
do that now, but they are more custom. That is, you can have the stack 
tracer running (recording max stack of the kernel) and run other tracers 
as well, without noticing.  But those that write to the ring buffer only 
write to a single ring buffer. If another trace facility created its own 
ring buffer, then you could have more than one ring buffer being used. But 
ftrace currently uses only one (this is not exactly true, because the 
latency tracers have a separate ring buffer to store the max).

> - access 1 tracer concurrently from many threads,

More than one reader can happen, but inside the kernel, they are 
serialized. When reading from the trace_pipe (consumer mode), every read 
will produce a different output, because the previous read was "consumed". 
If two threads try to read this way at the same time, they will each get a 
different result.

>   with different filter etc. options?

Not sure what you mean here. If you want two threads filtering differently, 
that should be done in userspace.

-- Steve

> 
> If not currently, will private mounts be a viable solution?



* Re: ftrace: concurrent accesses possible?
  2009-05-08 13:17         ` Steven Rostedt
@ 2009-05-08 13:43           ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-08 13:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Frédéric Weisbecker, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Fri, May 08, 2009 at 09:17:04PM +0800, Steven Rostedt wrote:
> 
> On Fri, 8 May 2009, Wu Fengguang wrote:
> 
> > Hello,
> > 
> > On Fri, May 08, 2009 at 07:47:42PM +0800, Ingo Molnar wrote:
> > > 
> > > So this should be done in cooperation with instrumentation folks, 
> > > while improving _all_ of Linux instrumentation in general. Or, if 
> > > you dont have the time/interest to work with us on that, it should 
> > > not be done at all. Not having the resources/interest to do 
> > > something properly is not a license to introduce further 
> > > instrumentation crap into Linux.
> > 
> > I have a dummy question on /debug/trace: is it possible to
> > - use 2+ tracers concurrently?
> 
> Two plugins? no.
> 
> Two types of tracing? yes.
> 
> The "current_tracer" is for specific tracing purposes like latency 
> tracing, function tracing and graph tracing. There are others, but they 
> are more "themes" than tracers. The latency tracing only shows a "max 
> latency" and does not show current traces unless they hit the max 
> threshold. The function graph tracer has a different output format that 
> has indentation based on the depth of the traced functions.
> 
> But with tracing events, we can pick and choose any event and trace them 
> all together. You can filter them as well. For new events in the kernel, 
> we only add them via trace events. These events show up in the plugin 
> tracers too.

OK. Thanks for explaining!

> > - run a system script that makes use of a tracer,
> 
> Sure
> 
> >   without disturbing the sysadmin's tracer activities?
> 
> Hmm, you mean have individual tracers tracing different things. We sorta 

Right. Plus two 'instances' of the same tracer run with different options.

> do that now, but they are more custom. That is, you can have the stack 
> tracer running (recording max stack of the kernel) and run other tracers 
> as well, without noticing.  But those that write to the ring buffer only 
> write to a single ring buffer. If another trace facility created its own 
> ring buffer, then you could have more than one ring buffer being used. But 
> ftrace currently uses only one (this is not exactly true, because the 
> latency tracers have a separate ring buffer to store the max).

That's OK.

> > - access 1 tracer concurrently from many threads,
> 
> More than one reader can happen, but inside the kernel, they are 
> serialized. When reading from the trace_pipe (consumer mode), every read 
> will produce a different output, because the previous read was "consumed". 
> If two threads try to read this way at the same time, they will each get a 
> different result.
> 
> >   with different filter etc. options?
> 
> Not sure what you mean here. If you want two threads filtering differently, 
> that should be done in userspace.

It's about efficiency.  Here is a use case: one has N CPUs and wants
to create N threads to query N different segments of the total memory
via kpageflags. This ability is important for a large-memory system.

Thanks,
Fengguang



* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-08 11:47     ` Ingo Molnar
@ 2009-05-08 20:24       ` Andrew Morton
  -1 siblings, 0 replies; 92+ messages in thread
From: Andrew Morton @ 2009-05-08 20:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: fengguang.wu, fweisbec, rostedt, a.p.zijlstra, lizf,
	linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm

On Fri, 8 May 2009 13:47:42 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Export all page flags faithfully in /proc/kpageflags.
> 
> Ongoing objection and NAK against extended haphazard exporting of 
> kernel internals via an ad-hoc ABI via ad-hoc, privatized 
> instrumentation that only helps the MM code and nothing else.

You're a year too late.  The pagemap interface is useful.

> /proc/kpageflags should be done via the proper methods outlined in 
> the previous mails i wrote on this topic: for example by using the 
> 'object collections' abstraction i suggested.

What's that?

> So this should be done in cooperation with instrumentation folks, 

Feel free to start cooperating.

> while improving _all_ of Linux instrumentation in general. Or, if 
> you dont have the time/interest to work with us on that, it should 
> not be done at all. Not having the resources/interest to do 
> something properly is not a license to introduce further 
> instrumentation crap into Linux.

If and when whatever-this-stuff-is is available, and if it turns out to be
usable, then someone can take on the task of migrating the existing
pagemap implementation over to use the new machinery while preserving
the existing userspace interfaces.

But we shouldn't block improvements to an existing feature because
someone might change the way that feature is implemented some time in
the future.


* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-08 12:44       ` Wu Fengguang
@ 2009-05-09  5:59         ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-09  5:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> > /proc/kpageflags should be done via the proper methods outlined 
> > in the previous mails i wrote on this topic: for example by 
> > using the 'object collections' abstraction i suggested. Clean 
> > enumeration of all pages (files, tasks, etc.) and the definition 
> > of histograms over it via free-form filter expressions is the 
> > right way to do this. It would not only help other subsystems, 
> > it would also be far more capable.
> 
> For the new interfaces (files etc.) I'd very much like to use the ftrace 
> interface. For the existing pagemap interfaces, if they can 
> fulfill their targeted tasks, why bother making the shift?

Because they were a mistake to be merged? Because having them 
fragments and thus weakens Linux instrumentation in general? 
Because, somewhat hypocritically, other MM instrumentation patches 
are being rejected under the pretense that they "do not matter" - 
while instrumentation that provably _does_ matter (yours) is added 
outside the existing instrumentation frameworks?

> When the pagemap interfaces cannot satisfy some new applications, 
> and ftrace can provide a superset of the pagemap interfaces with 
> clear advantages while meeting the new demands, then we can 
> schedule the tear-down of the old interface?

Yes. But meanwhile don't extend it ... otherwise this bad cycle will 
never end. "Oh, we just added this to /proc/kpageflags too, why 
should we go through the trouble of using the generic framework?"

Do you see my position?

	Ingo


* [patch] tracing/mm: add page frame snapshot trace
  2009-05-08 12:44       ` Wu Fengguang
@ 2009-05-09  6:27         ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-09  6:27 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> > So this should be done in cooperation with instrumentation 
> > folks, while improving _all_ of Linux instrumentation in 
> general. Or, if you don't have the time/interest to work with us 
> > on that, it should not be done at all. Not having the 
> > resources/interest to do something properly is not a license to 
> > introduce further instrumentation crap into Linux.
> 
> I'd be glad to work with you on the 'object collections' ftrace 
> interfaces.  Maybe next month. For now my time has been allocated 
> for the hwpoison work, sorry!

No problem - our offer still stands: we are glad to help out with 
the instrumentation side bits. We'll even write all the patches for 
you, just please help us out with making it maximally useful to 
_you_ :-)

Find below a first prototype patch written by Steve yesterday and 
tidied up a bit by me today. It can also be tried on latest -tip:

  http://people.redhat.com/mingo/tip.git/README

This patch adds the first version of the 'object collections' 
instrumentation facility under /debug/tracing/objects/mm/. It has a 
single control so far, a 'number of pages to dump' trigger file:

To dump 1000 pages to the trace buffers, do:

  echo 1000 > /debug/tracing/objects/mm/pages/trigger

To dump all pages to the trace buffers, do:

  echo -1 > /debug/tracing/objects/mm/pages/trigger

Preliminary timings on an older, 1GB RAM 2 GHz Athlon64 box show 
that it's plenty fast:

 # time echo -1 > /debug/tracing/objects/mm/pages/trigger

  real	0m0.127s
  user	0m0.000s
  sys	0m0.126s

 # time cat /debug/tracing/per_cpu/*/trace_pipe_raw > /tmp/page-trace.bin

  real	0m0.065s
  user	0m0.001s
  sys	0m0.064s

  # ls -l /tmp/page-trace.bin
  -rw-r--r-- 1 root root 13774848 2009-05-09 11:46 /tmp/page-trace.bin

127 millisecs to collect, 65 milliseconds to dump. (And that's not 
using splice() to dump the trace data.)

The current (very preliminary) record format is:

  # cat /debug/tracing/events/mm/dump_pages/format 
  name: dump_pages
  ID: 40
  format:
	field:unsigned short common_type;	offset:0;	size:2;
	field:unsigned char common_flags;	offset:2;	size:1;
	field:unsigned char common_preempt_count;	offset:3;	size:1;
	field:int common_pid;	offset:4;	size:4;
	field:int common_tgid;	offset:8;	size:4;

	field:unsigned long pfn;	offset:16;	size:8;
	field:unsigned long flags;	offset:24;	size:8;
	field:unsigned long index;	offset:32;	size:8;
	field:unsigned int count;	offset:40;	size:4;
	field:unsigned int mapcount;	offset:44;	size:4;

  print fmt: "pfn=%lu flags=%lx count=%u mapcount=%u index=%lu", 
  REC->pfn, REC->flags, REC->count, REC->mapcount, REC->index

Note: the page->flags value should probably be converted into more 
independent values i suspect, like get_uflags() is - the raw 
page->flags is too compressed and inter-dependent on other 
properties of struct page to be directly usable.

Also, buffer size has to be large enough to hold the dump. To hold 
one million entries (4GB of RAM), this should be enough:

  echo 60000 > /debug/tracing/buffer_size_kb

Once we add synchronization between producer and consumer, pretty 
much any buffer size will suffice.

The trace records are unique so user-space can filter out the dump 
and only the dump - even if there are other trace events in the 
buffer.

TODO:

 - add smarter flags output - à la your get_uflags().

 - add synchronization between trace producer and trace consumer

 - port user-space bits to this facility: Documentation/vm/page-types.c

What do you think about this patch? We could also further reduce the 
patch/plugin size by factoring out some of this code into generic 
tracing code. This will be best done when we add the 'tasks' object 
collection to dump a tasks snapshot to the trace buffer.

	Ingo

---------------------------->
>From dcac8cdac1d41af0336d8ed17c2cb898ba8a791f Mon Sep 17 00:00:00 2001
From: Steven Rostedt <srostedt@redhat.com>
Date: Fri, 8 May 2009 16:44:15 -0400
Subject: [PATCH] tracing/mm: add page frame snapshot trace

This is a prototype to dump out a snapshot of the page tables to the
tracing buffer. Currently it is very primitive, and just writes out
the events. There is no synchronization to avoid losing events,
so /debug/tracing/buffer_size_kb has to be large enough for all
events to fit.

We will do something about synchronization later. That is, have a way
to read the buffer through the tracing/object/mm/page/X file and have
the two in sync.

But this is just a prototype to get the ball rolling.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/trace/events/mm.h |   48 +++++++++++++
 kernel/trace/Makefile     |    1 +
 kernel/trace/trace_mm.c   |  172 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 221 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/mm.h b/include/trace/events/mm.h
new file mode 100644
index 0000000..f5a1668
--- /dev/null
+++ b/include/trace/events/mm.h
@@ -0,0 +1,48 @@
+#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MM_H
+
+#include <linux/tracepoint.h>
+#include <linux/mm.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mm
+
+/**
+ * dump_pages - called by the trace page dump trigger
+ * @pfn: page frame number
+ * @page: pointer to the page frame
+ *
+ * This is a helper trace point into the dumping of the page frames.
+ * It will record various information about a page frame.
+ */
+TRACE_EVENT(dump_pages,
+
+	TP_PROTO(unsigned long pfn, struct page *page),
+
+	TP_ARGS(pfn, page),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	pfn		)
+		__field(	unsigned long,	flags		)
+		__field(	unsigned long,	index		)
+		__field(	unsigned int,	count		)
+		__field(	unsigned int,	mapcount	)
+	),
+
+	TP_fast_assign(
+		__entry->pfn		= pfn;
+		__entry->flags		= page->flags;
+		__entry->count		= atomic_read(&page->_count);
+		__entry->mapcount	= atomic_read(&page->_mapcount);
+		__entry->index		= page->index;
+	),
+
+	TP_printk("pfn=%lu flags=%lx count=%u mapcount=%u index=%lu",
+		  __entry->pfn, __entry->flags, __entry->count,
+		  __entry->mapcount, __entry->index)
+);
+
+#endif /*  _TRACE_MM_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 06b8585..848e5ce 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -51,5 +51,6 @@ obj-$(CONFIG_EVENT_TRACING) += trace_export.o
 obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
 obj-$(CONFIG_EVENT_PROFILE) += trace_event_profile.o
 obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
+obj-$(CONFIG_EVENT_TRACING) += trace_mm.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace_mm.c b/kernel/trace/trace_mm.c
new file mode 100644
index 0000000..87123ed
--- /dev/null
+++ b/kernel/trace/trace_mm.c
@@ -0,0 +1,172 @@
+/*
+ * Trace mm pages
+ *
+ * Copyright (C) 2009 Red Hat Inc, Steven Rostedt <srostedt@redhat.com>
+ *
+ * Code based on Matt Mackall's /proc/[kpagecount|kpageflags] code.
+ */
+#include <linux/module.h>
+#include <linux/bootmem.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+
+#include "trace_output.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/mm.h>
+
+void trace_read_page_frames(unsigned long start, unsigned long end,
+			    void (*trace)(unsigned long pfn, struct page *page))
+{
+	unsigned long pfn = start;
+	struct page *page;
+
+	if (start > max_pfn - 1)
+		return;
+
+	if (end > max_pfn - 1)
+		end = max_pfn - 1;
+
+	while (pfn < end) {
+		page = NULL;
+		if (pfn_valid(pfn))
+			page = pfn_to_page(pfn);
+		if (page)
+			trace(pfn, page);
+		pfn++;
+	}
+}
+
+static void trace_do_dump_pages(unsigned long pfn, struct page *page)
+{
+	trace_dump_pages(pfn, page);
+}
+
+static ssize_t
+trace_mm_trigger_read(struct file *filp, char __user *ubuf, size_t cnt,
+		 loff_t *ppos)
+{
+	return simple_read_from_buffer(ubuf, cnt, ppos, "0\n", 2);
+}
+
+
+static ssize_t
+trace_mm_trigger_write(struct file *filp, const char __user *ubuf, size_t cnt,
+		       loff_t *ppos)
+{
+	long val; unsigned long start, end;
+	char buf[64];
+	int ret;
+
+	if (cnt >= sizeof(buf))
+		return -EINVAL;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+
+	if (tracing_update_buffers() < 0)
+		return -ENOMEM;
+
+	if (trace_set_clr_event("mm", "dump_pages", 1))
+		return -EINVAL;
+
+	buf[cnt] = 0;
+
+	ret = strict_strtol(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+
+	start = *ppos;
+	if (val < 0)
+		end = max_pfn - 1;
+	else
+		end = start + val;
+
+	trace_read_page_frames(start, end, trace_do_dump_pages);
+
+	*ppos += cnt;
+
+	return cnt;
+}
+
+static const struct file_operations trace_mm_fops = {
+	.open		= tracing_open_generic,
+	.read		= trace_mm_trigger_read,
+	.write		= trace_mm_trigger_write,
+};
+
+/* move this into trace_objects.c when that file is created */
+static struct dentry *trace_objects_dir(void)
+{
+	static struct dentry *d_objects;
+	struct dentry *d_tracer;
+
+	if (d_objects)
+		return d_objects;
+
+	d_tracer = tracing_init_dentry();
+	if (!d_tracer)
+		return NULL;
+
+	d_objects = debugfs_create_dir("objects", d_tracer);
+	if (!d_objects)
+		pr_warning("Could not create debugfs "
+			   "'objects' directory\n");
+
+	return d_objects;
+}
+
+
+static struct dentry *trace_objects_mm_dir(void)
+{
+	static struct dentry *d_mm;
+	struct dentry *d_objects;
+
+	if (d_mm)
+		return d_mm;
+
+	d_objects = trace_objects_dir();
+	if (!d_objects)
+		return NULL;
+
+	d_mm = debugfs_create_dir("mm", d_objects);
+	if (!d_mm)
+		pr_warning("Could not create 'objects/mm' directory\n");
+
+	return d_mm;
+}
+
+static struct dentry *trace_objects_mm_pages_dir(void)
+{
+	static struct dentry *d_pages;
+	struct dentry *d_mm;
+
+	if (d_pages)
+		return d_pages;
+
+	d_mm = trace_objects_mm_dir();
+	if (!d_mm)
+		return NULL;
+
+	d_pages = debugfs_create_dir("pages", d_mm);
+	if (!d_pages)
+		pr_warning("Could not create debugfs "
+			   "'objects/mm/pages' directory\n");
+
+	return d_pages;
+}
+
+static __init int trace_objects_mm_init(void)
+{
+	struct dentry *d_pages;
+
+	d_pages = trace_objects_mm_pages_dir();
+	if (!d_pages)
+		return 0;
+
+	trace_create_file("trigger", 0600, d_pages, NULL,
+			  &trace_mm_fops);
+
+	return 0;
+}
+fs_initcall(trace_objects_mm_init);

^ permalink raw reply related	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-09  5:59         ` Ingo Molnar
@ 2009-05-09  7:56           ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-09  7:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Sat, May 09, 2009 at 01:59:14PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > /proc/kpageflags should be done via the proper methods outlined 
> > > in the previous mails i wrote on this topic: for example by 
> > > using the 'object collections' abstraction i suggested. Clean 
> > > enumeration of all pages (files, tasks, etc.) and the definition 
> > > of histograms over it via free-form filter expressions is the 
> > > right way to do this. It would not only help other subsystems, 
> > > it would also be far more capable.
> > 
> > For the new interfaces (files etc.) I'd very much like to use the 
> > ftrace interface. For the existing pagemap interfaces, if they can 
> > fulfill their targeted tasks, why bother making the shift?
> 
> Because they were a mistake to be merged? Because having them 
> fragments and thus weakens Linux instrumentation in general? 
> Because, somewhat hypocritically, other MM instrumentation patches 
> are being rejected under the pretense that they "do not matter" - 
> while instrumentation that provably _does_ matter (yours) is added 
> outside the existing instrumentation frameworks?
> 
> > When the pagemap interfaces cannot satisfy some new applications, 
> > and ftrace can provide a superset of the pagemap interfaces and 
> > shows clear advantages while meeting the new demands, then we can 
> > schedule tearing down of the old interface?
> 
> Yes. But meanwhile don't extend it ... otherwise this bad cycle will 
> never end. "Oh, we just added this to /proc/kpageflags too, why 
> should we go through the trouble of using the generic framework?"
> 
> Do you see my position?

Yes I can understand the merits of conforming to a generic framework.
But that alone is not enough. If you at the same time demonstrate some
clear technical advantages(flexibility, speed, simplicity etc.), then
it would be great.  (Let me work out some expectations for ftrace..)

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] pagemap: document 9 more exported page flags
  2009-05-08 10:53   ` Wu Fengguang
@ 2009-05-09  8:13     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 92+ messages in thread
From: KOSAKI Motohiro @ 2009-05-09  8:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Andrew Morton, LKML, Matt Mackall, Andi Kleen, linux-mm

> Also add short descriptions for all of the 20 exported page flags.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  Documentation/vm/pagemap.txt |   62 +++++++++++++++++++++++++++++++++
>  1 file changed, 62 insertions(+)
> 
> --- linux.orig/Documentation/vm/pagemap.txt
> +++ linux/Documentation/vm/pagemap.txt
> @@ -49,6 +49,68 @@ There are three components to pagemap:
>       8. WRITEBACK
>       9. RECLAIM
>      10. BUDDY
> +    11. MMAP
> +    12. ANON
> +    13. SWAPCACHE
> +    14. SWAPBACKED
> +    15. COMPOUND_HEAD
> +    16. COMPOUND_TAIL
> +    16. HUGE

nit. 16 appear twice.



> +    18. UNEVICTABLE
> +    20. NOPAGE



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] pagemap: document 9 more exported page flags
  2009-05-09  8:13     ` KOSAKI Motohiro
@ 2009-05-09  8:18       ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-09  8:18 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Andrew Morton, LKML, Matt Mackall, Andi Kleen, linux-mm

On Sat, May 09, 2009 at 04:13:40PM +0800, KOSAKI Motohiro wrote:
> > Also add short descriptions for all of the 20 exported page flags.
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  Documentation/vm/pagemap.txt |   62 +++++++++++++++++++++++++++++++++
> >  1 file changed, 62 insertions(+)
> > 
> > --- linux.orig/Documentation/vm/pagemap.txt
> > +++ linux/Documentation/vm/pagemap.txt
> > @@ -49,6 +49,68 @@ There are three components to pagemap:
> >       8. WRITEBACK
> >       9. RECLAIM
> >      10. BUDDY
> > +    11. MMAP
> > +    12. ANON
> > +    13. SWAPCACHE
> > +    14. SWAPBACKED
> > +    15. COMPOUND_HEAD
> > +    16. COMPOUND_TAIL
> > +    16. HUGE
> 
> nit. 16 appear twice.

Good catch!

Andrew, this fix can be folded into the last patch.
---
pagemap: fix HUGE numbering

Thanks to KOSAKI Motohiro for catching this.

cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/pagemap.txt |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux.orig/Documentation/vm/pagemap.txt
+++ linux/Documentation/vm/pagemap.txt
@@ -55,7 +55,7 @@ There are three components to pagemap:
     14. SWAPBACKED
     15. COMPOUND_HEAD
     16. COMPOUND_TAIL
-    16. HUGE
+    17. HUGE
     18. UNEVICTABLE
     19. HWPOISON
     20. NOPAGE

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09  6:27         ` Ingo Molnar
@ 2009-05-09  9:13           ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-09  9:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

Hi Ingo,

On Sat, May 09, 2009 at 02:27:58PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > So this should be done in cooperation with instrumentation 
> > > folks, while improving _all_ of Linux instrumentation in 
> > > general. Or, if you don't have the time/interest to work with us 
> > > on that, it should not be done at all. Not having the 
> > > resources/interest to do something properly is not a license to 
> > > introduce further instrumentation crap into Linux.
> > 
> > I'd be glad to work with you on the 'object collections' ftrace 
> > interfaces.  Maybe next month. For now my time has been allocated 
> > for the hwpoison work, sorry!
> 
> No problem - our offer still stands: we are glad to help out with 
> the instrumentation side bits. We'll even write all the patches for 
> you, just please help us out with making it maximally useful to 
> _you_ :-)

Thank you very much!

The good news is that 2/3 of the code and experience can be reused.

> Find below a first prototype patch written by Steve yesterday and 
> tidied up a bit by me today. It can also be tried on latest -tip:
> 
>   http://people.redhat.com/mingo/tip.git/README
> 
> This patch adds the first version of the 'object collections' 
> instrumentation facility under /debug/tracing/objects/mm/. It has a 
> single control so far, a 'number of pages to dump' trigger file:
> 
> To dump 1000 pages to the trace buffers, do:
> 
>   echo 1000 > /debug/tracing/objects/mm/pages/trigger
> 
> To dump all pages to the trace buffers, do:
> 
>   echo -1 > /debug/tracing/objects/mm/pages/trigger

That is not too intuitive, I'm afraid.

> Preliminary timings on an older, 1GB RAM 2 GHz Athlon64 box show 
> that it's plenty fast:
> 
>  # time echo -1 > /debug/tracing/objects/mm/pages/trigger
> 
>   real	0m0.127s
>   user	0m0.000s
>   sys	0m0.126s
> 
>  # time cat /debug/tracing/per_cpu/*/trace_pipe_raw > /tmp/page-trace.bin
> 
>   real	0m0.065s
>   user	0m0.001s
>   sys	0m0.064s
> 
>   # ls -l /tmp/page-trace.bin
>   -rw-r--r-- 1 root root 13774848 2009-05-09 11:46 /tmp/page-trace.bin
> 
> 127 millisecs to collect, 65 milliseconds to dump. (And that's not 
> using splice() to dump the trace data.)

That's pretty fast and on par with kpageflags!

> The current (very preliminary) record format is:
> 
>   # cat /debug/tracing/events/mm/dump_pages/format 
>   name: dump_pages
>   ID: 40
>   format:
> 	field:unsigned short common_type;	offset:0;	size:2;
> 	field:unsigned char common_flags;	offset:2;	size:1;
> 	field:unsigned char common_preempt_count;	offset:3;	size:1;
> 	field:int common_pid;	offset:4;	size:4;
> 	field:int common_tgid;	offset:8;	size:4;
> 
> 	field:unsigned long pfn;	offset:16;	size:8;
> 	field:unsigned long flags;	offset:24;	size:8;
> 	field:unsigned long index;	offset:32;	size:8;
> 	field:unsigned int count;	offset:40;	size:4;
> 	field:unsigned int mapcount;	offset:44;	size:4;
> 
>   print fmt: "pfn=%lu flags=%lx count=%u mapcount=%u index=%lu", 
>   REC->pfn, REC->flags, REC->count, REC->mapcount, REC->index
> 
> Note: the page->flags value should probably be converted into more 
> independent values i suspect, like get_uflags() is - the raw 
> page->flags is too compressed and inter-dependent on other 
> properties of struct page to be directly usable.

Agreed.

> Also, buffer size has to be large enough to hold the dump. To hold 
> one million entries (4GB of RAM), this should be enough:
> 
>   echo 60000 > /debug/tracing/buffer_size_kb
> 
> Once we add synchronization between producer and consumer, pretty 
> much any buffer size will suffice.

That would be good.

> The trace records are unique so user-space can filter out the dump 
> and only the dump - even if there are other trace events in the 
> buffer.

OK.

> TODO:
> 
>  - add smarter flags output - à la your get_uflags().

That's 100% code reuse :-)

>  - add synchronization between trace producer and trace consumer
> 
>  - port user-space bits to this facility: Documentation/vm/page-types.c

page-types' kernel ABI code is small, so it would be trivial to port.

> What do you think about this patch? We could also further reduce the 
> patch/plugin size by factoring out some of this code into generic 
> tracing code. This will be best done when we add the 'tasks' object 
> collection to dump a tasks snapshot to the trace buffer.

To be frank, the code size is a bit larger than kpageflags', and
(to a newbie) the ftrace interface is not as straightforward as the
traditional read().

But that's acceptable, as long as it allows more powerful object
dumping. I'll attempt to list two fundamental requirements:

1) support multiple object iteration paths
   For example, the pages can be iterated by
   - pfn
   - process virtual address
   - inode address space
   - swap space?

2) support concurrent object iterations
   For example, a huge 1TB memory space can be split up into 10
   segments which can be queried concurrently (with different options).

(1) provides great flexibility and a clear advantage over the existing
interface; (2) provides performance equal to the existing interface.

Are they at least possible?

Thanks,
Fengguang

> 	Ingo
> 
> ---------------------------->
> >From dcac8cdac1d41af0336d8ed17c2cb898ba8a791f Mon Sep 17 00:00:00 2001
> From: Steven Rostedt <srostedt@redhat.com>
> Date: Fri, 8 May 2009 16:44:15 -0400
> Subject: [PATCH] tracing/mm: add page frame snapshot trace
> 
> This is a prototype to dump out a snapshot of the page tables to the
> tracing buffer. Currently it is very primitive, and just writes out
> the events. There is no synchronization to avoid losing events,
> so /debug/tracing/buffer_size_kb has to be large enough for all
> events to fit.
> 
> We will do something about synchronization later. That is, have a way
> to read the buffer through the tracing/object/mm/page/X file and have
> the two in sync.
> 
> But this is just a prototype to get the ball rolling.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  include/trace/events/mm.h |   48 +++++++++++++
>  kernel/trace/Makefile     |    1 +
>  kernel/trace/trace_mm.c   |  172 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 221 insertions(+), 0 deletions(-)
> 
> diff --git a/include/trace/events/mm.h b/include/trace/events/mm.h
> new file mode 100644
> index 0000000..f5a1668
> --- /dev/null
> +++ b/include/trace/events/mm.h
> @@ -0,0 +1,48 @@
> +#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_MM_H
> +
> +#include <linux/tracepoint.h>
> +#include <linux/mm.h>
> +
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM mm
> +
> +/**
> + * dump_pages - called by the trace page dump trigger
> + * @pfn: page frame number
> + * @page: pointer to the page frame
> + *
> + * This is a helper trace point into the dumping of the page frames.
> + * It will record various information about a page frame.
> + */
> +TRACE_EVENT(dump_pages,
> +
> +	TP_PROTO(unsigned long pfn, struct page *page),
> +
> +	TP_ARGS(pfn, page),
> +
> +	TP_STRUCT__entry(
> +		__field(	unsigned long,	pfn		)
> +		__field(	unsigned long,	flags		)
> +		__field(	unsigned long,	index		)
> +		__field(	unsigned int,	count		)
> +		__field(	unsigned int,	mapcount	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->pfn		= pfn;
> +		__entry->flags		= page->flags;
> +		__entry->count		= atomic_read(&page->_count);
> +		__entry->mapcount	= atomic_read(&page->_mapcount);
> +		__entry->index		= page->index;
> +	),
> +
> +	TP_printk("pfn=%lu flags=%lx count=%u mapcount=%u index=%lu",
> +		  __entry->pfn, __entry->flags, __entry->count,
> +		  __entry->mapcount, __entry->index)
> +);
> +
> +#endif /*  _TRACE_MM_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index 06b8585..848e5ce 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -51,5 +51,6 @@ obj-$(CONFIG_EVENT_TRACING) += trace_export.o
>  obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
>  obj-$(CONFIG_EVENT_PROFILE) += trace_event_profile.o
>  obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
> +obj-$(CONFIG_EVENT_TRACING) += trace_mm.o
>  
>  libftrace-y := ftrace.o
> diff --git a/kernel/trace/trace_mm.c b/kernel/trace/trace_mm.c
> new file mode 100644
> index 0000000..87123ed
> --- /dev/null
> +++ b/kernel/trace/trace_mm.c
> @@ -0,0 +1,172 @@
> +/*
> + * Trace mm pages
> + *
> + * Copyright (C) 2009 Red Hat Inc, Steven Rostedt <srostedt@redhat.com>
> + *
> + * Code based on Matt Mackall's /proc/[kpagecount|kpageflags] code.
> + */
> +#include <linux/module.h>
> +#include <linux/bootmem.h>
> +#include <linux/debugfs.h>
> +#include <linux/uaccess.h>
> +
> +#include "trace_output.h"
> +
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/mm.h>
> +
> +void trace_read_page_frames(unsigned long start, unsigned long end,
> +			    void (*trace)(unsigned long pfn, struct page *page))
> +{
> +	unsigned long pfn = start;
> +	struct page *page;
> +
> +	if (start > max_pfn - 1)
> +		return;
> +
> +	if (end > max_pfn - 1)
> +		end = max_pfn - 1;
> +
> +	while (pfn < end) {
> +		page = NULL;
> +		if (pfn_valid(pfn))
> +			page = pfn_to_page(pfn);
> +		if (page)
> +			trace(pfn, page);
> +		pfn++;
> +	}
> +}
> +
> +static void trace_do_dump_pages(unsigned long pfn, struct page *page)
> +{
> +	trace_dump_pages(pfn, page);
> +}
> +
> +static ssize_t
> +trace_mm_trigger_read(struct file *filp, char __user *ubuf, size_t cnt,
> +		 loff_t *ppos)
> +{
> +	return simple_read_from_buffer(ubuf, cnt, ppos, "0\n", 2);
> +}
> +
> +
> +static ssize_t
> +trace_mm_trigger_write(struct file *filp, const char __user *ubuf, size_t cnt,
> +		       loff_t *ppos)
> +{
> +	long val, start, end;
> +	char buf[64];
> +	int ret;
> +
> +	if (cnt >= sizeof(buf))
> +		return -EINVAL;
> +
> +	if (copy_from_user(&buf, ubuf, cnt))
> +		return -EFAULT;
> +
> +	if (tracing_update_buffers() < 0)
> +		return -ENOMEM;
> +
> +	if (trace_set_clr_event("mm", "dump_pages", 1))
> +		return -EINVAL;
> +
> +	buf[cnt] = 0;
> +
> +	ret = strict_strtol(buf, 10, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	start = *ppos;
> +	if (val < 0)
> +		end = max_pfn - 1;
> +	else
> +		end = start + val;
> +
> +	trace_read_page_frames(start, end, trace_do_dump_pages);
> +
> +	*ppos += cnt;
> +
> +	return cnt;
> +}
> +
> +static const struct file_operations trace_mm_fops = {
> +	.open		= tracing_open_generic,
> +	.read		= trace_mm_trigger_read,
> +	.write		= trace_mm_trigger_write,
> +};
> +
> +/* move this into trace_objects.c when that file is created */
> +static struct dentry *trace_objects_dir(void)
> +{
> +	static struct dentry *d_objects;
> +	struct dentry *d_tracer;
> +
> +	if (d_objects)
> +		return d_objects;
> +
> +	d_tracer = tracing_init_dentry();
> +	if (!d_tracer)
> +		return NULL;
> +
> +	d_objects = debugfs_create_dir("objects", d_tracer);
> +	if (!d_objects)
> +		pr_warning("Could not create debugfs "
> +			   "'objects' directory\n");
> +
> +	return d_objects;
> +}
> +
> +
> +static struct dentry *trace_objects_mm_dir(void)
> +{
> +	static struct dentry *d_mm;
> +	struct dentry *d_objects;
> +
> +	if (d_mm)
> +		return d_mm;
> +
> +	d_objects = trace_objects_dir();
> +	if (!d_objects)
> +		return NULL;
> +
> +	d_mm = debugfs_create_dir("mm", d_objects);
> +	if (!d_mm)
> +		pr_warning("Could not create 'objects/mm' directory\n");
> +
> +	return d_mm;
> +}
> +
> +static struct dentry *trace_objects_mm_pages_dir(void)
> +{
> +	static struct dentry *d_pages;
> +	struct dentry *d_mm;
> +
> +	if (d_pages)
> +		return d_pages;
> +
> +	d_mm = trace_objects_mm_dir();
> +	if (!d_mm)
> +		return NULL;
> +
> +	d_pages = debugfs_create_dir("pages", d_mm);
> +	if (!d_pages)
> +		pr_warning("Could not create debugfs "
> +			   "'objects/mm/pages' directory\n");
> +
> +	return d_pages;
> +}
> +
> +static __init int trace_objects_mm_init(void)
> +{
> +	struct dentry *d_pages;
> +
> +	d_pages = trace_objects_mm_pages_dir();
> +	if (!d_pages)
> +		return 0;
> +
> +	trace_create_file("trigger", 0600, d_pages, NULL,
> +			  &trace_mm_fops);
> +
> +	return 0;
> +}
> +fs_initcall(trace_objects_mm_init);

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09  9:13           ` Wu Fengguang
@ 2009-05-09  9:24             ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-09  9:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> Hi Ingo,
> 
> On Sat, May 09, 2009 at 02:27:58PM +0800, Ingo Molnar wrote:
> > 
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > > So this should be done in cooperation with instrumentation 
> > > > folks, while improving _all_ of Linux instrumentation in 
> > > > general. Or, if you dont have the time/interest to work with us 
> > > > on that, it should not be done at all. Not having the 
> > > > resources/interest to do something properly is not a license to 
> > > > introduce further instrumentation crap into Linux.
> > > 
> > > I'd be glad to work with you on the 'object collections' ftrace 
> > > interfaces.  Maybe next month. For now my time have been allocated 
> > > for the hwpoison work, sorry!
> > 
> > No problem - our offer still stands: we are glad to help out with 
> > the instrumentation side bits. We'll even write all the patches for 
> > you, just please help us out with making it maximally useful to 
> > _you_ :-)
> 
> Thank you very much!
> 
> The good fact is, 2/3 of the code and experiences can be reused.
> 
> > Find below a first prototype patch written by Steve yesterday and 
> > tidied up a bit by me today. It can also be tried on latest -tip:
> > 
> >   http://people.redhat.com/mingo/tip.git/README
> > 
> > This patch adds the first version of the 'object collections' 
> > instrumentation facility under /debug/tracing/objects/mm/. It has a 
> > single control so far, a 'number of pages to dump' trigger file:
> > 
> > To dump 1000 pages to the trace buffers, do:
> > 
> >   echo 1000 > /debug/tracing/objects/mm/pages/trigger
> > 
> > To dump all pages to the trace buffers, do:
> > 
> >   echo -1 > /debug/tracing/objects/mm/pages/trigger
> 
> That is not too intuitive, I'm afraid.

This was just a first-level approximation - and it matches the usual 
"0xffffffff means infinite" idiom.

How about changing it from 'trigger' to 'dump_range':

   echo "*" > /debug/tracing/objects/mm/pages/dump_range

being a shortcut for 'dump all'?

And:

   echo "1000 2000" > /debug/tracing/objects/mm/pages/dump_range

?

The '1000' is the offset where the dumping starts, and 2000 is the 
size of the dump.

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
@ 2009-05-09  9:24             ` Ingo Molnar
  0 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-09  9:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> Hi Ingo,
> 
> On Sat, May 09, 2009 at 02:27:58PM +0800, Ingo Molnar wrote:
> > 
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > > So this should be done in cooperation with instrumentation 
> > > > folks, while improving _all_ of Linux instrumentation in 
> > > > general. Or, if you dont have the time/interest to work with us 
> > > > on that, it should not be done at all. Not having the 
> > > > resources/interest to do something properly is not a license to 
> > > > introduce further instrumentation crap into Linux.
> > > 
> > > I'd be glad to work with you on the 'object collections' ftrace 
> > > interfaces.  Maybe next month. For now my time have been allocated 
> > > for the hwpoison work, sorry!
> > 
> > No problem - our offer still stands: we are glad to help out with 
> > the instrumentation side bits. We'll even write all the patches for 
> > you, just please help us out with making it maximally useful to 
> > _you_ :-)
> 
> Thank you very much!
> 
> The good fact is, 2/3 of the code and experiences can be reused.
> 
> > Find below a first prototype patch written by Steve yesterday and 
> > tidied up a bit by me today. It can also be tried on latest -tip:
> > 
> >   http://people.redhat.com/mingo/tip.git/README
> > 
> > This patch adds the first version of the 'object collections' 
> > instrumentation facility under /debug/tracing/objects/mm/. It has a 
> > single control so far, a 'number of pages to dump' trigger file:
> > 
> > To dump 1000 pages to the trace buffers, do:
> > 
> >   echo 1000 > /debug/tracing/objects/mm/pages/trigger
> > 
> > To dump all pages to the trace buffers, do:
> > 
> >   echo -1 > /debug/tracing/objects/mm/pages/trigger
> 
> That is not too intuitive, I'm afraid.

This was just a first-level approximation - and it matches the usual 
"0xffffffff means infinite" idiom.

How about changing it from 'trigger' to 'dump_range':

   echo "*" > /debug/tracing/objects/mm/pages/dump_range

being a shortcut for 'dump all'?

And:

   echo "1000 2000" > /debug/tracing/objects/mm/pages/dump_range

?

The '1000' is the offset where the dumping starts, and 2000 is the 
size of the dump.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09  9:24             ` Ingo Molnar
@ 2009-05-09  9:43               ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-09  9:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Sat, May 09, 2009 at 05:24:31PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Hi Ingo,
> > 
> > On Sat, May 09, 2009 at 02:27:58PM +0800, Ingo Molnar wrote:
> > > 
> > > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > > So this should be done in cooperation with instrumentation 
> > > > > folks, while improving _all_ of Linux instrumentation in 
> > > > > general. Or, if you dont have the time/interest to work with us 
> > > > > on that, it should not be done at all. Not having the 
> > > > > resources/interest to do something properly is not a license to 
> > > > > introduce further instrumentation crap into Linux.
> > > > 
> > > > I'd be glad to work with you on the 'object collections' ftrace 
> > > > interfaces.  Maybe next month. For now my time have been allocated 
> > > > for the hwpoison work, sorry!
> > > 
> > > No problem - our offer still stands: we are glad to help out with 
> > > the instrumentation side bits. We'll even write all the patches for 
> > > you, just please help us out with making it maximally useful to 
> > > _you_ :-)
> > 
> > Thank you very much!
> > 
> > The good fact is, 2/3 of the code and experiences can be reused.
> > 
> > > Find below a first prototype patch written by Steve yesterday and 
> > > tidied up a bit by me today. It can also be tried on latest -tip:
> > > 
> > >   http://people.redhat.com/mingo/tip.git/README
> > > 
> > > This patch adds the first version of the 'object collections' 
> > > instrumentation facility under /debug/tracing/objects/mm/. It has a 
> > > single control so far, a 'number of pages to dump' trigger file:
> > > 
> > > To dump 1000 pages to the trace buffers, do:
> > > 
> > >   echo 1000 > /debug/tracing/objects/mm/pages/trigger
> > > 
> > > To dump all pages to the trace buffers, do:
> > > 
> > >   echo -1 > /debug/tracing/objects/mm/pages/trigger
> > 
> > That is not too intuitive, I'm afraid.
> 
> This was just a first-level approximation - and it matches the usual 
> "0xffffffff means infinite" idiom.

8^)

> How about changing it from 'trigger' to 'dump_range':

That's a better name!

>    echo "*" > /debug/tracing/objects/mm/pages/dump_range
> 
> being a shortcut for 'dump all'?

No, I'm not complaining about -1. That's even better than "*",
because the latter can easily be expanded by the shell ;)

> And:
> 
>    echo "1000 2000" > /debug/tracing/objects/mm/pages/dump_range
> 
> ?

Now it's much more intuitive!

> The '1000' is the offset where the dumping starts, and 2000 is the 
> size of the dump.

Ah, the second parameter 2000 can easily be mistaken for an "end" pfn..


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
@ 2009-05-09  9:43               ` Wu Fengguang
  0 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-09  9:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Sat, May 09, 2009 at 05:24:31PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Hi Ingo,
> > 
> > On Sat, May 09, 2009 at 02:27:58PM +0800, Ingo Molnar wrote:
> > > 
> > > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > > So this should be done in cooperation with instrumentation 
> > > > > folks, while improving _all_ of Linux instrumentation in 
> > > > > general. Or, if you dont have the time/interest to work with us 
> > > > > on that, it should not be done at all. Not having the 
> > > > > resources/interest to do something properly is not a license to 
> > > > > introduce further instrumentation crap into Linux.
> > > > 
> > > > I'd be glad to work with you on the 'object collections' ftrace 
> > > > interfaces.  Maybe next month. For now my time have been allocated 
> > > > for the hwpoison work, sorry!
> > > 
> > > No problem - our offer still stands: we are glad to help out with 
> > > the instrumentation side bits. We'll even write all the patches for 
> > > you, just please help us out with making it maximally useful to 
> > > _you_ :-)
> > 
> > Thank you very much!
> > 
> > The good fact is, 2/3 of the code and experiences can be reused.
> > 
> > > Find below a first prototype patch written by Steve yesterday and 
> > > tidied up a bit by me today. It can also be tried on latest -tip:
> > > 
> > >   http://people.redhat.com/mingo/tip.git/README
> > > 
> > > This patch adds the first version of the 'object collections' 
> > > instrumentation facility under /debug/tracing/objects/mm/. It has a 
> > > single control so far, a 'number of pages to dump' trigger file:
> > > 
> > > To dump 1000 pages to the trace buffers, do:
> > > 
> > >   echo 1000 > /debug/tracing/objects/mm/pages/trigger
> > > 
> > > To dump all pages to the trace buffers, do:
> > > 
> > >   echo -1 > /debug/tracing/objects/mm/pages/trigger
> > 
> > That is not too intuitive, I'm afraid.
> 
> This was just a first-level approximation - and it matches the usual 
> "0xffffffff means infinite" idiom.

8^)

> How about changing it from 'trigger' to 'dump_range':

That's a better name!

>    echo "*" > /debug/tracing/objects/mm/pages/dump_range
> 
> being a shortcut for 'dump all'?

No I'm not complaining about -1. That's even better than "*",
because the latter can easily be expanded by shell ;)

> And:
> 
>    echo "1000 2000" > /debug/tracing/objects/mm/pages/dump_range
> 
> ?

Now it's much more intuitive!

> The '1000' is the offset where the dumping starts, and 2000 is the 
> size of the dump.

Ah the second parameter 2000 can easily be taken as "end"..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09  9:13           ` Wu Fengguang
@ 2009-05-09 10:01             ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-09 10:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> 2) support concurrent object iterations
>    For example, a huge 1TB memory space can be split up into 10
>    segments which can be queried concurrently (with different options).

this should already be possible. If you lseek the trigger file, that 
will be understood as an 'offset' by the patch, and then write a 
(decimal) value into the file, that will be the count.

So it should already be possible to fork off nr_cpus helper threads, 
one bound to each CPU, each triggering trace output of a separate 
segment of the memory map - and each reading that CPU's 
trace_pipe_raw file to recover the data - all in parallel.
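
A rough command-level sketch of that scheme (the debugfs paths follow the
ones above, but the exact offset units of the trigger file, the use of dd
to position the write, and taskset for CPU binding are all assumptions,
not tested behavior):

```shell
#!/bin/sh
# Sketch: dump the page-frame range in parallel, one helper per CPU.
# Assumes the trigger file's seek position selects the starting frame
# and the written decimal value is the count, as described above.
T=/debug/tracing
NR_CPUS=$(getconf _NPROCESSORS_ONLN)
TOTAL=4194304                  # page frames to dump (example figure)
CHUNK=$(( TOTAL / NR_CPUS ))

for cpu in $(seq 0 $(( NR_CPUS - 1 ))); do
    (
        # seek to this CPU's segment, then write the count
        echo -n "$CHUNK" | taskset -c "$cpu" \
            dd of=$T/objects/mm/pages/trigger bs=1 \
               seek=$(( cpu * CHUNK )) conv=notrunc 2>/dev/null
        # recover that CPU's records
        cat $T/per_cpu/cpu$cpu/trace_pipe_raw > /tmp/page-trace.$cpu
    ) &
done
wait
```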

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09  9:43               ` Wu Fengguang
@ 2009-05-09 10:22                 ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-09 10:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> > How about changing it from 'trigger' to 'dump_range':
> 
> That's a better name!
> 
> >    echo "*" > /debug/tracing/objects/mm/pages/dump_range
> > 
> > being a shortcut for 'dump all'?
> 
> No I'm not complaining about -1. That's even better than "*",
> because the latter can easily be expanded by shell ;)
> 
> > And:
> > 
> >    echo "1000 2000" > /debug/tracing/objects/mm/pages/dump_range
> > 
> > ?
> 
> Now it's much more intuitive!
> 
> > The '1000' is the offset where the dumping starts, and 2000 is the 
> > size of the dump.
> 
> Ah the second parameter 2000 can easily be taken as "end"..

Ok ... i've changed the name to dump_range and added your fix for 
mapcount as well. I pushed it all out to -tip.

Would you be interested in having a look at that and tweaking the 
dump_range API to any variant of your liking, and sending a patch 
for that? Both "<start> <end>" and "<start> <size>" (or any other 
variant) would be fine IMHO.

The lseek hack is nice (and we can keep that) but an explicit range 
API would be nice, we try to keep all of ftrace scriptable.

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09 10:01             ` Ingo Molnar
@ 2009-05-09 10:27               ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-09 10:27 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Ingo Molnar <mingo@elte.hu> wrote:

> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > 2) support concurrent object iterations
> >    For example, a huge 1TB memory space can be split up into 10
> >    segments which can be queried concurrently (with different options).
> 
> this should already be possible. If you lseek the trigger file, 
> that will be understood as an 'offset' by the patch, and then 
> write a (decimal) value into the file, that will be the count.
> 
> So it should already be possible to fork off nr_cpus helper 
> threads, one bound to each CPU, each triggering trace output of a 
> separate segment of the memory map - and each reading that CPU's 
> trace_pipe_raw file to recover the data - all in parallel.

And note that trace_pipe_raw supports splice(), while 
/proc/{kpageflags|kpagecount} does not, so the output side will 
probably be a bit faster than the /proc method.

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09  9:13           ` Wu Fengguang
@ 2009-05-09 10:36             ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-09 10:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> > Preliminary timings on an older, 1GB RAM 2 GHz Athlon64 box show 
> > that it's plenty fast:
> > 
> >  # time echo -1 > /debug/tracing/objects/mm/pages/trigger
> > 
> >   real	0m0.127s
> >   user	0m0.000s
> >   sys	0m0.126s
> > 
> >  # time cat /debug/tracing/per_cpu/*/trace_pipe_raw > /tmp/page-trace.bin
> > 
> >   real	0m0.065s
> >   user	0m0.001s
> >   sys	0m0.064s
> > 
> >   # ls -l /tmp/1
> >   -rw-r--r-- 1 root root 13774848 2009-05-09 11:46 /tmp/page-dump.bin
> > 
> > 127 millisecs to collect, 65 milliseconds to dump. (And that's not 
> > using splice() to dump the trace data.)
> 
> That's pretty fast and on par with kpageflags!

It's already faster here than kpageflags, on a 32 GB box i just 
tried, and the sum of timings (dumping + reading of 4 million page 
frame records, into/from a sufficiently large trace buffer) is 2.8 
seconds.

current upstream kpageflags is 3.3 seconds:

 phoenix:/home/mingo> time cat /proc/kpageflags  > /tmp/1

 real	0m3.338s
 user	0m0.004s
 sys	0m0.608s

(although it varies around a bit, sometimes back to 3.0 secs, 
sometimes more)

That's about 10% faster. Note that output performance could be 
improved more by using splice().

Also, it's apples to oranges, in an unfavorable-to-ftrace way: the 
pages object collection outputs all of these fields:

	field:unsigned short common_type;	offset:0;	size:2;
	field:unsigned char common_flags;	offset:2;	size:1;
	field:unsigned char common_preempt_count;	offset:3;	size:1;
	field:int common_pid;	offset:4;	size:4;
	field:int common_tgid;	offset:8;	size:4;

	field:unsigned long pfn;	offset:16;	size:8;
	field:unsigned long flags;	offset:24;	size:8;
	field:unsigned long index;	offset:32;	size:8;
	field:unsigned int count;	offset:40;	size:4;
	field:unsigned int mapcount;	offset:44;	size:4;

plus it generates and outputs the timestamp as well - while 
kpageflags is just page flags. (and kpagecount is only page counts)

Spreading the dumping+output out to the 16 CPUs of this box would 
shorten the run time at least 10-fold, to about 0.3-0.5 seconds 
IMHO. (but that has to be tried and measured first)

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-08 20:24       ` Andrew Morton
@ 2009-05-09 10:44         ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-09 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: fengguang.wu, fweisbec, rostedt, a.p.zijlstra, lizf,
	linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 8 May 2009 13:47:42 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Export all page flags faithfully in /proc/kpageflags.
> > 
> > Ongoing objection and NAK against extended haphazard exporting of 
> > kernel internals via an ad-hoc ABI via ad-hoc, privatized 
> > instrumentation that only helps the MM code and nothing else.
> 
> You're a year too late.  The pagemap interface is useful.

My NAK is against the extension of this mistake.

So is your answer to my NAK in essence:

 " We merged crappy MM instrumentation a short year ago, too bad.
   And because it was so crappy to be in /proc we are now also
   treating it as a hard ABI, not as a debugfs interface - for that 
   single app that is using it. Furthermore, we are now going to 
   make the API and ABI even more crappy via patches queued up in 
   -mm, and we are ignoring NAKs. We are also going to make it even 
   harder to have sane, generic instrumentation in the upstream 
   kernel. Deal with it, this is our code and we can mess it up the 
   way we wish to, it's none of your business."

right?

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09 10:22                 ` Ingo Molnar
@ 2009-05-09 10:45                   ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-09 10:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Sat, May 09, 2009 at 06:22:54PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > How about changing it from 'trigger' to 'dump_range':
> > 
> > That's a better name!
> > 
> > >    echo "*" > /debug/tracing/objects/mm/pages/dump_range
> > > 
> > > being a shortcut for 'dump all'?
> > 
> > No I'm not complaining about -1. That's even better than "*",
> > because the latter can easily be expanded by shell ;)
> > 
> > > And:
> > > 
> > >    echo "1000 2000" > /debug/tracing/objects/mm/pages/dump_range
> > > 
> > > ?
> > 
> > Now it's much more intuitive!
> > 
> > > The '1000' is the offset where the dumping starts, and 2000 is the 
> > > size of the dump.
> > 
> > Ah the second parameter 2000 can easily be taken as "end"..
> 
> Ok ... i've changed the name to dump_range and added your fix for 
> mapcount as well. I pushed it all out to -tip.

Thanks.

> Would you be interested in having a look at that and tweaking the 
> dump_range API to any variant of your liking, and sending a patch 
> for that? Both "<start> <end>" and "<start> <size>" (or any other 
> variant) would be fine IMHO.

Sure. I can even volunteer for the process/file page walk work :)

> The lseek hack is nice (and we can keep that) but an explicit range 
> API would be nice, we try to keep all of ftrace scriptable.

OK.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09 10:01             ` Ingo Molnar
@ 2009-05-09 10:57               ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-09 10:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Sat, May 09, 2009 at 06:01:37PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > 2) support concurrent object iterations
> >    For example, a huge 1TB memory space can be split up into 10
> >    segments which can be queried concurrently (with different options).
> 
> this should already be possible. If you lseek the trigger file, that 
> will be understood as an 'offset' by the patch, and then write a 
> (decimal) value into the file, that will be the count.
> 
> So it should already be possible to fork off nr_cpus helper threads, 
> one bound to each CPU, each triggering trace output of a separate 
> segment of the memory map - and each reading that CPU's 
> trace_pipe_raw file to recover the data - all in parallel.

How will this work out in general? More examples: when walking pages
by file/process, is it possible to divide the files/processes into N
sets and dump their pages concurrently? When walking the (huge) inode
lists of different superblocks, is it possible to fork one thread per
superblock?

The above situations would demand concurrent instances with different
filename/pid/superblock options.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09 10:57               ` Wu Fengguang
@ 2009-05-09 11:05                 ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-09 11:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Sat, May 09, 2009 at 06:01:37PM +0800, Ingo Molnar wrote:
> > 
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > 2) support concurrent object iterations
> > >    For example, a huge 1TB memory space can be split up into 10
> > >    segments which can be queried concurrently (with different options).
> > 
> > this should already be possible. If you lseek the trigger file, that 
> > will be understood as an 'offset' by the patch, and then write a 
> > (decimal) value into the file, that will be the count.
> > 
> > So it should already be possible to fork off nr_cpus helper threads, 
> > one bound to each CPU, each triggering trace output of a separate 
> > segment of the memory map - and each reading that CPU's 
> > trace_pipe_raw file to recover the data - all in parallel.
> 

> How will this work out in general? More examples, when walking 
> pages by file/process, is it possible to divide the 
> files/processes into N sets, and dump their pages concurrently? 
> When walking the (huge) inode lists of different superblocks, is 
> it possible to fork one thread for each superblock?
> 
> In the above situations, they would demand concurrent instances 
> with different filename/pid/superblock options.

the iterators are certainly more complex, and harder to parallelise, 
in those cases, i submit.

But i like the page map example because it is (by far!) the largest 
collection of objects. Four million pages on a test-box i have.

So if the design is right and we do dumping on that extreme-end very 
well, we might not even care that much about parallelising dumping 
in other situations, even if there are thousands of tasks - it will 
just be even faster. And then we can keep the iterators and the APIs 
as simple as possible.

( And even for tasks, which are perhaps the hardest to iterate, we
  can still do the /proc method of iterating up to the offset by 
  counting. It wastes some time for each separate thread as it has 
  to count up to its offset, but it still allows the dumping itself
  to be parallelised. Or we could dump blocks of the PID hash array. 
  That distributes tasks well, and can be iterated very easily with 
  low/zero contention. The result will come out unordered in any 
  case. )

What do you think?

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09 11:05                 ` Ingo Molnar
@ 2009-05-09 12:23                   ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-09 12:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Sat, May 09, 2009 at 07:05:13PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Sat, May 09, 2009 at 06:01:37PM +0800, Ingo Molnar wrote:
> > > 
> > > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > 2) support concurrent object iterations
> > > >    For example, a huge 1TB memory space can be split up into 10
> > > >    segments which can be queried concurrently (with different options).
> > > 
> > > this should already be possible. If you lseek the trigger file, that 
> > > will be understood as an 'offset' by the patch, and then write a 
> > > (decimal) value into the file, that will be the count.
> > > 
> > > So it should already be possible to fork off nr_cpus helper threads, 
> > > one bound to each CPU, each triggering trace output of a separate 
> > > segment of the memory map - and each reading that CPU's 
> > > trace_pipe_raw file to recover the data - all in parallel.
> > 
> 
> > How will this work out in general? More examples, when walking 
> > pages by file/process, is it possible to divide the 
> > files/processes into N sets, and dump their pages concurrently? 
> > When walking the (huge) inode lists of different superblocks, is 
> > it possible to fork one thread for each superblock?
> > 
> > In the above situations, they would demand concurrent instances 
> > with different filename/pid/superblock options.
> 
> the iterators are certainly more complex, and harder to parallelise, 
> in those cases, i submit.

OK. I'm pushing the parallelism idea because 4+ cores are going to be
commonplace on desktops (not to mention servers). And I have a clear
use case for the parallelism: user-space directed memory shrinking
before hibernation, where the user-space tool scans all/most pages in
all/most files in all superblocks and then selectively calls
fadvise(DONTNEED).

In that case we want to work as fast as possible in order not to slow
down hibernation. Parallelism definitely helps.
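The per-CPU scheme Ingo describes (lseek an offset into the trigger
file, write a count, read each CPU's trace_pipe_raw) first needs the
pfn range split into contiguous segments. A minimal sketch of that
split; only the user-space arithmetic is modeled here, the trigger
file and kernel side are not:

```python
def split_pfn_range(start, end, nr_workers):
    """Split the pfn range [start, end) into nr_workers contiguous
    segments, one per helper thread bound to one CPU; the last
    segment absorbs any remainder."""
    step = (end - start) // nr_workers
    segs = []
    for i in range(nr_workers):
        seg_start = start + i * step
        seg_end = end if i == nr_workers - 1 else seg_start + step
        segs.append((seg_start, seg_end))
    return segs

# Four million pages (Ingo's test box) split across 4 workers; each
# worker would then trigger a dump of its own segment in parallel.
print(split_pfn_range(0, 4_000_000, 4))
# → [(0, 1000000), (1000000, 2000000), (2000000, 3000000), (3000000, 4000000)]
```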

> But i like the page map example because it is (by far!) the largest 
> collection of objects. Four million pages on a test-box i have.

Yes!

> So if the design is right and we do dumping on that extreme-end very 
> well, we might not even care that much about parallelising dumping 
> in other situations, even if there are thousands of tasks - it will 
> just be even faster. And then we can keep the iterators and the APIs 
> as simple as possible.

That offset trick won't work well for small files. When we have lots
of small files, the parallelism granularity should be whole files
instead of page chunks inside them. But maybe I'm stressing this too much.

> ( And even for tasks, which are perhaps the hardest to iterate, we
>   can still do the /proc method of iterating up to the offset by 
>   counting. It wastes some time for each separate thread as it has 
>   to count up to its offset, but it still allows the dumping itself
>   to be parallelised. Or we could dump blocks of the PID hash array. 
>   That distributes tasks well, and can be iterated very easily with 
>   low/zero contention. The result will come out unordered in any 
>   case. )

For task/file based page walking, the best parallelism unit may be
the task/file, instead of page segments inside them.

And there is the sparse-file problem: there will be large holes in
the address space of files and processes (and even physical memory!).

It would be good not to output any lines for the holes. Even better,
in the case of files/processes, lots of pages will share the same flags,
count and mapcount. If we don't print each pfn, the output can be
reduced from per-page lines
        index flags count mapcount
to per-page-range summaries:
        index len flags count mapcount

For example, here is some output from my filecache tool. This trick
could reduce the output size by 10x!

        # idx   len     state   refcnt
        0       1       RAMU___ 2
        1       3       ___U___ 1
        4       1       RAMU___ 2
        5       57      R_MU___ 2
        62      2       ___U___ 1
        64      60      R_MU___ 2
        124     6       ___U___ 1
        130     1       R_MU___ 2
        131     1       ___U___ 1
        132     2       R_MU___ 2
        134     1       ___U___ 1
        135     2       R_MU___ 2
        137     1       ___U___ 1
        138     5       R_MU___ 2
        143     1       ___U___ 1
        144     2       R_MU___ 2
        146     2       ___U___ 1
        148     26      R_MU___ 2
        174     3       ___U___ 1
        177     54      R_MU___ 2
        231     1       ___U___ 1
        232     16      R_MU___ 2
        248     2       ___U___ 1

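The per-page-range compression above is plain run-length encoding over
the page state. A minimal sketch, assuming records arrive sorted by
index (the tuple layout is illustrative, not the tool's actual record
format); holes end a range naturally because the next index is not
adjacent:

```python
def summarize(pages):
    """Run-length encode per-page records into per-range summaries.

    `pages` is a sorted list of (index, state...) tuples, e.g.
    (index, flags, refcnt).  Consecutive indices with identical state
    merge into one (index, len, state...) row."""
    out = []
    for rec in pages:
        idx, state = rec[0], rec[1:]
        prev = out[-1] if out else None
        if prev and prev[2:] == state and prev[0] + prev[1] == idx:
            out[-1] = (prev[0], prev[1] + 1) + state  # extend run
        else:
            out.append((idx, 1) + state)              # start new run
    return out

# Mirrors the first rows of the filecache sample (idx 0..4).
rows = [(0, 'RAMU', 2), (1, '___U', 1), (2, '___U', 1),
        (3, '___U', 1), (4, 'RAMU', 2)]
print(summarize(rows))
# → [(0, 1, 'RAMU', 2), (1, 3, '___U', 1), (4, 1, 'RAMU', 2)]
```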
Another problem is that the holes are often really huge. If user space
walks the pages with

        while true
        do
                echo n n+10000 > range-to-dump
                cat trace >> log
        done

then the holes will still cost a lot of unnecessary context switches.
It would be better to work this way:
        while true
        do 
                echo 10000 > amount-to-dump
                cat trace >> log
        done

Is this possible?

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09 12:23                   ` Wu Fengguang
@ 2009-05-09 14:05                     ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-09 14:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> > ( And even for tasks, which are perhaps the hardest to iterate, we
> >   can still do the /proc method of iterating up to the offset by 
> >   counting. It wastes some time for each separate thread as it has 
> >   to count up to its offset, but it still allows the dumping itself
> >   to be parallelised. Or we could dump blocks of the PID hash array. 
> >   That distributes tasks well, and can be iterated very easily with 
> >   low/zero contention. The result will come out unordered in any 
> >   case. )
> 
> For task/file based page walking, the best parallelism unit can be 
> the task/file, instead of page segments inside them.
> 
> And there is the sparse file problem. There will be large holes in 
> the address space of file and process(and even physical memory!).

If we want to iterate in the file offset space then we should use 
the find_get_pages() trick: use the page radix tree and do gang 
lookups in ascending order. Holes will be skipped over in a natural 
way in the tree.
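A toy model of that gang-lookup behaviour, with a plain dict standing
in for the page-cache radix tree (the real find_get_pages() also takes
page references and batches tree walks, which this sketch ignores):

```python
def gang_lookup(tree, start, nr):
    """Toy find_get_pages(): return up to `nr` (index, page) pairs at
    indices >= `start`, in ascending order.  Holes are simply absent
    keys in the dict, so they are skipped over naturally."""
    hits = [(i, tree[i]) for i in sorted(tree) if i >= start]
    return hits[:nr]

# A sparse file: a few pages at the front, then a huge hole.
cache = {0: 'a', 1: 'b', 2: 'c', 1000000: 'z'}
print(gang_lookup(cache, 1, 3))  # → [(1, 'b'), (2, 'c'), (1000000, 'z')]
```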

Regarding iterators, i think the best way would be to expose a 
number of 'natural iterators' in the object collection directory. 
The current dump_range could be changed to "pfn_index" (it's really 
a 'physical page number' index and iterator), and we could introduce 
a couple of other indices as well:

    /debug/tracing/objects/mm/pages/pfn_index
    /debug/tracing/objects/mm/pages/filename_index
    /debug/tracing/objects/mm/pages/sb_index
    /debug/tracing/objects/mm/pages/task_index

"filename_index" would take a file name (a string), and would dump 
all pages of that inode - perhaps with an additional index/range 
parameter as well. For example:

    echo "/home/foo/bar.txt 0 1000" > filename_index

Would look up that file and dump any pages in the page cache related 
to that file, in the 0..1000 pages offset range.

( We could support the 'batching' of such requests too, so 
  multi-line strings can be used to request multiple files, via a 
  single system call.

  We could perhaps even support directories and do 
  directory-and-all-child-dentries/inodes recursive lookups. )

Other indices/iterators would work like this:

    echo "/var" > sb_index

Would try to find the superblock associated to /var, and output all 
pages that relate to that superblock. (it would iterate over all 
inodes and look them all up in the pagecache and dump any matches)

Alternatively, we could do a reverse lookup of the inode from the
pfn, and output that name. That would bloat the records a bit, and
would be more costly as well.

The 'task_index' would output based on a PID, it would find the mm 
of that task and dump all pages associated to that mm. Offset/range 
info would be virtual address page index based.

Are these things close to what you had in mind?

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-09 10:44         ` Ingo Molnar
@ 2009-05-10  3:58           ` Andrew Morton
  -1 siblings, 0 replies; 92+ messages in thread
From: Andrew Morton @ 2009-05-10  3:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: fengguang.wu, fweisbec, rostedt, a.p.zijlstra, lizf,
	linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm

On Sat, 9 May 2009 12:44:09 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Fri, 8 May 2009 13:47:42 +0200
> > Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > 
> > > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > Export all page flags faithfully in /proc/kpageflags.
> > > 
> > > Ongoing objection and NAK against extended haphazard exporting of 
> > > kernel internals via an ad-hoc ABI via ad-hoc, privatized 
> > > instrumentation that only helps the MM code and nothing else.
> > 
> > You're a year too late.  The pagemap interface is useful.
> 
> My NAK is against the extension of this mistake.
> 
> So is your answer to my NAK in essence:
> 
>  " We merged crappy MM instrumentation a short year ago, too bad.
>    And because it was so crappy to be in /proc we are now also
>    treating it as a hard ABI, not as a debugfs interface - for that 
>    single app that is using it. Furthermore, we are now going to 
>    make the API and ABI even more crappy via patches queued up in 
>    -mm, and we are ignoring NAKs. We are also going to make it even 
>    harder to have sane, generic instrumentation in the upstream 
>    kernel. Deal with it, this is our code and we can mess it up the 
>    way we wish to, it's none of your business."
> 
> right?
> 

If that was my answer, that is what I would have typed.

But I in fact typed something quite different.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-09 10:44         ` Ingo Molnar
@ 2009-05-10  5:26           ` Andrew Morton
  -1 siblings, 0 replies; 92+ messages in thread
From: Andrew Morton @ 2009-05-10  5:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: fengguang.wu, fweisbec, rostedt, a.p.zijlstra, lizf,
	linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm

On Sat, 9 May 2009 12:44:09 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> And because it was so crappy to be in /proc we are now also
> treating it as a hard ABI, not as a debugfs interface - for that
> single app that is using it. 

We'd probably make better progress here were someone to explain what
pagemap actually is.


pagemap is a userspace interface via which application developers
(including embedded) can analyse, understand and optimise their use of
memory.

It is not a debugging feature at all, let alone a kernel debugging
feature.  For this reason it is not appropriate that its interfaces be
presented in debugfs.

Furthermore the main control file for pagemap is in
/proc/<pid>/pagemap.  pagemap _cannot_ be put in debugfs because
debugfs doesn't maintain the per-process subdirectories in which to
place it.  /proc/<pid>/ is exactly the place where the pagemap file
should appear.

Yes, we could place pagemap's two auxiliary files into debugfs but it
would be rather stupid to split the feature's control files across two
pseudo filesystems, one of which may not even exist.  Plus pagemap is
not a kernel debugging feature.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-09 14:05                     ` Ingo Molnar
@ 2009-05-10  8:35                       ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-10  8:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Sat, May 09, 2009 at 10:05:12PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > ( And even for tasks, which are perhaps the hardest to iterate, we
> > >   can still do the /proc method of iterating up to the offset by 
> > >   counting. It wastes some time for each separate thread as it has 
> > >   to count up to its offset, but it still allows the dumping itself
> > >   to be parallelised. Or we could dump blocks of the PID hash array. 
> > >   That distributes tasks well, and can be iterated very easily with 
> > >   low/zero contention. The result will come out unordered in any 
> > >   case. )
> > 
> > For task/file based page walking, the best parallelism unit can be 
> > the task/file, instead of page segments inside them.
> > 
> > And there is the sparse file problem. There will be large holes in 
> > the address space of file and process(and even physical memory!).
> 
> If we want to iterate in the file offset space then we should use 
> the find_get_pages() trick: use the page radix tree and do gang 
> lookups in ascending order. Holes will be skipped over in a natural 
> way in the tree.

Right. I actually have code doing this, very neat trick.

> Regarding iterators, i think the best way would be to expose a 
> number of 'natural iterators' in the object collection directory. 
> The current dump_range could be changed to "pfn_index" (it's really 
> a 'physical page number' index and iterator), and we could introduce 
> a couple of other indices as well:
> 
>     /debug/tracing/objects/mm/pages/pfn_index
>     /debug/tracing/objects/mm/pages/filename_index
>     /debug/tracing/objects/mm/pages/task_index
>     /debug/tracing/objects/mm/pages/sb_index

How about 

     /debug/tracing/objects/mm/pages/walk-pfn
     /debug/tracing/objects/mm/pages/walk-file
     /debug/tracing/objects/mm/pages/walk-task

     /debug/tracing/objects/mm/pages/walk-fs
     (fs may be a more well known name than sb?)

They begin with a verb, because they are verbs when we echo some
parameters into them ;-)

> "filename_index" would take a file name (a string), and would dump 
> all pages of that inode - perhaps with an additional index/range 
> parameter as well. For example:
> 
>     echo "/home/foo/bar.txt 0 1000" > filename_index

Better to use

     "0 1000 /home/foo/bar.txt"

because there will be files named "/some/file 001".

But then echo will append an additional '\n' to the filename, and we
are faced with the question of whether to ignore that trailing '\n'.
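A sketch of how the kernel side might parse such a request, showing
both why the filename must come last and the trailing-newline
ambiguity (the format and function name are hypothetical):

```python
def parse_walk_file_request(line):
    """Parse a 'start count filename' request (hypothetical format).

    The filename may contain spaces, so split only twice from the
    left.  A single trailing newline (as appended by echo) is dropped,
    which means a filename that genuinely ends in a newline cannot be
    expressed -- exactly the ambiguity discussed above."""
    if line.endswith('\n'):
        line = line[:-1]
    start, count, name = line.split(' ', 2)
    return int(start), int(count), name

print(parse_walk_file_request('0 1000 /some/file 001\n'))
# → (0, 1000, '/some/file 001')
```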

> Would look up that file and dump any pages in the page cache related 
> to that file, in the 0..1000 pages offset range.
> 
> ( We could support the 'batching' of such requests too, so 
>   multi-line strings can be used to request multiple files, via a 
>   single system call.

Yes, I'd expect it to make some difference in efficiency when there
are many small files.

>   We could perhaps even support directories and do 
>   directory-and-all-child-dentries/inodes recursive lookups. )

Maybe; we could do this when such a need arises.

> Other indices/iterators would work like this:
> 
>     echo "/var" > sb_index
> 
> Would try to find the superblock associated to /var, and output all 
> pages that relate to that superblock. (it would iterate over all 
> inodes and look them all up in the pagecache and dump any matches)

Can we buffer that much output in the kernel? Even if ftrace has no
such limitation, it may not be a good idea to pin too many pages in
the ring buffer.

I do need this feature. But it sounds like a mixture of a
"files-inside-sb" walker and a "pages-inside-file" walker.
It's unclear how much it would overlap with the
"files object collection" to be added in:

        /debug/tracing/objects/mm/files/*

For example,

        /debug/tracing/objects/mm/files/walk-fs
        /debug/tracing/objects/mm/files/walk-dirty
        /debug/tracing/objects/mm/files/walk-global
and some filtering options, like size, cached_size, etc.

> Alternatively, we could do a reverse look up for the inode from the 
> pfn, and output that name. That would bloat the records a bit, and 
> would be more costly as well.

That sounds like a "describe-pfn" facility and could serve as a good debugging tool.

> The 'task_index' would output based on a PID, it would find the mm 
> of that task and dump all pages associated to that mm. Offset/range 
> info would be virtual address page index based.

Right.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
@ 2009-05-10  8:35                       ` Wu Fengguang
  0 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-10  8:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Sat, May 09, 2009 at 10:05:12PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > ( End even for tasks, which are perhaps the hardest to iterate, we
> > >   can still do the /proc method of iterating up to the offset by 
> > >   counting. It wastes some time for each separate thread as it has 
> > >   to count up to its offset, but it still allows the dumping itself
> > >   to be parallelised. Or we could dump blocks of the PID hash array. 
> > >   That distributes tasks well, and can be iterated very easily with 
> > >   low/zero contention. The result will come out unordered in any 
> > >   case. )
> > 
> > For task/file based page walking, the best parallelism unit can be 
> > the task/file, instead of page segments inside them.
> > 
> > And there is the sparse file problem. There will be large holes in 
> > the address space of file and process(and even physical memory!).
> 
> If we want to iterate in the file offset space then we should use 
> the find_get_pages() trick: use the page radix tree and do gang 
> lookups in ascending order. Holes will be skipped over in a natural 
> way in the tree.

Right. I actually have code doing this, very neat trick.

> Regarding iterators, i think the best way would be to expose a 
> number of 'natural iterators' in the object collection directory. 
> The current dump_range could be changed to "pfn_index" (it's really 
> a 'physical page number' index and iterator), and we could introduce 
> a couple of other indices as well:
> 
>     /debug/tracing/objects/mm/pages/pfn_index
>     /debug/tracing/objects/mm/pages/filename_index
>     /debug/tracing/objects/mm/pages/task_index
>     /debug/tracing/objects/mm/pages/sb_index

How about 

     /debug/tracing/objects/mm/pages/walk-pfn
     /debug/tracing/objects/mm/pages/walk-file
     /debug/tracing/objects/mm/pages/walk-task

     /debug/tracing/objects/mm/pages/walk-fs
     (fs may be a more well known name than sb?)

They begin with a verb, because they are verbs when we echo some
parameters into them ;-)

> "filename_index" would take a file name (a string), and would dump 
> all pages of that inode - perhaps with an additional index/range 
> parameter as well. For example:
> 
>     echo "/home/foo/bar.txt 0 1000" > filename_index

Better to use

     "0 1000 /home/foo/bar.txt"

because there will be files named "/some/file 001".

But then echo will append an additional '\n' to filename and we are
faced with the question whether to ignore the trailing '\n'.
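
The parsing rule being discussed could look like this (a hypothetical userspace model of the kernel-side parser; the walk-file name and field order are the proposal above, not an existing interface):

```python
def parse_walk_file(buf):
    """Parse '<start> <count> <path>' as written by e.g.
         echo "0 1000 /home/foo/bar.txt" > walk-file
    The path goes last so it may contain spaces; exactly one trailing
    newline (the one echo appends) is ignored, so a literal newline at
    the end of a path can still be expressed by writing two of them."""
    if buf.endswith("\n"):
        buf = buf[:-1]                      # drop the single echo-added newline
    start, count, path = buf.split(" ", 2)  # path keeps any embedded spaces
    return int(start), int(count), path

print(parse_walk_file("0 1000 /some/file 001\n"))  # (0, 1000, '/some/file 001')
print(parse_walk_file("5 10 /tmp/odd\n\n"))        # (5, 10, '/tmp/odd\n')
```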

> Would look up that file and dump any pages in the page cache related 
> to that file, in the 0..1000 pages offset range.
> 
> ( We could support the 'batching' of such requests too, so 
>   multi-line strings can be used to request multiple files, via a 
>   single system call.

Yes, I'd expect it to make some difference in efficiency, when there
are many small files.

>   We could perhaps even support directories and do 
>   directory-and-all-child-dentries/inodes recursive lookups. )

Maybe; we could do this when such a need arises.

> Other indices/iterators would work like this:
> 
>     echo "/var" > sb_index
> 
> Would try to find the superblock associated to /var, and output all 
> pages that relate to that superblock. (it would iterate over all 
> inodes and look them all up in the pagecache and dump any matches)

Can we buffer that much output in the kernel? Even if ftrace has no such
limitation, it may not be a good idea to pin too many pages in the
ring buffer.

I do need this feature. But it sounds like a mixture of a
"files-inside-sb" walker and a "pages-inside-file" walker. 
It's unclear how much functionality it would duplicate with the
"files object collection" to be added in:

        /debug/tracing/objects/mm/files/*

For example,

        /debug/tracing/objects/mm/files/walk-fs
        /debug/tracing/objects/mm/files/walk-dirty
        /debug/tracing/objects/mm/files/walk-global
and some filtering options, like size, cached_size, etc.

> Alternatively, we could do a reverse look up for the inode from the 
> pfn, and output that name. That would bloat the records a bit, and 
> would be more costly as well.

That sounds like "describe-pfn" and can serve as a good debugging tool.

> The 'task_index' would output based on a PID, it would find the mm 
> of that task and dump all pages associated to that mm. Offset/range 
> info would be virtual address page index based.

Right.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-10  5:26           ` Andrew Morton
@ 2009-05-11 11:45             ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-11 11:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: fengguang.wu, fweisbec, rostedt, a.p.zijlstra, lizf,
	linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sat, 9 May 2009 12:44:09 +0200 Ingo Molnar <mingo@elte.hu> wrote:
> 
> > And because it was so crappy to be in /proc we are now also 
> > treating it as a hard ABI, not as a debugfs interface - for that 
> > single app that is using it.
> 
> We'd probably make better progress here were someone to explain 
> what pagemap actually is.
> 
> pagemap is a userspace interface via which application developers 
> (including embedded) can analyse, understand and optimise their 
> use of memory.

IMHO that's really a fancy sentence for: 'to debug how their app 
interacts with the kernel'. Yes, it can be said without the word 
'debug' or 'instrumentation' in it. Maybe it could also be written 
without having any r's in it.

Doing any of that does not change the meaning of the feature though.

> It is not a debugging feature at all, let alone a kernel debugging 
> feature.  For this reason it is not appropriate that its 
> interfaces be presented in debugfs.
> 
> Furthermore the main control file for pagemap is in 
> /proc/<pid>/pagemap.  pagemap _cannot_ be put in debugfs because 
> debugfs doesn't maintain the per-process subdirectories in which 
> to place it.  /proc/<pid>/ is exactly the place where the pagemap 
> file should appear.

only if done in a stupid way.

The thing is, not all active inodes are enumerated in /debug, nor in 
/proc either. And we stopped stuffing new instrumentation into 
/proc about a decade ago and introduced debugfs for that.

_Especially_ when some piece of instrumentation is clearly growing 
in scope and nature, as here.

> Yes, we could place pagemap's two auxiliary files into debugfs but 
> it would be rather stupid to split the feature's control files 
> across two pseudo filesystems, one of which may not even exist.  
> Plus pagemap is not a kernel debugging feature.

That's not what i'm suggesting though.

What i'm suggesting is that there's a zillion ways to enumerate and 
index various kernel objects, doing that in /proc is fundamentally 
wrong. And there's no need to create a per PID/TID directory 
structure in /debug either, to be able to list and access objects by 
their PID.

_Especially_ when the end result is not human-readable to begin 
with, as it is in the pagemap/kpagecount/kpageflags case.

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [patch] tracing/mm: add page frame snapshot trace
  2009-05-10  8:35                       ` Wu Fengguang
@ 2009-05-11 12:01                         ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-11 12:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Frédéric Weisbecker, Steven Rostedt, Peter Zijlstra,
	Li Zefan, Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Sat, May 09, 2009 at 10:05:12PM +0800, Ingo Molnar wrote:
> > 
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > > ( And even for tasks, which are perhaps the hardest to iterate, we
> > > >   can still do the /proc method of iterating up to the offset by 
> > > >   counting. It wastes some time for each separate thread as it has 
> > > >   to count up to its offset, but it still allows the dumping itself
> > > >   to be parallelised. Or we could dump blocks of the PID hash array. 
> > > >   That distributes tasks well, and can be iterated very easily with 
> > > >   low/zero contention. The result will come out unordered in any 
> > > >   case. )
> > > 
> > > For task/file based page walking, the best parallelism unit can be 
> > > the task/file, instead of page segments inside them.
> > > 
> > > And there is the sparse file problem. There will be large holes in 
> > > the address spaces of files and processes (and even physical memory!).
> > 
> > If we want to iterate in the file offset space then we should use 
> > the find_get_pages() trick: use the page radix tree and do gang 
> > lookups in ascending order. Holes will be skipped over in a natural 
> > way in the tree.
> 
> Right. I actually have code doing this, very neat trick.
> 
> > Regarding iterators, i think the best way would be to expose a 
> > number of 'natural iterators' in the object collection directory. 
> > The current dump_range could be changed to "pfn_index" (it's really 
> > a 'physical page number' index and iterator), and we could introduce 
> > a couple of other indices as well:
> > 
> >     /debug/tracing/objects/mm/pages/pfn_index
> >     /debug/tracing/objects/mm/pages/filename_index
> >     /debug/tracing/objects/mm/pages/task_index
> >     /debug/tracing/objects/mm/pages/sb_index
> 
> How about 
> 
>      /debug/tracing/objects/mm/pages/walk-pfn
>      /debug/tracing/objects/mm/pages/walk-file
>      /debug/tracing/objects/mm/pages/walk-task
> 
>      /debug/tracing/objects/mm/pages/walk-fs
>      (fs may be a more well known name than sb?)
> 
> They begin with a verb, because they are verbs when we echo some
> parameters into them ;-)

yeah, good idea :) I saw the _index naming ugliness but couldn't 
think of a better strategy straight away. 'Use verbs for iterators, 
dummy' is the answer ;-)

> > "filename_index" would take a file name (a string), and would dump 
> > all pages of that inode - perhaps with an additional index/range 
> > parameter as well. For example:
> > 
> >     echo "/home/foo/bar.txt 0 1000" > filename_index
> 
> Better to use
> 
>      "0 1000 /home/foo/bar.txt"
> 
> because there will be files named "/some/file 001".

ok, good point!

> But then echo will append an additional '\n' to filename and we 
> are faced with the question whether to ignore the trailing '\n'.

Yeah, we should ignore the first trailing \n, thus \n can be forced 
in a filename by trailing it with \n\n. Btw., is there any 
legitimate software that generates \n into pathnames?

> > Would look up that file and dump any pages in the page cache related 
> > to that file, in the 0..1000 pages offset range.
> > 
> > ( We could support the 'batching' of such requests too, so 
> >   multi-line strings can be used to request multiple files, via a 
> >   single system call.
> 
> Yes, I'd expect it to make some difference in efficiency, when 
> there are many small files.

yeah.

> >   We could perhaps even support directories and do 
> >   directory-and-all-child-dentries/inodes recursive lookups. )
> 
> Maybe; we could do this when such a need arises.
> 
> > Other indices/iterators would work like this:
> > 
> >     echo "/var" > sb_index
> > 
> > Would try to find the superblock associated to /var, and output 
> > all pages that relate to that superblock. (it would iterate over 
> > all inodes and look them all up in the pagecache and dump any 
> > matches)
> 
> Can we buffer that much output in the kernel? Even if ftrace has no 
> such limitation, it may not be a good idea to pin too many pages 
> in the ring buffer.

Yes, we can even avoid the ring-buffer and create a small (but 
reasonably sized), dedicated one for each iterator.

It is a question whether we want to have multiple, parallel sessions 
of output pairs. It would be nice to allow it, but that needs some 
extra handshaking or ugly unix domain socket tricks.

Perhaps one simple thing would allow this: output to the same fd 
that gives the input? Not sure how scriptable this would be though, 
as the read() has to block until all output has been generated.

> I do need this feature. But it sounds like a mixture of a
> "files-inside-sb" walker and a "pages-inside-file" walker. 
> It's unclear how much functionality it would duplicate with the
> "files object collection" to be added in:
> 
>         /debug/tracing/objects/mm/files/*
> 
> For example,
> 
>         /debug/tracing/objects/mm/files/walk-fs
>         /debug/tracing/objects/mm/files/walk-dirty
>         /debug/tracing/objects/mm/files/walk-global
> and some filtering options, like size, cached_size, etc.

the walkers themselves will be one-off functions for sure. This is 
inevitable, unless we convert the kernel to C++ ;-)

But that's not a big issue: each walker will be useful and the 
walking part will be relatively simple. As long as the rest of the 
infrastructure around it is librarized to the max, it will all look 
tidy and supportable.

And it sure beats having to keep:

 /proc
 /files-dirty
 /files-all
 /filesystems
 /pages

convoluted directory hierarchies just to walk along each index.

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-11 11:45             ` Ingo Molnar
@ 2009-05-11 18:31               ` Andrew Morton
  -1 siblings, 0 replies; 92+ messages in thread
From: Andrew Morton @ 2009-05-11 18:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: fengguang.wu, fweisbec, rostedt, a.p.zijlstra, lizf,
	linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm

On Mon, 11 May 2009 13:45:54 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> > Yes, we could place pagemap's two auxiliary files into debugfs but 
> > it would be rather stupid to split the feature's control files 
> > across two pseudo filesystems, one of which may not even exist.  
> > Plus pagemap is not a kernel debugging feature.
> 
> That's not what i'm suggesting though.
> 
> What i'm suggesting is that there's a zillion ways to enumerate and 
> index various kernel objects, doing that in /proc is fundamentally 
> wrong. And there's no need to create a per PID/TID directory 
> structure in /debug either, to be able to list and access objects by 
> their PID.

The problem with procfs was that it was growing a lot of random
non-process-related stuff.  We never deprecated procfs - we decided
that it should be retained for its original purpose and that
non-process-realted things shouldn't go in there.

The /proc/<pid>/pagemap file clearly _is_ process-related, and
/proc/<pid> is the natural and correct place for it to live.

Yes, sure, there are any number of ways in which that data could be
presented to userspace in other locations and via other means.  But
there would need to be an extraordinarily good reason for violating the
existing paradigm/expectation/etc.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-11 11:45             ` Ingo Molnar
@ 2009-05-11 19:03               ` Andy Isaacson
  -1 siblings, 0 replies; 92+ messages in thread
From: Andy Isaacson @ 2009-05-11 19:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, fengguang.wu, fweisbec, rostedt, a.p.zijlstra,
	lizf, linux-kernel, kosaki.motohiro, andi, mpm, adobriyan,
	linux-mm

On Mon, May 11, 2009 at 01:45:54PM +0200, Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> > Yes, we could place pagemap's two auxiliary files into debugfs but 
> > it would be rather stupid to split the feature's control files 
> > across two pseudo filesystems, one of which may not even exist.  
> > Plus pagemap is not a kernel debugging feature.
> 
> That's not what i'm suggesting though.
> 
> What i'm suggesting is that there's a zillion ways to enumerate and 
> index various kernel objects, doing that in /proc is fundamentally 
> wrong.

This sounds like you're saying that /proc/<pid>/pagemap is wrong, and
I'm pretty sure I disagree with that statement.  debugfs is not a
substitute for pagemap.  pagemap+kpageflags is a significant improvement
in the memory-usage-introspection capabilities provided to Linux
applications, and if it were harder to access (by depending on debugfs)
it would be significantly less useful.

> And there's no need to create a per PID/TID directory 
> structure in /debug either, to be able to list and access objects by 
> their PID.
> 
> _Especially_ when the end result is not human-readable to begin 
> with, as it is in the pagemap/kpagecount/kpageflags case.

FWIW, we had a support script break due to /proc/<pid>/pagemap (it
tarred up /proc/[0-9]*/* and /var/log/ and application logfiles and sent
it off to support@, so once pagemap appeared the support script started
filling up disks).  I toyed around with making pagemap read(2)s return
-EINVAL unless the reader lseek(2)ed first[1], but decided we were
better off just fixing the support script to enumerate interesting proc
files, since there's no guarantee against further surprising semantics
getting added to /proc (and we'd still need to support unpatched
kernels).

So while I love the capability that kpageflags and pagemap provides, its
implementation has not been without impact.

On a slightly different tangent -- it's pretty trivial to decode pagemap
with dd(1) and hd(1), or even perl, and it's not as if (for example)
/proc/<pid>/maps is made much easier to interpret just because its
contents are presented as ASCII rather than binary, so I feel like the
design decisions of pagemap are sane and defensible.
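
As a sketch of how trivial the decoding is (bit layout as described in Documentation/vm/pagemap.txt of this era; the sample entry below is made up for illustration):

```python
import struct

# Each /proc/<pid>/pagemap entry is a little-endian u64:
#   bits 0-54: page frame number (if present), bit 62: page swapped,
#   bit 63: page present.  (Layout per Documentation/vm/pagemap.txt.)
def decode_pagemap_entry(raw8):
    (e,) = struct.unpack("<Q", raw8)
    return {
        "present": bool((e >> 63) & 1),
        "swapped": bool((e >> 62) & 1),
        "pfn":     e & ((1 << 55) - 1),   # meaningful only if present
    }

# A made-up entry: present, PFN 0x1234.
entry = struct.pack("<Q", (1 << 63) | 0x1234)
print(decode_pagemap_entry(entry))  # {'present': True, 'swapped': False, 'pfn': 4660}
```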

"I think that forms some kind of argument about kpageflags, but I'm not
sure if it's for or against."  -- someone witty

[1] fun fact -- cp(1) and cat(1) get the expected behavior with such a
patch, but dd(1) always lseek(2)s its input even if no skip= was
specified.

-andy

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] proc: export more page flags in /proc/kpageflags
  2009-05-11 18:31               ` Andrew Morton
@ 2009-05-11 22:08                 ` Ingo Molnar
  -1 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2009-05-11 22:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: fengguang.wu, fweisbec, rostedt, a.p.zijlstra, lizf,
	linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 11 May 2009 13:45:54 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > Yes, we could place pagemap's two auxiliary files into debugfs but 
> > > it would be rather stupid to split the feature's control files 
> > > across two pseudo filesystems, one of which may not even exist.  
> > > Plus pagemap is not a kernel debugging feature.
> > 
> > That's not what i'm suggesting though.
> > 
> > What i'm suggesting is that there's a zillion ways to enumerate 
> > and index various kernel objects, doing that in /proc is 
> > fundamentally wrong. And there's no need to create a per PID/TID 
> > directory structure in /debug either, to be able to list and 
> > access objects by their PID.
> 
> The problem with procfs was that it was growing a lot of random 
> non-process-related stuff.  We never deprecated procfs - we 
> decided that it should be retained for its original purpose and 
> that non-process-related things shouldn't go in there.
> 
> The /proc/<pid>/pagemap file clearly _is_ process-related, and 
> /proc/<pid> is the natural and correct place for it to live.
> 
> Yes, sure, there are any number of ways in which that data could 
> be presented to userspace in other locations and via other means.  
> But there would need to be an extraordinarily good reason for 
> violating the existing paradigm/expectation/etc.

It has also been clearly demonstrated in this thread that people 
want more enumeration than just the process dimension. 

_Especially_ for an object like pages. Often most of the memory in a 
Linux system is _not mapped to any process_. It is in the page 
cache. Still, /proc enumeration does not capture it. Why? Because 
IMO it has been done at the wrong layer, at the wrong abstraction 
level.

Yes, /proc is for process enumeration (as the name tells us 
already), but it is not really suitable as a general object 
enumerator for kernel debugging or kernel instrumentation purposes. 

By putting kernel instrumentation into /proc, we limit all _future_ 
enumeration greatly. Instead of adding just another iterator 
(walker), we now have to move the whole thing across into another 
domain (which is being resisted, and /proc is an ABI anyway).

It's all doable, but a lot harder if it's not realized why it's 
important to do it.

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/8] mm: introduce PageHuge() for testing huge/gigantic pages
  2009-05-08 10:53   ` Wu Fengguang
@ 2009-05-13 17:05     ` Mel Gorman
  -1 siblings, 0 replies; 92+ messages in thread
From: Mel Gorman @ 2009-05-13 17:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Matt Mackall, KOSAKI Motohiro, Andi Kleen, linux-mm

Sorry to join the game so late.

On Fri, May 08, 2009 at 06:53:21PM +0800, Wu Fengguang wrote:
> Introduce PageHuge(), which identifies huge/gigantic pages
> by their dedicated compound destructor functions.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/mm.h |   24 ++++++++++++++++++++++++
>  mm/hugetlb.c       |    2 +-
>  mm/page_alloc.c    |   11 ++++++++++-
>  3 files changed, 35 insertions(+), 2 deletions(-)
> 
> --- linux.orig/mm/page_alloc.c
> +++ linux/mm/page_alloc.c
> @@ -299,13 +299,22 @@ void prep_compound_page(struct page *pag
>  }
>  
>  #ifdef CONFIG_HUGETLBFS
> +/*
> + * This (duplicated) destructor function distinguishes gigantic pages from
> + * normal compound pages.
> + */
> +void free_gigantic_page(struct page *page)
> +{
> +	__free_pages_ok(page, compound_order(page));
> +}
> +
>  void prep_compound_gigantic_page(struct page *page, unsigned long order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
>  	struct page *p = page + 1;
>  
> -	set_compound_page_dtor(page, free_compound_page);
> +	set_compound_page_dtor(page, free_gigantic_page);
>  	set_compound_order(page, order);

This made me raise an eyebrow: gigantic pages can never end up back in the
page allocator, and if one did it should cause bugs all over the place. So I
looked closer, and this free_gigantic_page() looks unnecessary.

This is what happens for gigantic pages at boot-time

gather_bootmem_prealloc() called at boot-time to gather gigantic pages
  -> Find the boot allocated pages and call prep_compound_huge_page()
    -> For gigantic pages, call prep_compound_gigantic_page(), sets destructor to free_compound_page()
    -> Call prep_new_huge_page(), sets destructor to free_huge_page()

So, free_gigantic_page() should never be used as such in reality, and you can
just check for free_huge_page(). If a gigantic page were really freed that way,
it would be really bad.

Does that make sense?


>  	__SetPageHead(page);
>  	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -550,7 +550,7 @@ struct hstate *size_to_hstate(unsigned l
>  	return NULL;
>  }
>  
> -static void free_huge_page(struct page *page)
> +void free_huge_page(struct page *page)
>  {
>  	/*
>  	 * Can't pass hstate in here because it is called from the
> --- linux.orig/include/linux/mm.h
> +++ linux/include/linux/mm.h
> @@ -355,6 +355,30 @@ static inline void set_compound_order(st
>  	page[1].lru.prev = (void *)order;
>  }
>  
> +#ifdef CONFIG_HUGETLBFS
> +void free_huge_page(struct page *page);
> +void free_gigantic_page(struct page *page);
> +
> +static inline int PageHuge(struct page *page)
> +{
> +	compound_page_dtor *dtor;
> +
> +	if (!PageCompound(page))
> +		return 0;
> +
> +	page = compound_head(page);
> +	dtor = get_compound_page_dtor(page);
> +
> +	return  dtor == free_huge_page ||
> +		dtor == free_gigantic_page;
> +}
> +#else
> +static inline int PageHuge(struct page *page)
> +{
> +	return 0;
> +}
> +#endif

That is a fairly hefty function to be inline, and it exports free_huge_page
and free_gigantic_page, the latter of which is dead code and the former of
which was previously a static function.

At least make PageHuge a non-inlined function contained in mm/hugetlb.c and
expose it via mm/internal.h if possible or include/linux/hugetlb.h otherwise.

> +
>  /*
>   * Multiple processes may "see" the same page. E.g. for untouched
>   * mappings of /dev/null, all processes see the same page full of
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/8] mm: introduce PageHuge() for testing huge/gigantic pages
  2009-05-13 17:05     ` Mel Gorman
@ 2009-05-17 13:09       ` Wu Fengguang
  -1 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-17 13:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, LKML, Matt Mackall, KOSAKI Motohiro, Andi Kleen, linux-mm

On Thu, May 14, 2009 at 01:05:53AM +0800, Mel Gorman wrote:
> Sorry to join the game so late.
> 
> On Fri, May 08, 2009 at 06:53:21PM +0800, Wu Fengguang wrote:
> > Introduce PageHuge(), which identifies huge/gigantic pages
> > by their dedicated compound destructor functions.
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/mm.h |   24 ++++++++++++++++++++++++
> >  mm/hugetlb.c       |    2 +-
> >  mm/page_alloc.c    |   11 ++++++++++-
> >  3 files changed, 35 insertions(+), 2 deletions(-)
> > 
> > --- linux.orig/mm/page_alloc.c
> > +++ linux/mm/page_alloc.c
> > @@ -299,13 +299,22 @@ void prep_compound_page(struct page *pag
> >  }
> >  
> >  #ifdef CONFIG_HUGETLBFS
> > +/*
> > + * This (duplicated) destructor function distinguishes gigantic pages from
> > + * normal compound pages.
> > + */
> > +void free_gigantic_page(struct page *page)
> > +{
> > +	__free_pages_ok(page, compound_order(page));
> > +}
> > +
> >  void prep_compound_gigantic_page(struct page *page, unsigned long order)
> >  {
> >  	int i;
> >  	int nr_pages = 1 << order;
> >  	struct page *p = page + 1;
> >  
> > -	set_compound_page_dtor(page, free_compound_page);
> > +	set_compound_page_dtor(page, free_gigantic_page);
> >  	set_compound_order(page, order);
> 
> This made me raise an eyebrow: gigantic pages can never end up back in the
> page allocator, and if one did it should cause bugs all over the place. So I
> looked closer, and this free_gigantic_page() looks unnecessary.
> 
> This is what happens for gigantic pages at boot-time
> 
> gather_bootmem_prealloc() called at boot-time to gather gigantic pages
>   -> Find the boot allocated pages and call prep_compound_huge_page()
>     -> For gigantic pages, call prep_compound_gigantic_page(), sets destructor to free_compound_page()
>     -> Call prep_new_huge_page(), sets destructor to free_huge_page()
> 
> So, free_gigantic_page() should never be used as such in reality, and you can
> just check for free_huge_page(). If a gigantic page were really freed that way,
> it would be really bad.
> 
> Does that make sense?

You are right, thanks!

> > +#ifdef CONFIG_HUGETLBFS
> > +void free_huge_page(struct page *page);
> > +void free_gigantic_page(struct page *page);
> > +
> > +static inline int PageHuge(struct page *page)
> > +{
> > +	compound_page_dtor *dtor;
> > +
> > +	if (!PageCompound(page))
> > +		return 0;
> > +
> > +	page = compound_head(page);
> > +	dtor = get_compound_page_dtor(page);
> > +
> > +	return  dtor == free_huge_page ||
> > +		dtor == free_gigantic_page;
> > +}
> > +#else
> > +static inline int PageHuge(struct page *page)
> > +{
> > +	return 0;
> > +}
> > +#endif
> 
> That is a fairly hefty function to be inline, and it exports free_huge_page
> and free_gigantic_page, the latter of which is dead code and the former of
> which was previously a static function.
> 
> At least make PageHuge a non-inlined function contained in mm/hugetlb.c and
> expose it via mm/internal.h if possible or include/linux/hugetlb.h otherwise.

OK, moved the declaration to hugetlb.h, which will be included by fs/proc/page.c.

Andrew, will you replace the -mm patch
        mm-introduce-pagehuge-for-testing-huge-gigantic-pages.patch
with this one?

---
mm: introduce PageHuge() for testing huge/gigantic pages

Introduce PageHuge(), which identifies huge/gigantic pages
by their dedicated compound destructor functions.

Also move prep_compound_gigantic_page() to hugetlb.c and
move adjust_pool_surplus() close to its caller.

CC: Mel Gorman <mel@csn.ul.ie>
CC: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/proc/page.c          |    1 
 include/linux/hugetlb.h |    7 ++
 mm/hugetlb.c            |   98 ++++++++++++++++++++++++--------------
 mm/internal.h           |    5 -
 mm/page_alloc.c         |   17 ------
 5 files changed, 73 insertions(+), 55 deletions(-)

--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -298,23 +298,6 @@ void prep_compound_page(struct page *pag
 	}
 }
 
-#ifdef CONFIG_HUGETLBFS
-void prep_compound_gigantic_page(struct page *page, unsigned long order)
-{
-	int i;
-	int nr_pages = 1 << order;
-	struct page *p = page + 1;
-
-	set_compound_page_dtor(page, free_compound_page);
-	set_compound_order(page, order);
-	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
-		__SetPageTail(p);
-		p->first_page = page;
-	}
-}
-#endif
-
 static int destroy_compound_page(struct page *page, unsigned long order)
 {
 	int i;
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -578,41 +578,6 @@ static void free_huge_page(struct page *
 		hugetlb_put_quota(mapping, 1);
 }
 
-/*
- * Increment or decrement surplus_huge_pages.  Keep node-specific counters
- * balanced by operating on them in a round-robin fashion.
- * Returns 1 if an adjustment was made.
- */
-static int adjust_pool_surplus(struct hstate *h, int delta)
-{
-	static int prev_nid;
-	int nid = prev_nid;
-	int ret = 0;
-
-	VM_BUG_ON(delta != -1 && delta != 1);
-	do {
-		nid = next_node(nid, node_online_map);
-		if (nid == MAX_NUMNODES)
-			nid = first_node(node_online_map);
-
-		/* To shrink on this node, there must be a surplus page */
-		if (delta < 0 && !h->surplus_huge_pages_node[nid])
-			continue;
-		/* Surplus cannot exceed the total number of pages */
-		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
-						h->nr_huge_pages_node[nid])
-			continue;
-
-		h->surplus_huge_pages += delta;
-		h->surplus_huge_pages_node[nid] += delta;
-		ret = 1;
-		break;
-	} while (nid != prev_nid);
-
-	prev_nid = nid;
-	return ret;
-}
-
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	set_compound_page_dtor(page, free_huge_page);
@@ -623,6 +588,34 @@ static void prep_new_huge_page(struct hs
 	put_page(page); /* free it into the hugepage allocator */
 }
 
+static void prep_compound_gigantic_page(struct page *page, unsigned long order)
+{
+	int i;
+	int nr_pages = 1 << order;
+	struct page *p = page + 1;
+
+	/* we rely on prep_new_huge_page to set the destructor */
+	set_compound_order(page, order);
+	__SetPageHead(page);
+	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		__SetPageTail(p);
+		p->first_page = page;
+	}
+}
+
+int PageHuge(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	if (!PageCompound(page))
+		return 0;
+
+	page = compound_head(page);
+	dtor = get_compound_page_dtor(page);
+
+	return dtor == free_huge_page;
+}
+
 static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
@@ -1140,6 +1133,41 @@ static inline void try_to_free_low(struc
 }
 #endif
 
+/*
+ * Increment or decrement surplus_huge_pages.  Keep node-specific counters
+ * balanced by operating on them in a round-robin fashion.
+ * Returns 1 if an adjustment was made.
+ */
+static int adjust_pool_surplus(struct hstate *h, int delta)
+{
+	static int prev_nid;
+	int nid = prev_nid;
+	int ret = 0;
+
+	VM_BUG_ON(delta != -1 && delta != 1);
+	do {
+		nid = next_node(nid, node_online_map);
+		if (nid == MAX_NUMNODES)
+			nid = first_node(node_online_map);
+
+		/* To shrink on this node, there must be a surplus page */
+		if (delta < 0 && !h->surplus_huge_pages_node[nid])
+			continue;
+		/* Surplus cannot exceed the total number of pages */
+		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
+						h->nr_huge_pages_node[nid])
+			continue;
+
+		h->surplus_huge_pages += delta;
+		h->surplus_huge_pages_node[nid] += delta;
+		ret = 1;
+		break;
+	} while (nid != prev_nid);
+
+	prev_nid = nid;
+	return ret;
+}
+
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
 {
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -16,9 +16,6 @@
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
-extern void prep_compound_page(struct page *page, unsigned long order);
-extern void prep_compound_gigantic_page(struct page *page, unsigned long order);
-
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
@@ -51,6 +48,8 @@ extern void putback_lru_page(struct page
  */
 extern unsigned long highest_memmap_pfn;
 extern void __free_pages_bootmem(struct page *page, unsigned int order);
+extern void prep_compound_page(struct page *page, unsigned long order);
+
 
 /*
  * function for dealing with page's order in buddy system.
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -11,6 +11,8 @@
 
 struct ctl_table;
 
+int PageHuge(struct page *page);
+
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_HUGETLB;
@@ -61,6 +63,11 @@ void hugetlb_change_protection(struct vm
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
+static inline int PageHuge(struct page *page)
+{
+	return 0;
+}
+
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
 	return 0;
--- linux.orig/fs/proc/page.c
+++ linux/fs/proc/page.c
@@ -6,6 +6,7 @@
 #include <linux/mmzone.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/hugetlb.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/8] mm: introduce PageHuge() for testing huge/gigantic pages
@ 2009-05-17 13:09       ` Wu Fengguang
  0 siblings, 0 replies; 92+ messages in thread
From: Wu Fengguang @ 2009-05-17 13:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, LKML, Matt Mackall, KOSAKI Motohiro, Andi Kleen, linux-mm

On Thu, May 14, 2009 at 01:05:53AM +0800, Mel Gorman wrote:
> Sorry to join the game so late.
> 
> On Fri, May 08, 2009 at 06:53:21PM +0800, Wu Fengguang wrote:
> > Introduce PageHuge(), which identifies huge/gigantic pages
> > by their dedicated compound destructor functions.
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/mm.h |   24 ++++++++++++++++++++++++
> >  mm/hugetlb.c       |    2 +-
> >  mm/page_alloc.c    |   11 ++++++++++-
> >  3 files changed, 35 insertions(+), 2 deletions(-)
> > 
> > --- linux.orig/mm/page_alloc.c
> > +++ linux/mm/page_alloc.c
> > @@ -299,13 +299,22 @@ void prep_compound_page(struct page *pag
> >  }
> >  
> >  #ifdef CONFIG_HUGETLBFS
> > +/*
> > + * This (duplicated) destructor function distinguishes gigantic pages from
> > + * normal compound pages.
> > + */
> > +void free_gigantic_page(struct page *page)
> > +{
> > +	__free_pages_ok(page, compound_order(page));
> > +}
> > +
> >  void prep_compound_gigantic_page(struct page *page, unsigned long order)
> >  {
> >  	int i;
> >  	int nr_pages = 1 << order;
> >  	struct page *p = page + 1;
> >  
> > -	set_compound_page_dtor(page, free_compound_page);
> > +	set_compound_page_dtor(page, free_gigantic_page);
> >  	set_compound_order(page, order);
> 
> This made me raise an eyebrow. gigantic pages can never end up back in the
> page allocator.  It should cause bugs all over the place so I looked closer
> and this free_gigantic_page() looks unnecessary.
> 
> This is what happens for gigantic pages at boot-time
> 
> gather_bootmem_prealloc() called at boot-time to gather gigantic pages
>   -> Find the boot allocated pages and call prep_compound_huge_page()
>     -> For gigantic pages, call prep_compound_gigantic_page(), sets destructor to free_compound_page()
>     -> Call prep_new_huge_page(), sets destructor to free_huge_page()
> 
> So, free_gigantic_page() should never used as such in reality and you can
> just check free_huge_page(). If a gigantic page was really freed that way,
> it would be really bad.
> 
> Does that make sense?

You are right, thanks!

> > +#ifdef CONFIG_HUGETLBFS
> > +void free_huge_page(struct page *page);
> > +void free_gigantic_page(struct page *page);
> > +
> > +static inline int PageHuge(struct page *page)
> > +{
> > +	compound_page_dtor *dtor;
> > +
> > +	if (!PageCompound(page))
> > +		return 0;
> > +
> > +	page = compound_head(page);
> > +	dtor = get_compound_page_dtor(page);
> > +
> > +	return  dtor == free_huge_page ||
> > +		dtor == free_gigantic_page;
> > +}
> > +#else
> > +static inline int PageHuge(struct page *page)
> > +{
> > +	return 0;
> > +}
> > +#endif
> 
> That is fairly hefty function to be inline and it exports free_huge_page
> and free_gigantic_page.  The latter of which is dead code and the former
> which was previously a static function.
> 
> At least make PageHuge a non-inlined function contained in mm/hugetlb.c and
> expose it via mm/internal.h if possible or include/linux/hugetlb.h otherwise.

OK, moved the declaration to hugetlb.h, which will be included by fs/proc/page.c.

Andrew, will you replace the -mm patch
        mm-introduce-pagehuge-for-testing-huge-gigantic-pages.patch
with this one?

---
mm: introduce PageHuge() for testing huge/gigantic pages

Introduce PageHuge(), which identifies huge/gigantic pages
by their dedicated compound destructor functions.

Also move prep_compound_gigantic_page() to hugetlb.c and
move adjust_pool_surplus() close to its caller.

CC: Mel Gorman <mel@csn.ul.ie>
CC: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/proc/page.c          |    1 
 include/linux/hugetlb.h |    7 ++
 mm/hugetlb.c            |   98 ++++++++++++++++++++++++--------------
 mm/internal.h           |    5 -
 mm/page_alloc.c         |   17 ------
 5 files changed, 73 insertions(+), 55 deletions(-)

--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -298,23 +298,6 @@ void prep_compound_page(struct page *pag
 	}
 }
 
-#ifdef CONFIG_HUGETLBFS
-void prep_compound_gigantic_page(struct page *page, unsigned long order)
-{
-	int i;
-	int nr_pages = 1 << order;
-	struct page *p = page + 1;
-
-	set_compound_page_dtor(page, free_compound_page);
-	set_compound_order(page, order);
-	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
-		__SetPageTail(p);
-		p->first_page = page;
-	}
-}
-#endif
-
 static int destroy_compound_page(struct page *page, unsigned long order)
 {
 	int i;
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -578,41 +578,6 @@ static void free_huge_page(struct page *
 		hugetlb_put_quota(mapping, 1);
 }
 
-/*
- * Increment or decrement surplus_huge_pages.  Keep node-specific counters
- * balanced by operating on them in a round-robin fashion.
- * Returns 1 if an adjustment was made.
- */
-static int adjust_pool_surplus(struct hstate *h, int delta)
-{
-	static int prev_nid;
-	int nid = prev_nid;
-	int ret = 0;
-
-	VM_BUG_ON(delta != -1 && delta != 1);
-	do {
-		nid = next_node(nid, node_online_map);
-		if (nid == MAX_NUMNODES)
-			nid = first_node(node_online_map);
-
-		/* To shrink on this node, there must be a surplus page */
-		if (delta < 0 && !h->surplus_huge_pages_node[nid])
-			continue;
-		/* Surplus cannot exceed the total number of pages */
-		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
-						h->nr_huge_pages_node[nid])
-			continue;
-
-		h->surplus_huge_pages += delta;
-		h->surplus_huge_pages_node[nid] += delta;
-		ret = 1;
-		break;
-	} while (nid != prev_nid);
-
-	prev_nid = nid;
-	return ret;
-}
-
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	set_compound_page_dtor(page, free_huge_page);
@@ -623,6 +588,34 @@ static void prep_new_huge_page(struct hs
 	put_page(page); /* free it into the hugepage allocator */
 }
 
+static void prep_compound_gigantic_page(struct page *page, unsigned long order)
+{
+	int i;
+	int nr_pages = 1 << order;
+	struct page *p = page + 1;
+
+	/* we rely on prep_new_huge_page to set the destructor */
+	set_compound_order(page, order);
+	__SetPageHead(page);
+	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		__SetPageTail(p);
+		p->first_page = page;
+	}
+}
+
+int PageHuge(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	if (!PageCompound(page))
+		return 0;
+
+	page = compound_head(page);
+	dtor = get_compound_page_dtor(page);
+
+	return dtor == free_huge_page;
+}
+
 static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
@@ -1140,6 +1133,41 @@ static inline void try_to_free_low(struc
 }
 #endif
 
+/*
+ * Increment or decrement surplus_huge_pages.  Keep node-specific counters
+ * balanced by operating on them in a round-robin fashion.
+ * Returns 1 if an adjustment was made.
+ */
+static int adjust_pool_surplus(struct hstate *h, int delta)
+{
+	static int prev_nid;
+	int nid = prev_nid;
+	int ret = 0;
+
+	VM_BUG_ON(delta != -1 && delta != 1);
+	do {
+		nid = next_node(nid, node_online_map);
+		if (nid == MAX_NUMNODES)
+			nid = first_node(node_online_map);
+
+		/* To shrink on this node, there must be a surplus page */
+		if (delta < 0 && !h->surplus_huge_pages_node[nid])
+			continue;
+		/* Surplus cannot exceed the total number of pages */
+		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
+						h->nr_huge_pages_node[nid])
+			continue;
+
+		h->surplus_huge_pages += delta;
+		h->surplus_huge_pages_node[nid] += delta;
+		ret = 1;
+		break;
+	} while (nid != prev_nid);
+
+	prev_nid = nid;
+	return ret;
+}
+
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
 {
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -16,9 +16,6 @@
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
-extern void prep_compound_page(struct page *page, unsigned long order);
-extern void prep_compound_gigantic_page(struct page *page, unsigned long order);
-
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
@@ -51,6 +48,8 @@ extern void putback_lru_page(struct page
  */
 extern unsigned long highest_memmap_pfn;
 extern void __free_pages_bootmem(struct page *page, unsigned int order);
+extern void prep_compound_page(struct page *page, unsigned long order);
+
 
 /*
  * function for dealing with page's order in buddy system.
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -11,6 +11,8 @@
 
 struct ctl_table;
 
+int PageHuge(struct page *page);
+
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_HUGETLB;
@@ -61,6 +63,11 @@ void hugetlb_change_protection(struct vm
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
+static inline int PageHuge(struct page *page)
+{
+	return 0;
+}
+
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
 	return 0;
--- linux.orig/fs/proc/page.c
+++ linux/fs/proc/page.c
@@ -6,6 +6,7 @@
 #include <linux/mmzone.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/hugetlb.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 




Thread overview: 92+ messages
2009-05-08 10:53 [PATCH 0/8] export more page flags in /proc/kpageflags (take 6) Wu Fengguang
2009-05-08 10:53 ` [PATCH 1/8] mm: introduce PageHuge() for testing huge/gigantic pages Wu Fengguang
2009-05-08 11:40   ` Ingo Molnar
2009-05-08 12:21     ` Wu Fengguang
2009-05-13 17:05   ` Mel Gorman
2009-05-17 13:09     ` Wu Fengguang
2009-05-08 10:53 ` [PATCH 2/8] slob: use PG_slab for identifying SLOB pages Wu Fengguang
2009-05-08 10:53 ` [PATCH 3/8] proc: kpagecount/kpageflags code cleanup Wu Fengguang
2009-05-08 10:53 ` [PATCH 4/8] proc: export more page flags in /proc/kpageflags Wu Fengguang
2009-05-08 11:47   ` Ingo Molnar
2009-05-08 12:44     ` Wu Fengguang
2009-05-09  5:59       ` Ingo Molnar
2009-05-09  7:56         ` Wu Fengguang
2009-05-09  6:27       ` [patch] tracing/mm: add page frame snapshot trace Ingo Molnar
2009-05-09  9:13         ` Wu Fengguang
2009-05-09  9:24           ` Ingo Molnar
2009-05-09  9:43             ` Wu Fengguang
2009-05-09 10:22               ` Ingo Molnar
2009-05-09 10:45                 ` Wu Fengguang
2009-05-09 10:01           ` Ingo Molnar
2009-05-09 10:27             ` Ingo Molnar
2009-05-09 10:57             ` Wu Fengguang
2009-05-09 11:05               ` Ingo Molnar
2009-05-09 12:23                 ` Wu Fengguang
2009-05-09 14:05                   ` Ingo Molnar
2009-05-10  8:35                     ` Wu Fengguang
2009-05-11 12:01                       ` Ingo Molnar
2009-05-09 10:36           ` Ingo Molnar
2009-05-08 12:58     ` ftrace: concurrent accesses possible? Wu Fengguang
2009-05-08 13:17       ` Steven Rostedt
2009-05-08 13:43         ` Wu Fengguang
2009-05-08 20:24     ` [PATCH 4/8] proc: export more page flags in /proc/kpageflags Andrew Morton
2009-05-09 10:44       ` Ingo Molnar
2009-05-10  3:58         ` Andrew Morton
2009-05-10  5:26         ` Andrew Morton
2009-05-11 11:45           ` Ingo Molnar
2009-05-11 18:31             ` Andrew Morton
2009-05-11 22:08               ` Ingo Molnar
2009-05-11 19:03             ` Andy Isaacson
2009-05-08 10:53 ` [PATCH 5/8] pagemap: document clarifications Wu Fengguang
2009-05-08 10:53 ` [PATCH 6/8] pagemap: document 9 more exported page flags Wu Fengguang
2009-05-09  8:13   ` KOSAKI Motohiro
2009-05-09  8:18     ` Wu Fengguang
2009-05-08 10:53 ` [PATCH 7/8] pagemap: add page-types tool Wu Fengguang
2009-05-08 10:53 ` [PATCH 8/8] pagemap: export PG_hwpoison Wu Fengguang
2009-05-08 11:49   ` Ingo Molnar
