All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] proc: export more page flags in /proc/kpageflags (take 4)
@ 2009-04-28  1:09 ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, KOSAKI Motohiro, Wu, Fengguang, Andi Kleen, linux-mm

Hi all,

Export 9 more flags to end users (and more for kernel developers):

        11. KPF_MMAP            (pseudo flag) memory mapped page
        12. KPF_ANON            (pseudo flag) memory mapped page (anonymous)
        13. KPF_SWAPCACHE       page is in swap cache
        14. KPF_SWAPBACKED      page is swap/RAM backed
        15. KPF_COMPOUND_HEAD   (*)
        16. KPF_COMPOUND_TAIL   (*)
        17. KPF_UNEVICTABLE     page is in the unevictable LRU list
        18. KPF_HWPOISON        hardware detected corruption
        19. KPF_NOPAGE          (pseudo flag) no page frame at the address

        (*) For compound pages, exporting _both_ head/tail info enables
            users to tell where a compound page starts/ends, and its order.

Please check the documentary patch and changelog of the final patch
for the details.

	[PATCH 1/5] pagemap: document clarifications                                             
	[PATCH 2/5] pagemap: documentation new page flags                                        
	[PATCH 3/5] mm: introduce PageHuge() for testing huge/gigantic pages                     
	[PATCH 4/5] proc: kpagecount/kpageflags code cleanup                                     
	[PATCH 5/5] proc: export more page flags in /proc/kpageflags                             

Thanks,
Fengguang
-- 


^ permalink raw reply	[flat|nested] 137+ messages in thread

* [PATCH 0/5] proc: export more page flags in /proc/kpageflags (take 4)
@ 2009-04-28  1:09 ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, KOSAKI Motohiro, Wu, Fengguang, Andi Kleen, linux-mm

Hi all,

Export 9 more flags to end users (and more for kernel developers):

        11. KPF_MMAP            (pseudo flag) memory mapped page
        12. KPF_ANON            (pseudo flag) memory mapped page (anonymous)
        13. KPF_SWAPCACHE       page is in swap cache
        14. KPF_SWAPBACKED      page is swap/RAM backed
        15. KPF_COMPOUND_HEAD   (*)
        16. KPF_COMPOUND_TAIL   (*)
        17. KPF_UNEVICTABLE     page is in the unevictable LRU list
        18. KPF_HWPOISON        hardware detected corruption
        19. KPF_NOPAGE          (pseudo flag) no page frame at the address

        (*) For compound pages, exporting _both_ head/tail info enables
            users to tell where a compound page starts/ends, and its order.

Please check the documentary patch and changelog of the final patch
for the details.

	[PATCH 1/5] pagemap: document clarifications                                             
	[PATCH 2/5] pagemap: documentation new page flags                                        
	[PATCH 3/5] mm: introduce PageHuge() for testing huge/gigantic pages                     
	[PATCH 4/5] proc: kpagecount/kpageflags code cleanup                                     
	[PATCH 5/5] proc: export more page flags in /proc/kpageflags                             

Thanks,
Fengguang
-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* [PATCH 1/5] pagemap: document clarifications
  2009-04-28  1:09 ` Wu Fengguang
@ 2009-04-28  1:09   ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: kpageflags-doc-fix.patch --]
[-- Type: text/plain, Size: 1171 bytes --]

Some bit ranges were inclusive and some not.
Fix them to be consistently inclusive.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/pagemap.txt |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- mm.orig/Documentation/vm/pagemap.txt
+++ mm/Documentation/vm/pagemap.txt
@@ -12,9 +12,9 @@ There are three components to pagemap:
    value for each virtual page, containing the following data (from
    fs/proc/task_mmu.c, above pagemap_read):
 
-    * Bits 0-55  page frame number (PFN) if present
+    * Bits 0-54  page frame number (PFN) if present
     * Bits 0-4   swap type if swapped
-    * Bits 5-55  swap offset if swapped
+    * Bits 5-54  swap offset if swapped
     * Bits 55-60 page shift (page size = 1<<page shift)
     * Bit  61    reserved for future use
     * Bit  62    page swapped
@@ -36,7 +36,7 @@ There are three components to pagemap:
  * /proc/kpageflags.  This file contains a 64-bit set of flags for each
    page, indexed by PFN.
 
-   The flags are (from fs/proc/proc_misc, above kpageflags_read):
+   The flags are (from fs/proc/page.c, above kpageflags_read):
 
      0. LOCKED
      1. ERROR

-- 


^ permalink raw reply	[flat|nested] 137+ messages in thread

* [PATCH 1/5] pagemap: document clarifications
@ 2009-04-28  1:09   ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: kpageflags-doc-fix.patch --]
[-- Type: text/plain, Size: 1396 bytes --]

Some bit ranges were inclusive and some not.
Fix them to be consistently inclusive.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/pagemap.txt |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- mm.orig/Documentation/vm/pagemap.txt
+++ mm/Documentation/vm/pagemap.txt
@@ -12,9 +12,9 @@ There are three components to pagemap:
    value for each virtual page, containing the following data (from
    fs/proc/task_mmu.c, above pagemap_read):
 
-    * Bits 0-55  page frame number (PFN) if present
+    * Bits 0-54  page frame number (PFN) if present
     * Bits 0-4   swap type if swapped
-    * Bits 5-55  swap offset if swapped
+    * Bits 5-54  swap offset if swapped
     * Bits 55-60 page shift (page size = 1<<page shift)
     * Bit  61    reserved for future use
     * Bit  62    page swapped
@@ -36,7 +36,7 @@ There are three components to pagemap:
  * /proc/kpageflags.  This file contains a 64-bit set of flags for each
    page, indexed by PFN.
 
-   The flags are (from fs/proc/proc_misc, above kpageflags_read):
+   The flags are (from fs/proc/page.c, above kpageflags_read):
 
      0. LOCKED
      1. ERROR

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* [PATCH 2/5] pagemap: documentation 9 more exported page flags
  2009-04-28  1:09 ` Wu Fengguang
@ 2009-04-28  1:09   ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: kpageflags-doc.patch --]
[-- Type: text/plain, Size: 2990 bytes --]

Also add short descriptions for all of the 20 exported page flags.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/pagemap.txt |   62 +++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

--- mm.orig/Documentation/vm/pagemap.txt
+++ mm/Documentation/vm/pagemap.txt
@@ -49,6 +49,68 @@ There are three components to pagemap:
      8. WRITEBACK
      9. RECLAIM
     10. BUDDY
+    11. MMAP
+    12. ANON
+    13. SWAPCACHE
+    14. SWAPBACKED
+    15. COMPOUND_HEAD
+    16. COMPOUND_TAIL
+    17. UNEVICTABLE
+    18. HWPOISON
+    19. NOPAGE
+
+Short descriptions to the page flags:
+
+ 0. LOCKED
+    page is being locked for exclusive access, eg. by undergoing read/write IO
+
+ 7. SLAB
+    page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
+    When compound page is used, SLUB/SLQB will only set this flag on the head
+    page; SLOB will not flag it at all.
+
+10. BUDDY
+    a free memory block managed by the buddy system allocator
+    The buddy system organizes free memory in blocks of various orders.
+    An order N block has 2^N physically contiguous pages, with the BUDDY flag
+    set for and _only_ for the first page.
+
+15. COMPOUND_HEAD
+16. COMPOUND_TAIL
+    A compound page with order N consists of 2^N physically contiguous pages.
+    A compound page with order 2 takes the form of "HTTT", where H donates its
+    head page and T donates its tail page(s).  The major consumers of compound
+    pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc.
+    memory allocators and various device drivers. However in this interface,
+    only huge/giga pages are made visible to end users.
+
+18. HWPOISON
+    hardware detected memory corruption on this page: don't touch the data!
+
+19. NOPAGE
+    no page frame exists at the requested address
+
+    [IO related page flags]
+ 1. ERROR     IO error occurred
+ 3. UPTODATE  page has up-to-date data
+              ie. for file backed page: (in-memory data revision >= on-disk one)
+ 4. DIRTY     page has been written to, hence contains new data
+              ie. for file backed page: (in-memory data revision >  on-disk one)
+ 8. WRITEBACK page is being synced to disk
+
+    [LRU related page flags]
+ 5. LRU         page is in one of the LRU lists
+ 6. ACTIVE      page is in the active LRU list
+17. UNEVICTABLE page is in the unevictable (non-)LRU list
+                It is somehow pinned and not a candidate for LRU page reclaims,
+		eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments
+ 2. REFERENCED  page has been referenced since last LRU list enqueue/requeue
+ 9. RECLAIM     page will be reclaimed soon after its pageout IO completed
+11. MMAP        a memory mapped page
+12. ANON        a memory mapped page that is not part of a file
+13. SWAPCACHE   page is mapped to swap space, ie. has an associated swap entry
+14. SWAPBACKED  page is backed by swap/RAM
+
 
 Using pagemap to do something useful:
 

-- 


^ permalink raw reply	[flat|nested] 137+ messages in thread

* [PATCH 2/5] pagemap: documentation 9 more exported page flags
@ 2009-04-28  1:09   ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: kpageflags-doc.patch --]
[-- Type: text/plain, Size: 3215 bytes --]

Also add short descriptions for all of the 20 exported page flags.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/pagemap.txt |   62 +++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

--- mm.orig/Documentation/vm/pagemap.txt
+++ mm/Documentation/vm/pagemap.txt
@@ -49,6 +49,68 @@ There are three components to pagemap:
      8. WRITEBACK
      9. RECLAIM
     10. BUDDY
+    11. MMAP
+    12. ANON
+    13. SWAPCACHE
+    14. SWAPBACKED
+    15. COMPOUND_HEAD
+    16. COMPOUND_TAIL
+    17. UNEVICTABLE
+    18. HWPOISON
+    19. NOPAGE
+
+Short descriptions to the page flags:
+
+ 0. LOCKED
+    page is being locked for exclusive access, eg. by undergoing read/write IO
+
+ 7. SLAB
+    page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
+    When compound page is used, SLUB/SLQB will only set this flag on the head
+    page; SLOB will not flag it at all.
+
+10. BUDDY
+    a free memory block managed by the buddy system allocator
+    The buddy system organizes free memory in blocks of various orders.
+    An order N block has 2^N physically contiguous pages, with the BUDDY flag
+    set for and _only_ for the first page.
+
+15. COMPOUND_HEAD
+16. COMPOUND_TAIL
+    A compound page with order N consists of 2^N physically contiguous pages.
+    A compound page with order 2 takes the form of "HTTT", where H donates its
+    head page and T donates its tail page(s).  The major consumers of compound
+    pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc.
+    memory allocators and various device drivers. However in this interface,
+    only huge/giga pages are made visible to end users.
+
+18. HWPOISON
+    hardware detected memory corruption on this page: don't touch the data!
+
+19. NOPAGE
+    no page frame exists at the requested address
+
+    [IO related page flags]
+ 1. ERROR     IO error occurred
+ 3. UPTODATE  page has up-to-date data
+              ie. for file backed page: (in-memory data revision >= on-disk one)
+ 4. DIRTY     page has been written to, hence contains new data
+              ie. for file backed page: (in-memory data revision >  on-disk one)
+ 8. WRITEBACK page is being synced to disk
+
+    [LRU related page flags]
+ 5. LRU         page is in one of the LRU lists
+ 6. ACTIVE      page is in the active LRU list
+17. UNEVICTABLE page is in the unevictable (non-)LRU list
+                It is somehow pinned and not a candidate for LRU page reclaims,
+		eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments
+ 2. REFERENCED  page has been referenced since last LRU list enqueue/requeue
+ 9. RECLAIM     page will be reclaimed soon after its pageout IO completed
+11. MMAP        a memory mapped page
+12. ANON        a memory mapped page that is not part of a file
+13. SWAPCACHE   page is mapped to swap space, ie. has an associated swap entry
+14. SWAPBACKED  page is backed by swap/RAM
+
 
 Using pagemap to do something useful:
 

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* [PATCH 3/5] mm: introduce PageHuge() for testing huge/gigantic pages
  2009-04-28  1:09 ` Wu Fengguang
@ 2009-04-28  1:09   ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: giga-page.patch --]
[-- Type: text/plain, Size: 2112 bytes --]

Introduce PageHuge(), which identifies huge/gigantic pages
by their dedicated compound destructor functions.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/mm.h |   24 ++++++++++++++++++++++++
 mm/hugetlb.c       |    2 +-
 mm/page_alloc.c    |   11 ++++++++++-
 3 files changed, 35 insertions(+), 2 deletions(-)

--- mm.orig/mm/page_alloc.c
+++ mm/mm/page_alloc.c
@@ -299,13 +299,22 @@ void prep_compound_page(struct page *pag
 }
 
 #ifdef CONFIG_HUGETLBFS
+/*
+ * This (duplicated) destructor function distinguishes gigantic pages from
+ * normal compound pages.
+ */
+void free_gigantic_page(struct page *page)
+{
+	__free_pages_ok(page, compound_order(page));
+}
+
 void prep_compound_gigantic_page(struct page *page, unsigned long order)
 {
 	int i;
 	int nr_pages = 1 << order;
 	struct page *p = page + 1;
 
-	set_compound_page_dtor(page, free_compound_page);
+	set_compound_page_dtor(page, free_gigantic_page);
 	set_compound_order(page, order);
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
--- mm.orig/mm/hugetlb.c
+++ mm/mm/hugetlb.c
@@ -550,7 +550,7 @@ struct hstate *size_to_hstate(unsigned l
 	return NULL;
 }
 
-static void free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
--- mm.orig/include/linux/mm.h
+++ mm/include/linux/mm.h
@@ -355,6 +355,30 @@ static inline void set_compound_order(st
 	page[1].lru.prev = (void *)order;
 }
 
+#ifdef CONFIG_HUGETLBFS
+void free_huge_page(struct page *page);
+void free_gigantic_page(struct page *page);
+
+static inline int PageHuge(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	if (!PageCompound(page))
+		return 0;
+
+	page = compound_head(page);
+	dtor = get_compound_page_dtor(page);
+
+	return  dtor == free_huge_page ||
+		dtor == free_gigantic_page;
+}
+#else
+static inline int PageHuge(struct page *page)
+{
+	return 0;
+}
+#endif
+
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of

-- 


^ permalink raw reply	[flat|nested] 137+ messages in thread

* [PATCH 3/5] mm: introduce PageHuge() for testing huge/gigantic pages
@ 2009-04-28  1:09   ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: giga-page.patch --]
[-- Type: text/plain, Size: 2337 bytes --]

Introduce PageHuge(), which identifies huge/gigantic pages
by their dedicated compound destructor functions.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/mm.h |   24 ++++++++++++++++++++++++
 mm/hugetlb.c       |    2 +-
 mm/page_alloc.c    |   11 ++++++++++-
 3 files changed, 35 insertions(+), 2 deletions(-)

--- mm.orig/mm/page_alloc.c
+++ mm/mm/page_alloc.c
@@ -299,13 +299,22 @@ void prep_compound_page(struct page *pag
 }
 
 #ifdef CONFIG_HUGETLBFS
+/*
+ * This (duplicated) destructor function distinguishes gigantic pages from
+ * normal compound pages.
+ */
+void free_gigantic_page(struct page *page)
+{
+	__free_pages_ok(page, compound_order(page));
+}
+
 void prep_compound_gigantic_page(struct page *page, unsigned long order)
 {
 	int i;
 	int nr_pages = 1 << order;
 	struct page *p = page + 1;
 
-	set_compound_page_dtor(page, free_compound_page);
+	set_compound_page_dtor(page, free_gigantic_page);
 	set_compound_order(page, order);
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
--- mm.orig/mm/hugetlb.c
+++ mm/mm/hugetlb.c
@@ -550,7 +550,7 @@ struct hstate *size_to_hstate(unsigned l
 	return NULL;
 }
 
-static void free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
--- mm.orig/include/linux/mm.h
+++ mm/include/linux/mm.h
@@ -355,6 +355,30 @@ static inline void set_compound_order(st
 	page[1].lru.prev = (void *)order;
 }
 
+#ifdef CONFIG_HUGETLBFS
+void free_huge_page(struct page *page);
+void free_gigantic_page(struct page *page);
+
+static inline int PageHuge(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	if (!PageCompound(page))
+		return 0;
+
+	page = compound_head(page);
+	dtor = get_compound_page_dtor(page);
+
+	return  dtor == free_huge_page ||
+		dtor == free_gigantic_page;
+}
+#else
+static inline int PageHuge(struct page *page)
+{
+	return 0;
+}
+#endif
+
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* [PATCH 4/5] proc: kpagecount/kpageflags code cleanup
  2009-04-28  1:09 ` Wu Fengguang
@ 2009-04-28  1:09   ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: kpageflags-fix-out.patch --]
[-- Type: text/plain, Size: 1254 bytes --]

Move increments of pfn/out to bottom of the loop.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/proc/page.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

--- mm.orig/fs/proc/page.c
+++ mm/fs/proc/page.c
@@ -32,20 +32,22 @@ static ssize_t kpagecount_read(struct fi
 		return -EINVAL;
 
 	while (count > 0) {
-		ppage = NULL;
 		if (pfn_valid(pfn))
 			ppage = pfn_to_page(pfn);
-		pfn++;
+		else
+			ppage = NULL;
 		if (!ppage)
 			pcount = 0;
 		else
 			pcount = page_mapcount(ppage);
 
-		if (put_user(pcount, out++)) {
+		if (put_user(pcount, out)) {
 			ret = -EFAULT;
 			break;
 		}
 
+		pfn++;
+		out++;
 		count -= KPMSIZE;
 	}
 
@@ -98,10 +100,10 @@ static ssize_t kpageflags_read(struct fi
 		return -EINVAL;
 
 	while (count > 0) {
-		ppage = NULL;
 		if (pfn_valid(pfn))
 			ppage = pfn_to_page(pfn);
-		pfn++;
+		else
+			ppage = NULL;
 		if (!ppage)
 			kflags = 0;
 		else
@@ -119,11 +121,13 @@ static ssize_t kpageflags_read(struct fi
 			kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
 			kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
 
-		if (put_user(uflags, out++)) {
+		if (put_user(uflags, out)) {
 			ret = -EFAULT;
 			break;
 		}
 
+		pfn++;
+		out++;
 		count -= KPMSIZE;
 	}
 

-- 


^ permalink raw reply	[flat|nested] 137+ messages in thread

* [PATCH 4/5] proc: kpagecount/kpageflags code cleanup
@ 2009-04-28  1:09   ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm

[-- Attachment #1: kpageflags-fix-out.patch --]
[-- Type: text/plain, Size: 1479 bytes --]

Move increments of pfn/out to bottom of the loop.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/proc/page.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

--- mm.orig/fs/proc/page.c
+++ mm/fs/proc/page.c
@@ -32,20 +32,22 @@ static ssize_t kpagecount_read(struct fi
 		return -EINVAL;
 
 	while (count > 0) {
-		ppage = NULL;
 		if (pfn_valid(pfn))
 			ppage = pfn_to_page(pfn);
-		pfn++;
+		else
+			ppage = NULL;
 		if (!ppage)
 			pcount = 0;
 		else
 			pcount = page_mapcount(ppage);
 
-		if (put_user(pcount, out++)) {
+		if (put_user(pcount, out)) {
 			ret = -EFAULT;
 			break;
 		}
 
+		pfn++;
+		out++;
 		count -= KPMSIZE;
 	}
 
@@ -98,10 +100,10 @@ static ssize_t kpageflags_read(struct fi
 		return -EINVAL;
 
 	while (count > 0) {
-		ppage = NULL;
 		if (pfn_valid(pfn))
 			ppage = pfn_to_page(pfn);
-		pfn++;
+		else
+			ppage = NULL;
 		if (!ppage)
 			kflags = 0;
 		else
@@ -119,11 +121,13 @@ static ssize_t kpageflags_read(struct fi
 			kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
 			kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
 
-		if (put_user(uflags, out++)) {
+		if (put_user(uflags, out)) {
 			ret = -EFAULT;
 			break;
 		}
 
+		pfn++;
+		out++;
 		count -= KPMSIZE;
 	}
 

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  1:09 ` Wu Fengguang
@ 2009-04-28  1:09   ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall, Alexey Dobriyan,
	Wu Fengguang, linux-mm

[-- Attachment #1: kpageflags-extending.patch --]
[-- Type: text/plain, Size: 13723 bytes --]

Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.

1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
   - all available page flags are exported, and
   - exported as is
2) for admins and end users
   - only the more `well known' flags are exported:
	11. KPF_MMAP		(pseudo flag) memory mapped page
	12. KPF_ANON		(pseudo flag) memory mapped page (anonymous)
	13. KPF_SWAPCACHE	page is in swap cache
	14. KPF_SWAPBACKED	page is swap/RAM backed
	15. KPF_COMPOUND_HEAD	(*)
	16. KPF_COMPOUND_TAIL	(*)
	17. KPF_UNEVICTABLE	page is in the unevictable LRU list
	18. KPF_HWPOISON	hardware detected corruption
	19. KPF_NOPAGE		(pseudo flag) no page frame at the address

	(*) For compound pages, exporting _both_ head/tail info enables
	    users to tell where a compound page starts/ends, and its order.

   - limit flags to their typical usage scenario, as indicated by KOSAKI:
	- LRU pages: only export relevant flags
		- PG_lru
		- PG_unevictable
		- PG_active
		- PG_referenced
		- page_mapped()
		- PageAnon()
		- PG_swapcache
		- PG_swapbacked
		- PG_reclaim
	- no-IO pages: mask out irrelevant flags
		- PG_dirty
		- PG_uptodate
		- PG_writeback
	- SLAB pages: mask out overloaded flags:
		- PG_error
		- PG_active
		- PG_private
	- PG_reclaim: mask out the overloaded PG_readahead
	- compound flags: only export huge/gigantic pages

Here are the admin/linus views of all page flags on a newly booted nfs-root system:

# ./page-types # for admin
         flags  page-count       MB  symbolic-flags                     long-symbolic-flags
0x000000000000      491174     1918  ____________________________                
0x000000000020           1        0  _____l______________________       lru      
0x000000000028        2543        9  ___U_l______________________       uptodate,lru
0x00000000002c        5288       20  __RU_l______________________       referenced,uptodate,lru
0x000000004060           1        0  _____lA_______b_____________       lru,active,swapbacked
0x000000004064          19        0  __R__lA_______b_____________       referenced,lru,active,swapbacked
0x000000000068         225        0  ___U_lA_____________________       uptodate,lru,active
0x00000000006c         969        3  __RU_lA_____________________       referenced,uptodate,lru,active
0x000000000080        6832       26  _______S____________________       slab     
0x000000000400         576        2  __________B_________________       buddy    
0x000000000828        1159        4  ___U_l_____M________________       uptodate,lru,mmap
0x00000000082c         310        1  __RU_l_____M________________       referenced,uptodate,lru,mmap
0x000000004860           2        0  _____lA____M__b_____________       lru,active,mmap,swapbacked
0x000000000868         375        1  ___U_lA____M________________       uptodate,lru,active,mmap
0x00000000086c         635        2  __RU_lA____M________________       referenced,uptodate,lru,active,mmap
0x000000005860        3831       14  _____lA____Ma_b_____________       lru,active,mmap,anonymous,swapbacked
0x000000005864          28        0  __R__lA____Ma_b_____________       referenced,lru,active,mmap,anonymous,swapbacked
         total      513968     2007                                              

# ./page-types # for linus, when CONFIG_DEBUG_KERNEL is turned on
         flags  page-count       MB  symbolic-flags                     long-symbolic-flags
0x000000000000      471058     1840  ____________________________
0x000100000000       19288       75  ____________________r_______       reserved
0x000000010000        1064        4  ________________T___________       compound_tail
0x000000008000           1        0  _______________H____________       compound_head
0x000000008014           1        0  __R_D__________H____________       referenced,dirty,compound_head
0x000000010014           4        0  __R_D___________T___________       referenced,dirty,compound_tail
0x000000000020           1        0  _____l______________________       lru
0x000000000028        2522        9  ___U_l______________________       uptodate,lru
0x00000000002c        5207       20  __RU_l______________________       referenced,uptodate,lru
0x000000000068         203        0  ___U_lA_____________________       uptodate,lru,active
0x00000000006c         869        3  __RU_lA_____________________       referenced,uptodate,lru,active
0x000000004078           1        0  ___UDlA_______b_____________       uptodate,dirty,lru,active,swapbacked
0x00000000407c          19        0  __RUDlA_______b_____________       referenced,uptodate,dirty,lru,active,swapbacked
0x000000000080        5989       23  _______S____________________       slab
0x000000008080         778        3  _______S_______H____________       slab,compound_head
0x000000000228          44        0  ___U_l___I__________________       uptodate,lru,reclaim
0x00000000022c          39        0  __RU_l___I__________________       referenced,uptodate,lru,reclaim
0x000000000268          12        0  ___U_lA__I__________________       uptodate,lru,active,reclaim
0x00000000026c          44        0  __RU_lA__I__________________       referenced,uptodate,lru,active,reclaim
0x000000000400         550        2  __________B_________________       buddy
0x000000000804           1        0  __R________M________________       referenced,mmap
0x000000000828        1068        4  ___U_l_____M________________       uptodate,lru,mmap
0x00000000082c         326        1  __RU_l_____M________________       referenced,uptodate,lru,mmap
0x000000000868         335        1  ___U_lA____M________________       uptodate,lru,active,mmap
0x00000000086c         599        2  __RU_lA____M________________       referenced,uptodate,lru,active,mmap
0x000000004878           2        0  ___UDlA____M__b_____________       uptodate,dirty,lru,active,mmap,swapbacked
0x000000000a28          44        0  ___U_l___I_M________________       uptodate,lru,reclaim,mmap
0x000000000a2c          12        0  __RU_l___I_M________________       referenced,uptodate,lru,reclaim,mmap
0x000000000a68           8        0  ___U_lA__I_M________________       uptodate,lru,active,reclaim,mmap
0x000000000a6c          31        0  __RU_lA__I_M________________       referenced,uptodate,lru,active,reclaim,mmap
0x000000001000         442        1  ____________a_______________       anonymous
0x000000005808           7        0  ___U_______Ma_b_____________       uptodate,mmap,anonymous,swapbacked
0x000000005868        3371       13  ___U_lA____Ma_b_____________       uptodate,lru,active,mmap,anonymous,swapbacked                
0x00000000586c          28        0  __RU_lA____Ma_b_____________       referenced,uptodate,lru,active,mmap,anonymous,swapbacked
         total      513968     2007

Thanks to KOSAKI and Andi for their valuable recommendations!

Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/proc/page.c |  197 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 167 insertions(+), 30 deletions(-)

--- mm.orig/fs/proc/page.c
+++ mm/fs/proc/page.c
@@ -6,6 +6,7 @@
 #include <linux/mmzone.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/backing-dev.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -70,19 +71,172 @@ static const struct file_operations proc
 
 /* These macros are used to decouple internal flags from exported ones */
 
-#define KPF_LOCKED     0
-#define KPF_ERROR      1
-#define KPF_REFERENCED 2
-#define KPF_UPTODATE   3
-#define KPF_DIRTY      4
-#define KPF_LRU        5
-#define KPF_ACTIVE     6
-#define KPF_SLAB       7
-#define KPF_WRITEBACK  8
-#define KPF_RECLAIM    9
-#define KPF_BUDDY     10
+#define KPF_LOCKED		0
+#define KPF_ERROR		1
+#define KPF_REFERENCED		2
+#define KPF_UPTODATE		3
+#define KPF_DIRTY		4
+#define KPF_LRU			5
+#define KPF_ACTIVE		6
+#define KPF_SLAB		7
+#define KPF_WRITEBACK		8
+#define KPF_RECLAIM		9
+#define KPF_BUDDY		10
+
+/* new additions in 2.6.31 */
+#define KPF_MMAP		11
+#define KPF_ANON		12
+#define KPF_SWAPCACHE		13
+#define KPF_SWAPBACKED		14
+#define KPF_COMPOUND_HEAD	15
+#define KPF_COMPOUND_TAIL	16
+#define KPF_UNEVICTABLE		17
+#define KPF_HWPOISON		18
+#define KPF_NOPAGE		19
+
+/* kernel hacking assistances */
+#define KPF_RESERVED		32
+#define KPF_MLOCKED		33
+#define KPF_MAPPEDTODISK	34
+#define KPF_PRIVATE		35
+#define KPF_PRIVATE2		36
+#define KPF_OWNER_PRIVATE	37
+#define KPF_ARCH		38
+#define KPF_UNCACHED		39
+
+/*
+ * Kernel flags are exported faithfully to Linus and his fellow hackers.
+ * Otherwise some details are masked to avoid confusing the end user:
+ * - some kernel flags are completely invisible
+ * - some kernel flags are conditionally invisible on their odd usages
+ */
+#ifdef CONFIG_DEBUG_KERNEL
+static inline int genuine_linus(void) { return 1; }
+#else
+static inline int genuine_linus(void) { return 0; }
+#endif
+
+#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)		\
+	do {								\
+		if (visible || genuine_linus())				\
+			uflags |= ((kflags >> kbit) & 1) << ubit;	\
+	} while (0);
+
+/* a helper function _not_ intended for more general uses */
+static inline int page_cap_writeback_dirty(struct page *page)
+{
+	struct address_space *mapping;
+
+	if (!PageSlab(page))
+		mapping = page_mapping(page);
+	else
+		mapping = NULL;
+
+	return mapping && mapping_cap_writeback_dirty(mapping);
+}
+
+static u64 get_uflags(struct page *page)
+{
+	u64 k;
+	u64 u;
+	int io;
+	int lru;
+	int slab;
+
+	/*
+	 * pseudo flag: KPF_NOPAGE
+	 * it differentiates a memory hole from a page with no flags
+	 */
+	if (!page)
+		return 1 << KPF_NOPAGE;
+
+	k = page->flags;
+	u = 0;
+
+	io   = page_cap_writeback_dirty(page);
+	lru  = k & (1 << PG_lru);
+	slab = k & (1 << PG_slab);
+
+	/*
+	 * pseudo flags for the well known (anonymous) memory mapped pages
+	 */
+	if (lru || genuine_linus()) {
+		if (!slab && page_mapped(page))
+			u |= 1 << KPF_MMAP;
+		if (PageAnon(page))
+			u |= 1 << KPF_ANON;
+	}
 
-#define kpf_copy_bit(flags, dstpos, srcpos) (((flags >> srcpos) & 1) << dstpos)
+	/*
+	 * compound pages: export both head/tail info
+	 * they together define a compound page's start/end pos and order
+	 */
+	if (PageHuge(page) || genuine_linus()) {
+		if (PageHead(page))
+			u |= 1 << KPF_COMPOUND_HEAD;
+		if (PageTail(page))
+			u |= 1 << KPF_COMPOUND_TAIL;
+	}
+
+	kpf_copy_bit(u, k, 1,	  KPF_LOCKED,		PG_locked);
+
+	/*
+	 * Caveats on high order pages:
+	 * PG_buddy will only be set on the head page; SLUB/SLQB do the same
+	 * for PG_slab; SLOB won't set PG_slab at all on compound pages.
+	 */
+	kpf_copy_bit(u, k, 1,     KPF_SLAB,		PG_slab);
+	kpf_copy_bit(u, k, 1,     KPF_BUDDY,		PG_buddy);
+
+	kpf_copy_bit(u, k, io,    KPF_ERROR,		PG_error);
+	kpf_copy_bit(u, k, io,    KPF_DIRTY,		PG_dirty);
+	kpf_copy_bit(u, k, io,    KPF_UPTODATE,		PG_uptodate);
+	kpf_copy_bit(u, k, io,    KPF_WRITEBACK,	PG_writeback);
+
+	kpf_copy_bit(u, k, 1,     KPF_LRU,		PG_lru);
+	kpf_copy_bit(u, k, lru,	  KPF_REFERENCED,	PG_referenced);
+	kpf_copy_bit(u, k, lru,   KPF_ACTIVE,		PG_active);
+	kpf_copy_bit(u, k, lru,   KPF_RECLAIM,		PG_reclaim);
+
+	kpf_copy_bit(u, k, lru,   KPF_SWAPCACHE,	PG_swapcache);
+	kpf_copy_bit(u, k, lru,   KPF_SWAPBACKED,	PG_swapbacked);
+
+#ifdef CONFIG_MEMORY_FAILURE
+	kpf_copy_bit(u, k, 1,     KPF_HWPOISON,		PG_hwpoison);
+#endif
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+	kpf_copy_bit(u, k, lru,   KPF_UNEVICTABLE,	PG_unevictable);
+	kpf_copy_bit(u, k, 0,     KPF_MLOCKED,		PG_mlocked);
+#endif
+
+	kpf_copy_bit(u, k, 0,     KPF_RESERVED,		PG_reserved);
+	kpf_copy_bit(u, k, 0,     KPF_MAPPEDTODISK,	PG_mappedtodisk);
+	kpf_copy_bit(u, k, 0,     KPF_PRIVATE,		PG_private);
+	kpf_copy_bit(u, k, 0,     KPF_PRIVATE2,		PG_private_2);
+	kpf_copy_bit(u, k, 0,     KPF_OWNER_PRIVATE,	PG_owner_priv_1);
+	kpf_copy_bit(u, k, 0,     KPF_ARCH,		PG_arch_1);
+
+#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
+	kpf_copy_bit(u, k, 0,     KPF_UNCACHED,		PG_uncached);
+#endif
+
+	if (!genuine_linus()) {
+		/*
+		 * SLUB overloads some page flags which may confuse end user.
+		 */
+		if (slab)
+			u &= ~((1 << KPF_ACTIVE) | (1 << KPF_ERROR));
+		/*
+		 * PG_reclaim could be overloaded as PG_readahead,
+		 * and we only want to export the first one.
+		 */
+		if (!(u & (1 << KPF_WRITEBACK)))
+			u &= ~(1 << KPF_RECLAIM);
+	}
+
+	return u;
+};
 
 static ssize_t kpageflags_read(struct file *file, char __user *buf,
 			     size_t count, loff_t *ppos)
@@ -92,7 +246,6 @@ static ssize_t kpageflags_read(struct fi
 	unsigned long src = *ppos;
 	unsigned long pfn;
 	ssize_t ret = 0;
-	u64 kflags, uflags;
 
 	pfn = src / KPMSIZE;
 	count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
@@ -104,24 +257,8 @@ static ssize_t kpageflags_read(struct fi
 			ppage = pfn_to_page(pfn);
 		else
 			ppage = NULL;
-		if (!ppage)
-			kflags = 0;
-		else
-			kflags = ppage->flags;
-
-		uflags = kpf_copy_bit(kflags, KPF_LOCKED, PG_locked) |
-			kpf_copy_bit(kflags, KPF_ERROR, PG_error) |
-			kpf_copy_bit(kflags, KPF_REFERENCED, PG_referenced) |
-			kpf_copy_bit(kflags, KPF_UPTODATE, PG_uptodate) |
-			kpf_copy_bit(kflags, KPF_DIRTY, PG_dirty) |
-			kpf_copy_bit(kflags, KPF_LRU, PG_lru) |
-			kpf_copy_bit(kflags, KPF_ACTIVE, PG_active) |
-			kpf_copy_bit(kflags, KPF_SLAB, PG_slab) |
-			kpf_copy_bit(kflags, KPF_WRITEBACK, PG_writeback) |
-			kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
-			kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
 
-		if (put_user(uflags, out)) {
+		if (put_user(get_uflags(ppage), out)) {
 			ret = -EFAULT;
 			break;
 		}

-- 


^ permalink raw reply	[flat|nested] 137+ messages in thread

* [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28  1:09   ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  1:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall, Alexey Dobriyan,
	Wu Fengguang, linux-mm

[-- Attachment #1: kpageflags-extending.patch --]
[-- Type: text/plain, Size: 13948 bytes --]

Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.

1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
   - all available page flags are exported, and
   - exported as is
2) for admins and end users
   - only the more `well known' flags are exported:
	11. KPF_MMAP		(pseudo flag) memory mapped page
	12. KPF_ANON		(pseudo flag) memory mapped page (anonymous)
	13. KPF_SWAPCACHE	page is in swap cache
	14. KPF_SWAPBACKED	page is swap/RAM backed
	15. KPF_COMPOUND_HEAD	(*)
	16. KPF_COMPOUND_TAIL	(*)
	17. KPF_UNEVICTABLE	page is in the unevictable LRU list
	18. KPF_HWPOISON	hardware detected corruption
	19. KPF_NOPAGE		(pseudo flag) no page frame at the address

	(*) For compound pages, exporting _both_ head/tail info enables
	    users to tell where a compound page starts/ends, and its order.

   - limit flags to their typical usage scenario, as indicated by KOSAKI:
	- LRU pages: only export relevant flags
		- PG_lru
		- PG_unevictable
		- PG_active
		- PG_referenced
		- page_mapped()
		- PageAnon()
		- PG_swapcache
		- PG_swapbacked
		- PG_reclaim
	- no-IO pages: mask out irrelevant flags
		- PG_dirty
		- PG_uptodate
		- PG_writeback
	- SLAB pages: mask out overloaded flags:
		- PG_error
		- PG_active
		- PG_private
	- PG_reclaim: mask out the overloaded PG_readahead
	- compound flags: only export huge/gigantic pages

Here are the admin/linus views of all page flags on a newly booted nfs-root system:

# ./page-types # for admin
         flags  page-count       MB  symbolic-flags                     long-symbolic-flags
0x000000000000      491174     1918  ____________________________                
0x000000000020           1        0  _____l______________________       lru      
0x000000000028        2543        9  ___U_l______________________       uptodate,lru
0x00000000002c        5288       20  __RU_l______________________       referenced,uptodate,lru
0x000000004060           1        0  _____lA_______b_____________       lru,active,swapbacked
0x000000004064          19        0  __R__lA_______b_____________       referenced,lru,active,swapbacked
0x000000000068         225        0  ___U_lA_____________________       uptodate,lru,active
0x00000000006c         969        3  __RU_lA_____________________       referenced,uptodate,lru,active
0x000000000080        6832       26  _______S____________________       slab     
0x000000000400         576        2  __________B_________________       buddy    
0x000000000828        1159        4  ___U_l_____M________________       uptodate,lru,mmap
0x00000000082c         310        1  __RU_l_____M________________       referenced,uptodate,lru,mmap
0x000000004860           2        0  _____lA____M__b_____________       lru,active,mmap,swapbacked
0x000000000868         375        1  ___U_lA____M________________       uptodate,lru,active,mmap
0x00000000086c         635        2  __RU_lA____M________________       referenced,uptodate,lru,active,mmap
0x000000005860        3831       14  _____lA____Ma_b_____________       lru,active,mmap,anonymous,swapbacked
0x000000005864          28        0  __R__lA____Ma_b_____________       referenced,lru,active,mmap,anonymous,swapbacked
         total      513968     2007                                              

# ./page-types # for linus, when CONFIG_DEBUG_KERNEL is turned on
         flags  page-count       MB  symbolic-flags                     long-symbolic-flags
0x000000000000      471058     1840  ____________________________
0x000100000000       19288       75  ____________________r_______       reserved
0x000000010000        1064        4  ________________T___________       compound_tail
0x000000008000           1        0  _______________H____________       compound_head
0x000000008014           1        0  __R_D__________H____________       referenced,dirty,compound_head
0x000000010014           4        0  __R_D___________T___________       referenced,dirty,compound_tail
0x000000000020           1        0  _____l______________________       lru
0x000000000028        2522        9  ___U_l______________________       uptodate,lru
0x00000000002c        5207       20  __RU_l______________________       referenced,uptodate,lru
0x000000000068         203        0  ___U_lA_____________________       uptodate,lru,active
0x00000000006c         869        3  __RU_lA_____________________       referenced,uptodate,lru,active
0x000000004078           1        0  ___UDlA_______b_____________       uptodate,dirty,lru,active,swapbacked
0x00000000407c          19        0  __RUDlA_______b_____________       referenced,uptodate,dirty,lru,active,swapbacked
0x000000000080        5989       23  _______S____________________       slab
0x000000008080         778        3  _______S_______H____________       slab,compound_head
0x000000000228          44        0  ___U_l___I__________________       uptodate,lru,reclaim
0x00000000022c          39        0  __RU_l___I__________________       referenced,uptodate,lru,reclaim
0x000000000268          12        0  ___U_lA__I__________________       uptodate,lru,active,reclaim
0x00000000026c          44        0  __RU_lA__I__________________       referenced,uptodate,lru,active,reclaim
0x000000000400         550        2  __________B_________________       buddy
0x000000000804           1        0  __R________M________________       referenced,mmap
0x000000000828        1068        4  ___U_l_____M________________       uptodate,lru,mmap
0x00000000082c         326        1  __RU_l_____M________________       referenced,uptodate,lru,mmap
0x000000000868         335        1  ___U_lA____M________________       uptodate,lru,active,mmap
0x00000000086c         599        2  __RU_lA____M________________       referenced,uptodate,lru,active,mmap
0x000000004878           2        0  ___UDlA____M__b_____________       uptodate,dirty,lru,active,mmap,swapbacked
0x000000000a28          44        0  ___U_l___I_M________________       uptodate,lru,reclaim,mmap
0x000000000a2c          12        0  __RU_l___I_M________________       referenced,uptodate,lru,reclaim,mmap
0x000000000a68           8        0  ___U_lA__I_M________________       uptodate,lru,active,reclaim,mmap
0x000000000a6c          31        0  __RU_lA__I_M________________       referenced,uptodate,lru,active,reclaim,mmap
0x000000001000         442        1  ____________a_______________       anonymous
0x000000005808           7        0  ___U_______Ma_b_____________       uptodate,mmap,anonymous,swapbacked
0x000000005868        3371       13  ___U_lA____Ma_b_____________       uptodate,lru,active,mmap,anonymous,swapbacked                
0x00000000586c          28        0  __RU_lA____Ma_b_____________       referenced,uptodate,lru,active,mmap,anonymous,swapbacked
         total      513968     2007

Thanks to KOSAKI and Andi for their valuable recommendations!

Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/proc/page.c |  197 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 167 insertions(+), 30 deletions(-)

--- mm.orig/fs/proc/page.c
+++ mm/fs/proc/page.c
@@ -6,6 +6,7 @@
 #include <linux/mmzone.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/backing-dev.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -70,19 +71,172 @@ static const struct file_operations proc
 
 /* These macros are used to decouple internal flags from exported ones */
 
-#define KPF_LOCKED     0
-#define KPF_ERROR      1
-#define KPF_REFERENCED 2
-#define KPF_UPTODATE   3
-#define KPF_DIRTY      4
-#define KPF_LRU        5
-#define KPF_ACTIVE     6
-#define KPF_SLAB       7
-#define KPF_WRITEBACK  8
-#define KPF_RECLAIM    9
-#define KPF_BUDDY     10
+#define KPF_LOCKED		0
+#define KPF_ERROR		1
+#define KPF_REFERENCED		2
+#define KPF_UPTODATE		3
+#define KPF_DIRTY		4
+#define KPF_LRU			5
+#define KPF_ACTIVE		6
+#define KPF_SLAB		7
+#define KPF_WRITEBACK		8
+#define KPF_RECLAIM		9
+#define KPF_BUDDY		10
+
+/* new additions in 2.6.31 */
+#define KPF_MMAP		11
+#define KPF_ANON		12
+#define KPF_SWAPCACHE		13
+#define KPF_SWAPBACKED		14
+#define KPF_COMPOUND_HEAD	15
+#define KPF_COMPOUND_TAIL	16
+#define KPF_UNEVICTABLE		17
+#define KPF_HWPOISON		18
+#define KPF_NOPAGE		19
+
+/* kernel hacking assistances */
+#define KPF_RESERVED		32
+#define KPF_MLOCKED		33
+#define KPF_MAPPEDTODISK	34
+#define KPF_PRIVATE		35
+#define KPF_PRIVATE2		36
+#define KPF_OWNER_PRIVATE	37
+#define KPF_ARCH		38
+#define KPF_UNCACHED		39
+
+/*
+ * Kernel flags are exported faithfully to Linus and his fellow hackers.
+ * Otherwise some details are masked to avoid confusing the end user:
+ * - some kernel flags are completely invisible
+ * - some kernel flags are conditionally invisible on their odd usages
+ */
+#ifdef CONFIG_DEBUG_KERNEL
+static inline int genuine_linus(void) { return 1; }
+#else
+static inline int genuine_linus(void) { return 0; }
+#endif
+
+#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)		\
+	do {								\
+		if (visible || genuine_linus())				\
+			uflags |= ((kflags >> kbit) & 1) << ubit;	\
+	} while (0);
+
+/* a helper function _not_ intended for more general uses */
+static inline int page_cap_writeback_dirty(struct page *page)
+{
+	struct address_space *mapping;
+
+	if (!PageSlab(page))
+		mapping = page_mapping(page);
+	else
+		mapping = NULL;
+
+	return mapping && mapping_cap_writeback_dirty(mapping);
+}
+
+static u64 get_uflags(struct page *page)
+{
+	u64 k;
+	u64 u;
+	int io;
+	int lru;
+	int slab;
+
+	/*
+	 * pseudo flag: KPF_NOPAGE
+	 * it differentiates a memory hole from a page with no flags
+	 */
+	if (!page)
+		return 1 << KPF_NOPAGE;
+
+	k = page->flags;
+	u = 0;
+
+	io   = page_cap_writeback_dirty(page);
+	lru  = k & (1 << PG_lru);
+	slab = k & (1 << PG_slab);
+
+	/*
+	 * pseudo flags for the well known (anonymous) memory mapped pages
+	 */
+	if (lru || genuine_linus()) {
+		if (!slab && page_mapped(page))
+			u |= 1 << KPF_MMAP;
+		if (PageAnon(page))
+			u |= 1 << KPF_ANON;
+	}
 
-#define kpf_copy_bit(flags, dstpos, srcpos) (((flags >> srcpos) & 1) << dstpos)
+	/*
+	 * compound pages: export both head/tail info
+	 * they together define a compound page's start/end pos and order
+	 */
+	if (PageHuge(page) || genuine_linus()) {
+		if (PageHead(page))
+			u |= 1 << KPF_COMPOUND_HEAD;
+		if (PageTail(page))
+			u |= 1 << KPF_COMPOUND_TAIL;
+	}
+
+	kpf_copy_bit(u, k, 1,	  KPF_LOCKED,		PG_locked);
+
+	/*
+	 * Caveats on high order pages:
+	 * PG_buddy will only be set on the head page; SLUB/SLQB do the same
+	 * for PG_slab; SLOB won't set PG_slab at all on compound pages.
+	 */
+	kpf_copy_bit(u, k, 1,     KPF_SLAB,		PG_slab);
+	kpf_copy_bit(u, k, 1,     KPF_BUDDY,		PG_buddy);
+
+	kpf_copy_bit(u, k, io,    KPF_ERROR,		PG_error);
+	kpf_copy_bit(u, k, io,    KPF_DIRTY,		PG_dirty);
+	kpf_copy_bit(u, k, io,    KPF_UPTODATE,		PG_uptodate);
+	kpf_copy_bit(u, k, io,    KPF_WRITEBACK,	PG_writeback);
+
+	kpf_copy_bit(u, k, 1,     KPF_LRU,		PG_lru);
+	kpf_copy_bit(u, k, lru,	  KPF_REFERENCED,	PG_referenced);
+	kpf_copy_bit(u, k, lru,   KPF_ACTIVE,		PG_active);
+	kpf_copy_bit(u, k, lru,   KPF_RECLAIM,		PG_reclaim);
+
+	kpf_copy_bit(u, k, lru,   KPF_SWAPCACHE,	PG_swapcache);
+	kpf_copy_bit(u, k, lru,   KPF_SWAPBACKED,	PG_swapbacked);
+
+#ifdef CONFIG_MEMORY_FAILURE
+	kpf_copy_bit(u, k, 1,     KPF_HWPOISON,		PG_hwpoison);
+#endif
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+	kpf_copy_bit(u, k, lru,   KPF_UNEVICTABLE,	PG_unevictable);
+	kpf_copy_bit(u, k, 0,     KPF_MLOCKED,		PG_mlocked);
+#endif
+
+	kpf_copy_bit(u, k, 0,     KPF_RESERVED,		PG_reserved);
+	kpf_copy_bit(u, k, 0,     KPF_MAPPEDTODISK,	PG_mappedtodisk);
+	kpf_copy_bit(u, k, 0,     KPF_PRIVATE,		PG_private);
+	kpf_copy_bit(u, k, 0,     KPF_PRIVATE2,		PG_private_2);
+	kpf_copy_bit(u, k, 0,     KPF_OWNER_PRIVATE,	PG_owner_priv_1);
+	kpf_copy_bit(u, k, 0,     KPF_ARCH,		PG_arch_1);
+
+#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
+	kpf_copy_bit(u, k, 0,     KPF_UNCACHED,		PG_uncached);
+#endif
+
+	if (!genuine_linus()) {
+		/*
+		 * SLUB overloads some page flags which may confuse end user.
+		 */
+		if (slab)
+			u &= ~((1 << KPF_ACTIVE) | (1 << KPF_ERROR));
+		/*
+		 * PG_reclaim could be overloaded as PG_readahead,
+		 * and we only want to export the first one.
+		 */
+		if (!(u & (1 << KPF_WRITEBACK)))
+			u &= ~(1 << KPF_RECLAIM);
+	}
+
+	return u;
+};
 
 static ssize_t kpageflags_read(struct file *file, char __user *buf,
 			     size_t count, loff_t *ppos)
@@ -92,7 +246,6 @@ static ssize_t kpageflags_read(struct fi
 	unsigned long src = *ppos;
 	unsigned long pfn;
 	ssize_t ret = 0;
-	u64 kflags, uflags;
 
 	pfn = src / KPMSIZE;
 	count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
@@ -104,24 +257,8 @@ static ssize_t kpageflags_read(struct fi
 			ppage = pfn_to_page(pfn);
 		else
 			ppage = NULL;
-		if (!ppage)
-			kflags = 0;
-		else
-			kflags = ppage->flags;
-
-		uflags = kpf_copy_bit(kflags, KPF_LOCKED, PG_locked) |
-			kpf_copy_bit(kflags, KPF_ERROR, PG_error) |
-			kpf_copy_bit(kflags, KPF_REFERENCED, PG_referenced) |
-			kpf_copy_bit(kflags, KPF_UPTODATE, PG_uptodate) |
-			kpf_copy_bit(kflags, KPF_DIRTY, PG_dirty) |
-			kpf_copy_bit(kflags, KPF_LRU, PG_lru) |
-			kpf_copy_bit(kflags, KPF_ACTIVE, PG_active) |
-			kpf_copy_bit(kflags, KPF_SLAB, PG_slab) |
-			kpf_copy_bit(kflags, KPF_WRITEBACK, PG_writeback) |
-			kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
-			kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
 
-		if (put_user(uflags, out)) {
+		if (put_user(get_uflags(ppage), out)) {
 			ret = -EFAULT;
 			break;
 		}

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  1:09   ` Wu Fengguang
@ 2009-04-28  6:55     ` Ingo Molnar
  -1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28  6:55 UTC (permalink / raw)
  To: Wu Fengguang, Steven Rostedt, Frédéric Weisbecker,
	Larry Woodman, Peter Zijlstra, Pekka Enberg,
	Eduard - Gabriel Munteanu
  Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall,
	Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> 
> 1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
>    - all available page flags are exported, and
>    - exported as is
> 2) for admins and end users
>    - only the more `well known' flags are exported:
> 	11. KPF_MMAP		(pseudo flag) memory mapped page
> 	12. KPF_ANON		(pseudo flag) memory mapped page (anonymous)
> 	13. KPF_SWAPCACHE	page is in swap cache
> 	14. KPF_SWAPBACKED	page is swap/RAM backed
> 	15. KPF_COMPOUND_HEAD	(*)
> 	16. KPF_COMPOUND_TAIL	(*)
> 	17. KPF_UNEVICTABLE	page is in the unevictable LRU list
> 	18. KPF_HWPOISON	hardware detected corruption
> 	19. KPF_NOPAGE		(pseudo flag) no page frame at the address
> 
> 	(*) For compound pages, exporting _both_ head/tail info enables
> 	    users to tell where a compound page starts/ends, and its order.
> 
>    - limit flags to their typical usage scenario, as indicated by KOSAKI:
> 	- LRU pages: only export relevant flags
> 		- PG_lru
> 		- PG_unevictable
> 		- PG_active
> 		- PG_referenced
> 		- page_mapped()
> 		- PageAnon()
> 		- PG_swapcache
> 		- PG_swapbacked
> 		- PG_reclaim
> 	- no-IO pages: mask out irrelevant flags
> 		- PG_dirty
> 		- PG_uptodate
> 		- PG_writeback
> 	- SLAB pages: mask out overloaded flags:
> 		- PG_error
> 		- PG_active
> 		- PG_private
> 	- PG_reclaim: mask out the overloaded PG_readahead
> 	- compound flags: only export huge/gigantic pages
> 
> Here are the admin/linus views of all page flags on a newly booted nfs-root system:
> 
> # ./page-types # for admin
>          flags  page-count       MB  symbolic-flags                     long-symbolic-flags
> 0x000000000000      491174     1918  ____________________________                
> 0x000000000020           1        0  _____l______________________       lru      
> 0x000000000028        2543        9  ___U_l______________________       uptodate,lru
> 0x00000000002c        5288       20  __RU_l______________________       referenced,uptodate,lru
> 0x000000004060           1        0  _____lA_______b_____________       lru,active,swapbacked

I think i have to NAK this kind of ad-hoc instrumentation of kernel 
internals and statistics until we clear up why such instrumentation 
measures are being accepted into the MM while other, more dynamic 
and more flexible MM instrumentation are being resisted by Andrew.

The above type of condensed information can be built out of dynamic 
trace data too - and much more. Being able to track page state 
transitions is very valuable when debugging VM problems. One such 
'view' of trace data would be a summary histogram like above.

( done after a "echo 3 > /proc/sys/vm/drop_caches" to make sure all 
  interesting pages have been re-established and their state is 
  present in the trace. )

The SLAB code already has such a facility, kmemtrace: it's very 
useful and successful in visualizing complex SLAB details, both 
dynamically and statically.

I think the same general approach should be used for the page 
allocator too (and for the page cache and some other struct page 
based caches): the life-time of an object should be followed. If we 
capture the important details we capture the big picture too. Pekka 
already sent an RFC patch to extend kmemtrace in such a fashion. Why 
is that more useful method not being pursued?

By extending upon the (existing) /proc/kpageflags hack a usecase is 
taken away from the tracing based solution and a needless overlap is 
created - and that's not particularly helpful IMHO. We now have all 
the facilities upstream that allow us to do intelligent 
instrumentation - we should make use of them.

	Ingo

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 1/5] pagemap: document clarifications
  2009-04-28  1:09   ` Wu Fengguang
@ 2009-04-28  7:11     ` Tommi Rantala
  -1 siblings, 0 replies; 137+ messages in thread
From: Tommi Rantala @ 2009-04-28  7:11 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen, linux-mm

2009/4/28 Wu Fengguang <fengguang.wu@intel.com>:
> Some bit ranges were inclusive and some not.
> Fix them to be consistently inclusive.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  Documentation/vm/pagemap.txt |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> --- mm.orig/Documentation/vm/pagemap.txt
> +++ mm/Documentation/vm/pagemap.txt
> @@ -12,9 +12,9 @@ There are three components to pagemap:
>    value for each virtual page, containing the following data (from
>    fs/proc/task_mmu.c, above pagemap_read):
>
> -    * Bits 0-55  page frame number (PFN) if present
> +    * Bits 0-54  page frame number (PFN) if present
>     * Bits 0-4   swap type if swapped
> -    * Bits 5-55  swap offset if swapped
> +    * Bits 5-54  swap offset if swapped
>     * Bits 55-60 page shift (page size = 1<<page shift)
>     * Bit  61    reserved for future use
>     * Bit  62    page swapped

The same fix should be applied to fs/proc/task_mmu.c as well, since
it includes the same description of the bits.

Regards,
Tommi Rantala
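
To make the corrected layout concrete, here is a minimal user-space
sketch (not code from this thread) that decodes a single pagemap
entry using the inclusive bit ranges documented above; the helper
name and output format are invented for illustration:

/* Illustrative sketch: decode one 64-bit /proc/<pid>/pagemap entry. */
#include <stdio.h>
#include <stdint.h>

#define PM_PFN_MASK		((1ULL << 55) - 1)	/* bits 0-54 */
#define PM_SWAP_TYPE_MASK	0x1fULL			/* bits 0-4  */
#define PM_SWAP_OFFSET_MASK	((1ULL << 50) - 1)	/* bits 5-54 */

static void decode_pagemap_entry(uint64_t ent)
{
	if (ent & (1ULL << 63))			/* bit 63: page present */
		printf("pfn=%llu page_size=%llu\n",
		       (unsigned long long)(ent & PM_PFN_MASK),
		       1ULL << ((ent >> 55) & 0x3f));	/* bits 55-60 */
	else if (ent & (1ULL << 62))		/* bit 62: page swapped */
		printf("swap type=%llu offset=%llu\n",
		       (unsigned long long)(ent & PM_SWAP_TYPE_MASK),
		       (unsigned long long)((ent >> 5) & PM_SWAP_OFFSET_MASK));
	else
		printf("not present\n");
}

(The entry for virtual address vaddr lives at byte offset
(vaddr / page_size) * 8 in the pagemap file.)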

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  6:55     ` Ingo Molnar
@ 2009-04-28  7:40       ` Andi Kleen
  -1 siblings, 0 replies; 137+ messages in thread
From: Andi Kleen @ 2009-04-28  7:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Wu Fengguang, Steven Rostedt, Frédéric Weisbecker,
	Larry Woodman, Peter Zijlstra, Pekka Enberg,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Andi Kleen, Matt Mackall, Alexey Dobriyan, linux-mm

> I think I have to NAK this kind of ad-hoc instrumentation of kernel 
> internals and statistics until we clear up why such instrumentation 

I think because it has zero fast path overhead and can be used
any time without enabling anything special.

> measures are being accepted into the MM while other, more dynamic 

The dynamic instrumentation you're proposing, by contrast,
has non-zero fast path overhead, especially if you consider the
CPU time needed for the backend computation in user space too.

And it requires explicit tracing first and some backend 
that counts the events and maintains a shadow data structure
covering all of mem_map again.

So it's clear your alternative will be much more costly, and it
has additional drawbacks (it needs enabling first, and cannot
take a snapshot at an arbitrary time).

Also, dynamic tracing tends to have trouble with full memory
observation. I experimented with systemtap tracing for the
memory usage paper I did a couple of years ago, but ended 
up with integrated counters (similar to those) because it was
impossible to do proper accounting for the pages set up
in early boot with the standard tracers.

I suspect both have their uses (there are indeed some things
that can only be done with dynamic tracing), but they're clearly
complementary and the static facility seems useful enough
on its own. 

I think Fengguang is demonstrating that clearly with the great
improvements he's doing for readahead, which are enabled by these
patches.

-Andi

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  6:55     ` Ingo Molnar
@ 2009-04-28  8:33       ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  8:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Frédéric Weisbecker, Larry Woodman,
	Peter Zijlstra, Pekka Enberg, Eduard - Gabriel Munteanu,
	Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall,
	Alexey Dobriyan, linux-mm

On Tue, Apr 28, 2009 at 08:55:07AM +0200, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> > 
> > 1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
> >    - all available page flags are exported, and
> >    - exported as is
> > 2) for admins and end users
> >    - only the more `well known' flags are exported:
> > 	11. KPF_MMAP		(pseudo flag) memory mapped page
> > 	12. KPF_ANON		(pseudo flag) memory mapped page (anonymous)
> > 	13. KPF_SWAPCACHE	page is in swap cache
> > 	14. KPF_SWAPBACKED	page is swap/RAM backed
> > 	15. KPF_COMPOUND_HEAD	(*)
> > 	16. KPF_COMPOUND_TAIL	(*)
> > 	17. KPF_UNEVICTABLE	page is in the unevictable LRU list
> > 	18. KPF_HWPOISON	hardware detected corruption
> > 	19. KPF_NOPAGE		(pseudo flag) no page frame at the address
> > 
> > 	(*) For compound pages, exporting _both_ head/tail info enables
> > 	    users to tell where a compound page starts/ends, and its order.
> > 
> >    - limit flags to their typical usage scenario, as indicated by KOSAKI:
> > 	- LRU pages: only export relevant flags
> > 		- PG_lru
> > 		- PG_unevictable
> > 		- PG_active
> > 		- PG_referenced
> > 		- page_mapped()
> > 		- PageAnon()
> > 		- PG_swapcache
> > 		- PG_swapbacked
> > 		- PG_reclaim
> > 	- no-IO pages: mask out irrelevant flags
> > 		- PG_dirty
> > 		- PG_uptodate
> > 		- PG_writeback
> > 	- SLAB pages: mask out overloaded flags:
> > 		- PG_error
> > 		- PG_active
> > 		- PG_private
> > 	- PG_reclaim: mask out the overloaded PG_readahead
> > 	- compound flags: only export huge/gigantic pages
> > 
> > Here are the admin/linux views of all page flags on a newly booted nfs-root system:
> > 
> > # ./page-types # for admin
> >          flags  page-count       MB  symbolic-flags                     long-symbolic-flags
> > 0x000000000000      491174     1918  ____________________________                
> > 0x000000000020           1        0  _____l______________________       lru      
> > 0x000000000028        2543        9  ___U_l______________________       uptodate,lru
> > 0x00000000002c        5288       20  __RU_l______________________       referenced,uptodate,lru
> > 0x000000004060           1        0  _____lA_______b_____________       lru,active,swapbacked
> 
> I think I have to NAK this kind of ad-hoc instrumentation of kernel 
> internals and statistics until we clear up why such instrumentation 
> measures are being accepted into the MM while other, more dynamic 
> and more flexible MM instrumentation is being resisted by Andrew.

An unexpected NAK - throwing away an orange because we are about to get an apple? ;-)

Anyway, here are the missing rationales.

1) FAST

It takes merely 0.2s to scan 4GB of pages:

        ./page-types  0.02s user 0.20s system 99% cpu 0.216 total
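
(To make "fast" concrete: a stripped-down sketch of such a scan - not
the actual page-types source - reading one 64-bit flag word per PFN.
The KPF bit numbers used are the ones visible in the listing above:
lru is bit 5, swapbacked is bit 14.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	FILE *f = fopen("/proc/kpageflags", "rb");
	uint64_t flags, total = 0, lru = 0, swapbacked = 0;

	if (!f) {
		perror("/proc/kpageflags");
		return 1;
	}
	/* one u64 of flags per physical page frame, in PFN order */
	while (fread(&flags, sizeof(flags), 1, f) == 1) {
		total++;
		if (flags & (1ULL << 5))	/* KPF_LRU */
			lru++;
		if (flags & (1ULL << 14))	/* KPF_SWAPBACKED */
			swapbacked++;
	}
	fclose(f);
	printf("pages=%llu lru=%llu swapbacked=%llu\n",
	       (unsigned long long)total, (unsigned long long)lru,
	       (unsigned long long)swapbacked);
	return 0;
}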

2) SIMPLE

/proc/kpageflags will be a *long standing* hack we have to live with -
it was originally introduced by Matt for shared memory accounting and
as a facility to analyze applications' memory consumption, with the hope
that it would also help kernel developers someday.

So why not extend and embrace it, in a straightforward way?

3) USE CASES

I have/will take advantage of the above page-types command in a number of ways:
- to help track down a memory leak (the recent trace/ring_buffer.c case)
- to estimate the system-wide readahead miss ratio
- Andi wants to examine the major page types in different workloads
  (for the hwpoison work)
- me too, for the fun of learning: read/write/lock/whatever a lot of pages
  and examine their flags, to get an idea of some random kernel behaviors.
  (the dynamic tracing tools can be more helpful, as a different view)

4) COMPLEMENTARITY

In some cases the dynamic tracing tool is not enough (or too complex)
to rebuild the current status view.

I myself have a dynamic readahead tracing tool (very useful!).
At the same time I also use readahead accounting numbers, the
/proc/filecache tool (frequently!), and the above page-types tool.
I simply need them all - they are handy for different cases.

Thanks,
Fengguang

> The above type of condensed information can be built out of dynamic 
> trace data too - and much more. Being able to track page state 
> transitions is very valuable when debugging VM problems. One such 
> 'view' of trace data would be a summary histogram like above.
> 
> ( done after a "echo 3 > /proc/sys/vm/drop_caches" to make sure all 
>   interesting pages have been re-established and their state is 
>   present in the trace. )
> 
> The SLAB code already has such a facility, kmemtrace: it's very 
> useful and successful in visualizing complex SLAB details, both 
> dynamically and statically.
> 
> I think the same general approach should be used for the page 
> allocator too (and for the page cache and some other struct page 
> based caches): the life-time of an object should be followed. If we 
> capture the important details we capture the big picture too. Pekka 
> already sent an RFC patch to extend kmemtrace in such a fashion. Why 
> is that more useful method not being pursued?
> 
> By extending upon the (existing) /proc/kpageflags hack a usecase is 
> taken away from the tracing based solution and a needless overlap is 
> created - and that's not particularly helpful IMHO. We now have all 
> the facilities upstream that allow us to do intelligent 
> instrumentation - we should make use of them.
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  7:40       ` Andi Kleen
@ 2009-04-28  9:04         ` Pekka Enberg
  -1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28  9:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm

Hi Andi,

On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
> > I think I have to NAK this kind of ad-hoc instrumentation of kernel 
> > internals and statistics until we clear up why such instrumentation 
> 
> I think because it has zero fast path overhead and can be used
> any time without enabling anything special.

Yes, zero overhead is important for certain things (like
CONFIG_SLUB_STATS, for example). However, putting slab allocator
specific checks in fs/proc looks pretty fragile to me. It would be nice
to have this under the "kmemtrace umbrella" so that there's just one
place that needs to be fixed up when allocators change.

Also, while you probably don't want to use tracepoints for this kind of
instrumentation, you might want to look into reusing the ftrace
reporting bits.

			Pekka


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:04         ` Pekka Enberg
@ 2009-04-28  9:10           ` Andi Kleen
  -1 siblings, 0 replies; 137+ messages in thread
From: Andi Kleen @ 2009-04-28  9:10 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andi Kleen, Ingo Molnar, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm

> Yes, zero overhead is important for certain things (like
> CONFIG_SLUB_STATS, for example). However, putting slab allocator
> specific checks in fs/proc looks pretty fragile to me. It would be nice

OK, perhaps that could be put into an inline in slab.h. Would
that address your concerns?
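
(Such an inline might look roughly like this - a sketch with an
invented helper name, using the set of overloaded flags from the
patch description, so that fs/proc no longer has to know about them:)

/* Hypothetical <linux/slab.h> helper: is this page flag overloaded
 * on slab pages?  Flag set taken from the patch description. */
static inline int slab_page_flag_overloaded(int flag)
{
	return flag == PG_error ||
	       flag == PG_active ||
	       flag == PG_private;
}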

> Also, while you probably don't want to use tracepoints for this kind of
> instrumentation, you might want to look into reusing the ftrace
> reporting bits.

There's already perfectly fine code in-tree for this; I don't see why it
would need another infrastructure that doesn't really fit anyway.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:04         ` Pekka Enberg
@ 2009-04-28  9:15           ` Ingo Molnar
  -1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28  9:15 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Pekka Enberg <penberg@cs.helsinki.fi> wrote:

> Hi Andi,
> 
> On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
> > > I think I have to NAK this kind of ad-hoc instrumentation of kernel 
> > > internals and statistics until we clear up why such instrumentation 
> > 
> > I think because it has zero fast path overhead and can be used 
> > any time without enabling anything special.

( That's a dubious claim in any case - tracepoints are very cheap.
  And they could be made even cheaper, and such efforts would benefit
  all the tracepoint users, so it's a prime focus of interest.
  Andi is a SystemTap proponent, right? I saw him oppose pretty much 
  everything related to built-in kernel tracing. I consider that a 
  pretty extreme position. )

> Yes, zero overhead is important for certain things (like 
> CONFIG_SLUB_STATS, for example). However, putting slab allocator 
> specific checks in fs/proc looks pretty fragile to me. It would be 
> nice to have this under the "kmemtrace umbrella" so that there's 
> just one place that needs to be fixed up when allocators change.
> 
> Also, while you probably don't want to use tracepoints for this 
> kind of instrumentation, you might want to look into reusing the 
> ftrace reporting bits.

Exactly - we have a tracing and statistics framework for a reason.

	Ingo
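
(For reference, the kind of tracepoint this framework debate is about
looks roughly as follows - a sketch only, using the TRACE_EVENT API;
the event name and field set are invented for illustration and are
not code from this thread. In real use the definition would live in a
header under include/trace/events/.)

#include <linux/mm.h>
#include <linux/tracepoint.h>

TRACE_EVENT(mm_page_alloc_sketch,

	TP_PROTO(struct page *page, unsigned int order, gfp_t gfp_flags),

	TP_ARGS(page, order, gfp_flags),

	TP_STRUCT__entry(
		__field(unsigned long,	pfn)
		__field(unsigned int,	order)
		__field(unsigned int,	gfp_flags)
	),

	TP_fast_assign(
		__entry->pfn		= page_to_pfn(page);
		__entry->order		= order;
		__entry->gfp_flags	= (unsigned int)gfp_flags;
	),

	TP_printk("pfn=%lu order=%u gfp=0x%x",
		  __entry->pfn, __entry->order, __entry->gfp_flags)
);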

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:10           ` Andi Kleen
@ 2009-04-28  9:15             ` Pekka Enberg
  -1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28  9:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm

Hi Andi,

On Tue, Apr 28, 2009 at 12:10 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> Yes, zero overhead is important for certain things (like
>> CONFIG_SLUB_STATS, for example). However, putting slab allocator
>> specific checks in fs/proc looks pretty fragile to me. It would be nice
>
> OK, perhaps that could be put into an inline in slab.h. Would
> that address your concerns?

Yeah, I'm fine with that. Putting them in the individual
slub_def.h/slob_def.h headers might be even better.

On Tue, Apr 28, 2009 at 12:10 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> Also, while you probably don't want to use tracepoints for this kind of
>> instrumentation, you might want to look into reusing the ftrace
>> reporting bits.
>
> There's already perfectly fine code in-tree for this; I don't see why it
> would need another infrastructure that doesn't really fit anyway.

It's just that I suspect that we want page flag printing and
zero-overhead statistics for kmemtrace at some point. But anyway, I'm
not objecting to extending /proc/kpageflags if that's what people want
to do.

                                          Pekka

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:15           ` Ingo Molnar
@ 2009-04-28  9:19             ` Pekka Enberg
  -1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28  9:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm

Hi Ingo,

On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
>> > > I think I have to NAK this kind of ad-hoc instrumentation of kernel
>> > > internals and statistics until we clear up why such instrumentation

* Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>> > I think because it has zero fast path overhead and can be used
>> > any time without enabling anything special.

On Tue, Apr 28, 2009 at 12:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
> ( That's a dubious claim in any case - tracepoints are very cheap.
>  And they could be made even cheaper, and such efforts would benefit
>  all the tracepoint users, so it's a prime focus of interest.
>  Andi is a SystemTap proponent, right? I saw him oppose pretty much
>  everything related to built-in kernel tracing. I consider that a
>  pretty extreme position. )

I have no idea how expensive tracepoints are, but I suspect they don't
make too much sense for this particular scenario. After all, kmemtrace
is mainly interested in _allocation patterns_, whereas this patch seems
to be more interested in "memory layout" type of things.

                                                Pekka

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  8:33       ` Wu Fengguang
@ 2009-04-28  9:24         ` Ingo Molnar
  -1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28  9:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Steven Rostedt, Frédéric Weisbecker, Larry Woodman,
	Peter Zijlstra, Pekka Enberg, Eduard - Gabriel Munteanu,
	Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall,
	Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Tue, Apr 28, 2009 at 08:55:07AM +0200, Ingo Molnar wrote:
> > 
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> > > 
> > > 1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
> > >    - all available page flags are exported, and
> > >    - exported as is
> > > 2) for admins and end users
> > >    - only the more `well known' flags are exported:
> > > 	11. KPF_MMAP		(pseudo flag) memory mapped page
> > > 	12. KPF_ANON		(pseudo flag) memory mapped page (anonymous)
> > > 	13. KPF_SWAPCACHE	page is in swap cache
> > > 	14. KPF_SWAPBACKED	page is swap/RAM backed
> > > 	15. KPF_COMPOUND_HEAD	(*)
> > > 	16. KPF_COMPOUND_TAIL	(*)
> > > 	17. KPF_UNEVICTABLE	page is in the unevictable LRU list
> > > 	18. KPF_HWPOISON	hardware detected corruption
> > > 	19. KPF_NOPAGE		(pseudo flag) no page frame at the address
> > > 
> > > 	(*) For compound pages, exporting _both_ head/tail info enables
> > > 	    users to tell where a compound page starts/ends, and its order.
> > > 
> > >    - limit flags to their typical usage scenario, as indicated by KOSAKI:
> > > 	- LRU pages: only export relevant flags
> > > 		- PG_lru
> > > 		- PG_unevictable
> > > 		- PG_active
> > > 		- PG_referenced
> > > 		- page_mapped()
> > > 		- PageAnon()
> > > 		- PG_swapcache
> > > 		- PG_swapbacked
> > > 		- PG_reclaim
> > > 	- no-IO pages: mask out irrelevant flags
> > > 		- PG_dirty
> > > 		- PG_uptodate
> > > 		- PG_writeback
> > > 	- SLAB pages: mask out overloaded flags:
> > > 		- PG_error
> > > 		- PG_active
> > > 		- PG_private
> > > 	- PG_reclaim: mask out the overloaded PG_readahead
> > > 	- compound flags: only export huge/gigantic pages
> > > 
> > > Here are the admin/linux views of all page flags on a newly booted nfs-root system:
> > > 
> > > # ./page-types # for admin
> > >          flags  page-count       MB  symbolic-flags                     long-symbolic-flags
> > > 0x000000000000      491174     1918  ____________________________                
> > > 0x000000000020           1        0  _____l______________________       lru      
> > > 0x000000000028        2543        9  ___U_l______________________       uptodate,lru
> > > 0x00000000002c        5288       20  __RU_l______________________       referenced,uptodate,lru
> > > 0x000000004060           1        0  _____lA_______b_____________       lru,active,swapbacked
> > 
> > I think I have to NAK this kind of ad-hoc instrumentation of kernel 
> > internals and statistics until we clear up why such instrumentation 
> > measures are being accepted into the MM while other, more dynamic 
> > and more flexible MM instrumentation is being resisted by Andrew.
> 
> An unexpected NAK - throwing away an orange because we are about to get an apple? ;-)
> 
> Anyway, here are the missing rationales.
> 
> 1) FAST
> 
> It takes merely 0.2s to scan 4GB of pages:
> 
>         ./page-types  0.02s user 0.20s system 99% cpu 0.216 total
> 
> 2) SIMPLE
> 
> /proc/kpageflags will be a *long standing* hack we have to live 
> with - it was originally introduced by Matt for shared memory 
> accounting and as a facility to analyze applications' memory 
> consumption, with the hope that it would also help kernel developers 
> someday.
> 
> So why not extend and embrace it, in a straightforward way?
> 
> 3) USE CASES
> 
> I have/will take advantage of the above page-types command in a number of ways:
> - to help track down a memory leak (the recent trace/ring_buffer.c case)
> - to estimate the system-wide readahead miss ratio
> - Andi wants to examine the major page types in different workloads
>   (for the hwpoison work)
> - me too, for the fun of learning: read/write/lock/whatever a lot of pages
>   and examine their flags, to get an idea of some random kernel behaviors.
>   (the dynamic tracing tools can be more helpful, as a different view)
> 
> 4) COMPLEMENTARITY
> 
> In some cases the dynamic tracing tool is not enough (or too complex)
> to rebuild the current status view.
> 
> I myself have a dynamic readahead tracing tool (very useful!). At 
> the same time I also use readahead accounting numbers, the 
> /proc/filecache tool (frequently!), and the above page-types tool. 
> I simply need them all - they are handy for different cases.

Well, the main counter-argument here is that statistics are _derived_ 
from events. In their simplest form, the 'counts' are the integral of 
events over time.

So if we capture all interesting events, and do that with low 
overhead (and in fact can even collect and integrate them in-kernel, 
today), we _don't have_ to maintain various overlapping counters all 
around the kernel. This is really a general instrumentation design 
observation.
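
(As a toy illustration of that observation - all names invented: the
instantaneous count can always be rebuilt by integrating an
alloc/free event stream:)

/* Toy sketch of "counts are the integral of events". */
struct page_event {
	int		is_alloc;	/* 1 = allocation, 0 = free */
	unsigned int	order;		/* event covers 2^order pages */
};

static unsigned long pages_in_use(const struct page_event *ev,
				  unsigned long nr_events)
{
	unsigned long count = 0, i;

	for (i = 0; i < nr_events; i++) {
		if (ev[i].is_alloc)
			count += 1UL << ev[i].order;
		else
			count -= 1UL << ev[i].order;
	}
	return count;
}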

Every time we add yet another /proc hack, we splinter Linux 
instrumentation in a hard-to-reverse way.

So your single-purpose /proc hack could be made multi-purpose and 
could help a much broader range of people, with just a little bit of 
effort, I believe. Pekka already wrote the page tracking patch, for 
example; that would be a good starting point.

Does it mean more work to do? You bet ;-)

	Ingo

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:19             ` Pekka Enberg
@ 2009-04-28  9:25               ` Pekka Enberg
  -1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28  9:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
>>> > > I think I have to NAK this kind of ad-hoc instrumentation of kernel
>>> > > internals and statistics until we clear up why such instrumentation

* Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>>> > I think because it has zero fast path overhead and can be used
>>> > any time without enabling anything special.
>
> On Tue, Apr 28, 2009 at 12:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> ( That's a dubious claim in any case - tracepoints are very cheap.
>>  And they could be made even cheaper, and such efforts would benefit
>>  all the tracepoint users, so it's a prime focus of interest.
>>  Andi is a SystemTap proponent, right? I saw him oppose pretty much
>>  everything related to built-in kernel tracing. I consider that a
>>  pretty extreme position. )

On Tue, Apr 28, 2009 at 12:19 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> I have no idea how expensive tracepoints are, but I suspect they don't
> make too much sense for this particular scenario. After all, kmemtrace
> is mainly interested in _allocation patterns_, whereas this patch seems
> to be more interested in "memory layout" type of things.

That said, I do foresee a need to be able to turn on more detailed
tracing after you've identified problematic areas from a kpageflags-type
overview report. And for that, you almost certainly want a
kmemtrace/tracepoints-style solution with the pid/function/whatever
regexp matching that ftrace already provides.

                        Pekka

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:19             ` Pekka Enberg
@ 2009-04-28  9:29               ` Ingo Molnar
  -1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28  9:29 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Pekka Enberg <penberg@cs.helsinki.fi> wrote:

> I have no idea how expensive tracepoints are, but I suspect they 
> don't make too much sense for this particular scenario. After all, 
> kmemtrace is mainly interested in _allocation patterns_, whereas 
> this patch seems to be more interested in "memory layout" type of 
> things.

My point is that the allocation patterns can be derived from dynamic 
events. We can build a map of everything if we know all the events 
that led up to it. Doing:

  echo 3 > /proc/sys/vm/drop_caches

will clear 99% of the memory allocations, so we can build a new map 
from scratch just about anytime. (and if boot allocations are 
interesting they can be traced too)

_And_ via this angle we'll also have access to the dynamic events, 
in a different 'view' of the same tracepoints - which is obviously 
very useful for different purposes.

	Ingo

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28  9:29               ` Ingo Molnar
  0 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28  9:29 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Pekka Enberg <penberg@cs.helsinki.fi> wrote:

> I have no idea how expensive tracepoints are but I suspect they 
> don't make too much sense for this particular scenario. After all, 
> kmemtrace is mainly interested in _allocation patterns_ whereas 
> this patch seems to be more interested in "memory layout" type of 
> things.

My point is that the allocation patterns can be derived from dynamic 
events. We can build a map of everything if we know all the events 
that led up to it. Doing:

  echo 3 > /proc/sys/vm/drop_caches

will clear 99% of the memory allocations, so we can build a new map 
from scratch just about anytime. (and if boot allocations are 
interesting they can be traced too)

_And_ via this angle we'll also have access to the dynamic events, 
in a different 'view' of the same tracepoints - which is obviously 
very useful for different purposes.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:29               ` Ingo Molnar
@ 2009-04-28  9:34                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28  9:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kosaki.motohiro, Pekka Enberg, Andi Kleen, Wu Fengguang,
	Steven Rostedt, Fr馘駻ic Weisbecker, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

> 
> * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> 
> > I have no idea how expensive tracepoints are but I suspect they 
> > don't make too much sense for this particular scenario. After all, 
> > kmemtrace is mainly interested in _allocation patterns_ whereas 
> > this patch seems to be more interested in "memory layout" type of 
> > things.
> 
> My point is that the allocation patterns can be derived from dynamic 
> events. We can build a map of everything if we know all the events 
> that led up to it. Doing:
> 
>   echo 3 > /proc/sys/vm/drop_caches
> 
> will clear 99% of the memory allocations, so we can build a new map 
> from scratch just about anytime. (and if boot allocations are 
> interesting they can be traced too)
> 
> _And_ via this angle we'll also have access to the dynamic events, 
> in a different 'view' of the same tracepoints - which is obviously 
> very useful for different purposes.

I am one of most strongly want guys to MM tracepoint.
but No, many cunstomer never permit to use drop_caches.

I believe this patch and tracepoint are _both_ necessary and useful.




^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28  9:34                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28  9:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kosaki.motohiro, Pekka Enberg, Andi Kleen, Wu Fengguang,
	Steven Rostedt, Fr馘駻ic Weisbecker, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

> 
> * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> 
> > I have no idea how expensive tracepoints are but I suspect they 
> > don't make too much sense for this particular scenario. After all, 
> > kmemtrace is mainly interested in _allocation patterns_ whereas 
> > this patch seems to be more interested in "memory layout" type of 
> > things.
> 
> My point is that the allocation patterns can be derived from dynamic 
> events. We can build a map of everything if we know all the events 
> that led up to it. Doing:
> 
>   echo 3 > /proc/sys/vm/drop_caches
> 
> will clear 99% of the memory allocations, so we can build a new map 
> from scratch just about anytime. (and if boot allocations are 
> interesting they can be traced too)
> 
> _And_ via this angle we'll also have access to the dynamic events, 
> in a different 'view' of the same tracepoints - which is obviously 
> very useful for different purposes.

I am one of most strongly want guys to MM tracepoint.
but No, many cunstomer never permit to use drop_caches.

I believe this patch and tracepoint are _both_ necessary and useful.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:25               ` Pekka Enberg
@ 2009-04-28  9:36                 ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  9:36 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Ingo Molnar, Andi Kleen, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Tue, Apr 28, 2009 at 05:25:06PM +0800, Pekka Enberg wrote:
> On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
> >>> > > I think i have to NAK this kind of ad-hoc instrumentation of kernel
> >>> > > internals and statistics until we clear up why such instrumentation
> 
> * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> >>> > I think because it has zero fast path overhead and can be used
> >>> > any time without enabling anything special.
> >
> > On Tue, Apr 28, 2009 at 12:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >> ( That's a dubious claim in any case - tracepoints are very cheap.
> >>  And they could be made even cheaper and such efforts would benefit
> >>  all the tracepoint users so it's a prime focus of interest.
> >>  Andi is a SystemTap proponent, right? I saw him oppose pretty much
> >>  everything built-in kernel tracing related. I consider that a
> >>  pretty extreme position. )
> 
> On Tue, Apr 28, 2009 at 12:19 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > I have no idea how expensive tracepoints are but I suspect they don't
> > make too much sense for this particular scenario. After all, kmemtrace
> > is mainly interested in _allocation patterns_ whereas this patch seems
> > to be more interested in "memory layout" type of things.
> 
> That said, I do foresee a need to be able to turn on more detailed
> tracing after you've identified problematic areas from kpageflags type
> of overview report. And for that, you almost certainly want
> kmemtrace/tracepoints style solution with pid/function/whatever regexp
> matching ftrace already provides.

Exactly - kmemtrace is the tool I looked for when hunting down the
page flags of the leaked ring buffer memory :-)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28  9:36                 ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  9:36 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Ingo Molnar, Andi Kleen, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Tue, Apr 28, 2009 at 05:25:06PM +0800, Pekka Enberg wrote:
> On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
> >>> > > I think i have to NAK this kind of ad-hoc instrumentation of kernel
> >>> > > internals and statistics until we clear up why such instrumentation
> 
> * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> >>> > I think because it has zero fast path overhead and can be used
> >>> > any time without enabling anything special.
> >
> > On Tue, Apr 28, 2009 at 12:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >> ( That's a dubious claim in any case - tracepoints are very cheap.
> >> A And they could be made even cheaper and such efforts would benefit
> >> A all the tracepoint users so it's a prime focus of interest.
> >> A Andi is a SystemTap proponent, right? I saw him oppose pretty much
> >> A everything built-in kernel tracing related. I consider that a
> >> A pretty extreme position. )
> 
> On Tue, Apr 28, 2009 at 12:19 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > I have no idea how expensive tracepoints are but I suspect they don't
> > make too much sense for this particular scenario. After all, kmemtrace
> > is mainly interested in _allocation patterns_ whereas this patch seems
> > to be more interested in "memory layout" type of things.
> 
> That said, I do foresee a need to be able to turn on more detailed
> tracing after you've identified problematic areas from kpageflags type
> of overview report. And for that, you almost certainly want
> kmemtrace/tracepoints style solution with pid/function/whatever regexp
> matching ftrace already provides.

Exactly - kmemtrace is the tool I looked for when hunting down the
page flags of the leaked ring buffer memory :-)

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:25               ` Pekka Enberg
@ 2009-04-28  9:36                 ` Ingo Molnar
  -1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28  9:36 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Pekka Enberg <penberg@cs.helsinki.fi> wrote:

> > I have no idea how expensive tracepoints are but I suspect they 
> > don't make too much sense for this particular scenario. After 
> > all, kmemtrace is mainly interested in _allocation patterns_ 
> > whereas this patch seems to be more interested in "memory 
> > layout" type of things.
> 
> That said, I do foresee a need to be able to turn on more detailed 
> tracing after you've identified problematic areas from kpageflags 
> type of overview report. And for that, you almost certainly want 
> kmemtrace/tracepoints style solution with pid/function/whatever 
> regexp matching ftrace already provides.

yes. My point is that by having the latter, we pretty much have the 
former as well!

I 'integrate' traces all the time to get summary counts. This series 
of dynamic events:

  allocation
  page count up
  page count up
  page count down
  page count up
  page count up
  page count up
  page count up

integrates into: "page count is 6".

Note that "integration" can be done wholly in the kernel too, 
without going to the overhead of streaming all dynamic events to 
user-space, just to summarize data into counts, in-kernel. That is 
what the ftrace statistics framework and various ftrace plugins are 
about.

Also, it might make sense to extend the framework with a series of 
'get current object state' events when tracing is turned on. A 
special case of _that_ would in essence be what the /proc hack does 
now - just expressed in a much more generic, and a much more usable 
form.

	Ingo

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28  9:36                 ` Ingo Molnar
  0 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28  9:36 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Pekka Enberg <penberg@cs.helsinki.fi> wrote:

> > I have no idea how expensive tracepoints are but I suspect they 
> > don't make too much sense for this particular scenario. After 
> > all, kmemtrace is mainly interested in _allocation patterns_ 
> > whereas this patch seems to be more interested in "memory 
> > layout" type of things.
> 
> That said, I do foresee a need to be able to turn on more detailed 
> tracing after you've identified problematic areas from kpageflags 
> type of overview report. And for that, you almost certainly want 
> kmemtrace/tracepoints style solution with pid/function/whatever 
> regexp matching ftrace already provides.

yes. My point is that by having the latter, we pretty much have the 
former as well!

I 'integrate' traces all the time to get summary counts. This series 
of dynamic events:

  allocation
  page count up
  page count up
  page count down
  page count up
  page count up
  page count up
  page count up

integrates into: "page count is 6".

Note that "integration" can be done wholly in the kernel too, 
without going to the overhead of streaming all dynamic events to 
user-space, just to summarize data into counts, in-kernel. That is 
what the ftrace statistics framework and various ftrace plugins are 
about.

Also, it might make sense to extend the framework with a series of 
'get current object state' events when tracing is turned on. A 
special case of _that_ would in essence be what the /proc hack does 
now - just expressed in a much more generic, and a much more usable 
form.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:34                 ` KOSAKI Motohiro
@ 2009-04-28  9:38                   ` Ingo Molnar
  -1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28  9:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Pekka Enberg, Andi Kleen, Wu Fengguang, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm


* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > 
> > * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > 
> > > I have no idea how expensive tracepoints are but I suspect they 
> > > don't make too much sense for this particular scenario. After all, 
> > > kmemtrace is mainly interested in _allocation patterns_ whereas 
> > > this patch seems to be more interested in "memory layout" type of 
> > > things.
> > 
> > My point is that the allocation patterns can be derived from dynamic 
> > events. We can build a map of everything if we know all the events 
> > that led up to it. Doing:
> > 
> >   echo 3 > /proc/sys/vm/drop_caches
> > 
> > will clear 99% of the memory allocations, so we can build a new map 
> > from scratch just about anytime. (and if boot allocations are 
> > interesting they can be traced too)
> > 
> > _And_ via this angle we'll also have access to the dynamic events, 
> > in a different 'view' of the same tracepoints - which is obviously 
> > very useful for different purposes.
> 
> I am one of most strongly want guys to MM tracepoint. but No, many 
> cunstomer never permit to use drop_caches.

See my other mail i just sent: it would be a natural extension of 
tracing to also dump all current object state when tracing is turned 
on. That way no drop_caches is needed at all.

But it has to be expressed in one framework that cares about the 
totality of the kernel - not just these splintered bits of 
instrumentation and pieces of statistics.

	Ingo

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28  9:38                   ` Ingo Molnar
  0 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28  9:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Pekka Enberg, Andi Kleen, Wu Fengguang, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm


* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > 
> > * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > 
> > > I have no idea how expensive tracepoints are but I suspect they 
> > > don't make too much sense for this particular scenario. After all, 
> > > kmemtrace is mainly interested in _allocation patterns_ whereas 
> > > this patch seems to be more interested in "memory layout" type of 
> > > things.
> > 
> > My point is that the allocation patterns can be derived from dynamic 
> > events. We can build a map of everything if we know all the events 
> > that led up to it. Doing:
> > 
> >   echo 3 > /proc/sys/vm/drop_caches
> > 
> > will clear 99% of the memory allocations, so we can build a new map 
> > from scratch just about anytime. (and if boot allocations are 
> > interesting they can be traced too)
> > 
> > _And_ via this angle we'll also have access to the dynamic events, 
> > in a different 'view' of the same tracepoints - which is obviously 
> > very useful for different purposes.
> 
> I am one of most strongly want guys to MM tracepoint. but No, many 
> cunstomer never permit to use drop_caches.

See my other mail i just sent: it would be a natural extension of 
tracing to also dump all current object state when tracing is turned 
on. That way no drop_caches is needed at all.

But it has to be expressed in one framework that cares about the 
totality of the kernel - not just these splintered bits of 
instrumentation and pieces of statistics.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:38                   ` Ingo Molnar
@ 2009-04-28  9:55                     ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  9:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm

On Tue, Apr 28, 2009 at 05:38:33PM +0800, Ingo Molnar wrote:
> 
> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > 
> > > * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > > 
> > > > I have no idea how expensive tracepoints are but I suspect they 
> > > > don't make too much sense for this particular scenario. After all, 
> > > > kmemtrace is mainly interested in _allocation patterns_ whereas 
> > > > this patch seems to be more interested in "memory layout" type of 
> > > > things.
> > > 
> > > My point is that the allocation patterns can be derived from dynamic 
> > > events. We can build a map of everything if we know all the events 
> > > that led up to it. Doing:
> > > 
> > >   echo 3 > /proc/sys/vm/drop_caches
> > > 
> > > will clear 99% of the memory allocations, so we can build a new map 
> > > from scratch just about anytime. (and if boot allocations are 
> > > interesting they can be traced too)
> > > 
> > > _And_ via this angle we'll also have access to the dynamic events, 
> > > in a different 'view' of the same tracepoints - which is obviously 
> > > very useful for different purposes.
> > 
> > I am one of most strongly want guys to MM tracepoint. but No, many 
> > cunstomer never permit to use drop_caches.
> 
> See my other mail i just sent: it would be a natural extension of 
> tracing to also dump all current object state when tracing is turned 
> on. That way no drop_caches is needed at all.

I can understand the merits here - I also did readahead
tracing/accounting in _one_ piece of code. Very handy.

The readahead traces are now raw printks - converting to the ftrace
framework would be a big win.

But. It's still not a fit-all solution. Imagine when full data _since_
booting is required, but the user cannot afford a reboot.

> But it has to be expressed in one framework that cares about the 
> totality of the kernel - not just these splintered bits of 
> instrumentation and pieces of statistics.

Though minded to push the kpageflags interface, I totally agree the
above fine principle and discipline :-)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28  9:55                     ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28  9:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm

On Tue, Apr 28, 2009 at 05:38:33PM +0800, Ingo Molnar wrote:
> 
> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > 
> > > * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > > 
> > > > I have no idea how expensive tracepoints are but I suspect they 
> > > > don't make too much sense for this particular scenario. After all, 
> > > > kmemtrace is mainly interested in _allocation patterns_ whereas 
> > > > this patch seems to be more interested in "memory layout" type of 
> > > > things.
> > > 
> > > My point is that the allocation patterns can be derived from dynamic 
> > > events. We can build a map of everything if we know all the events 
> > > that led up to it. Doing:
> > > 
> > >   echo 3 > /proc/sys/vm/drop_caches
> > > 
> > > will clear 99% of the memory allocations, so we can build a new map 
> > > from scratch just about anytime. (and if boot allocations are 
> > > interesting they can be traced too)
> > > 
> > > _And_ via this angle we'll also have access to the dynamic events, 
> > > in a different 'view' of the same tracepoints - which is obviously 
> > > very useful for different purposes.
> > 
> > I am one of most strongly want guys to MM tracepoint. but No, many 
> > cunstomer never permit to use drop_caches.
> 
> See my other mail i just sent: it would be a natural extension of 
> tracing to also dump all current object state when tracing is turned 
> on. That way no drop_caches is needed at all.

I can understand the merits here - I also did readahead
tracing/accounting in _one_ piece of code. Very handy.

The readahead traces are now raw printks - converting to the ftrace
framework would be a big win.

But. It's still not a fit-all solution. Imagine when full data _since_
booting is required, but the user cannot afford a reboot.

> But it has to be expressed in one framework that cares about the 
> totality of the kernel - not just these splintered bits of 
> instrumentation and pieces of statistics.

Though minded to push the kpageflags interface, I totally agree the
above fine principle and discipline :-)

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:36                 ` Ingo Molnar
@ 2009-04-28  9:57                   ` Pekka Enberg
  -1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28  9:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm

Hi Ingo,

On Tue, Apr 28, 2009 at 12:36 PM, Ingo Molnar <mingo@elte.hu> wrote:
> I 'integrate' traces all the time to get summary counts. This series
> of dynamic events:
>
>  allocation
>  page count up
>  page count up
>  page count down
>  page count up
>  page count up
>  page count up
>  page count up
>
> integrates into: "page count is 6".
>
> Note that "integration" can be done wholly in the kernel too,
> without going to the overhead of streaming all dynamic events to
> user-space, just to summarize data into counts, in-kernel. That is
> what the ftrace statistics framework and various ftrace plugins are
> about.
>
> Also, it might make sense to extend the framework with a series of
> 'get current object state' events when tracing is turned on. A
> special case of _that_ would in essence be what the /proc hack does
> now - just expressed in a much more generic, and a much more usable
> form.

I guess the main question here is whether this approach will scale to
something like kmalloc() or the page allocator in production
environments. For any serious workload, the frequency of events is
going to be pretty high.

                                            Pekka

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28  9:57                   ` Pekka Enberg
  0 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28  9:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm

Hi Ingo,

On Tue, Apr 28, 2009 at 12:36 PM, Ingo Molnar <mingo@elte.hu> wrote:
> I 'integrate' traces all the time to get summary counts. This series
> of dynamic events:
>
>  allocation
>  page count up
>  page count up
>  page count down
>  page count up
>  page count up
>  page count up
>  page count up
>
> integrates into: "page count is 6".
>
> Note that "integration" can be done wholly in the kernel too,
> without going to the overhead of streaming all dynamic events to
> user-space, just to summarize data into counts, in-kernel. That is
> what the ftrace statistics framework and various ftrace plugins are
> about.
>
> Also, it might make sense to extend the framework with a series of
> 'get current object state' events when tracing is turned on. A
> special case of _that_ would in essence be what the /proc hack does
> now - just expressed in a much more generic, and a much more usable
> form.

I guess the main question here is whether this approach will scale to
something like kmalloc() or the page allocator in production
environments. For any serious workload, the frequency of events is
going to be pretty high.

                                            Pekka

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:57                   ` Pekka Enberg
@ 2009-04-28 10:10                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28 10:10 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: kosaki.motohiro, Ingo Molnar, Andi Kleen, Wu Fengguang,
	Steven Rostedt, Fr馘駻ic Weisbecker, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

> I guess the main question here is whether this approach will scale to
> something like kmalloc() or the page allocator in production
> environments. For any serious workload, the frequency of events is
> going to be pretty high.

Immediate Values patch series makes zero-overhead to tracepoint
while it's not used.

So, We have to implement to stop collect stastics way. it restore
zero overhead world.
We don't lose any performance by trace.




^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28 10:10                     ` KOSAKI Motohiro
  0 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28 10:10 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: kosaki.motohiro, Ingo Molnar, Andi Kleen, Wu Fengguang,
	Steven Rostedt, Fr馘駻ic Weisbecker, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

> I guess the main question here is whether this approach will scale to
> something like kmalloc() or the page allocator in production
> environments. For any serious workload, the frequency of events is
> going to be pretty high.

Immediate Values patch series makes zero-overhead to tracepoint
while it's not used.

So, We have to implement to stop collect stastics way. it restore
zero overhead world.
We don't lose any performance by trace.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:55                     ` Wu Fengguang
@ 2009-04-28 10:11                       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28 10:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Ingo Molnar, Pekka Enberg, Andi Kleen,
	Steven Rostedt, Fr馘駻ic Weisbecker, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

> > > I am one of most strongly want guys to MM tracepoint. but No, many 
> > > cunstomer never permit to use drop_caches.
> > 
> > See my other mail i just sent: it would be a natural extension of 
> > tracing to also dump all current object state when tracing is turned 
> > on. That way no drop_caches is needed at all.
> 
> I can understand the merits here - I also did readahead
> tracing/accounting in _one_ piece of code. Very handy.
> 
> The readahead traces are now raw printks - converting to the ftrace
> framework would be a big win.
>
> But. It's still not a fit-all solution. Imagine when full data _since_
> booting is required, but the user cannot afford a reboot.
> 
> > But it has to be expressed in one framework that cares about the 
> > totality of the kernel - not just these splintered bits of 
> > instrumentation and pieces of statistics.
> 
> Though minded to push the kpageflags interface, I totally agree the
> above fine principle and discipline :-)

Yeah.
I totally agree your claim.

I'm interest to both ftrace based readahead tracer and this patch :)





^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28 10:11                       ` KOSAKI Motohiro
  0 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28 10:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Ingo Molnar, Pekka Enberg, Andi Kleen,
	Steven Rostedt, Fr馘駻ic Weisbecker, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

> > > I am one of most strongly want guys to MM tracepoint. but No, many 
> > > cunstomer never permit to use drop_caches.
> > 
> > See my other mail i just sent: it would be a natural extension of 
> > tracing to also dump all current object state when tracing is turned 
> > on. That way no drop_caches is needed at all.
> 
> I can understand the merits here - I also did readahead
> tracing/accounting in _one_ piece of code. Very handy.
> 
> The readahead traces are now raw printks - converting to the ftrace
> framework would be a big win.
>
> But. It's still not a fit-all solution. Imagine when full data _since_
> booting is required, but the user cannot afford a reboot.
> 
> > But it has to be expressed in one framework that cares about the 
> > totality of the kernel - not just these splintered bits of 
> > instrumentation and pieces of statistics.
> 
> Though minded to push the kpageflags interface, I totally agree the
> above fine principle and discipline :-)

Yeah.
I totally agree your claim.

I'm interest to both ftrace based readahead tracer and this patch :)




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:38                   ` Ingo Molnar
@ 2009-04-28 10:18                     ` Andi Kleen
  -1 siblings, 0 replies; 137+ messages in thread
From: Andi Kleen @ 2009-04-28 10:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Wu Fengguang,
	Steven Rostedt, Fr馘駻ic Weisbecker, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

> But it has to be expressed in one framework that cares about the 
> totality of the kernel - not just these splintered bits of 

Can you perhaps expand a bit what code that framework would
provide to kpageflags? As far as I can see it only needs
a ->read callback from somewhere and I admit it's hard to see for me
how that could share much code with anything else.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28 10:18                     ` Andi Kleen
  0 siblings, 0 replies; 137+ messages in thread
From: Andi Kleen @ 2009-04-28 10:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Wu Fengguang,
	Steven Rostedt, Fr馘駻ic Weisbecker, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

> But it has to be expressed in one framework that cares about the 
> totality of the kernel - not just these splintered bits of 

Can you perhaps expand a bit what code that framework would
provide to kpageflags? As far as I can see it only needs
a ->read callback from somewhere and I admit it's hard to see for me
how that could share much code with anything else.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 10:10                     ` KOSAKI Motohiro
@ 2009-04-28 10:21                       ` Pekka Enberg
  -1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28 10:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ingo Molnar, Andi Kleen, Wu Fengguang, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm

Hi!

2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>> I guess the main question here is whether this approach will scale to
>> something like kmalloc() or the page allocator in production
>> environments. For any serious workload, the frequency of events is
>> going to be pretty high.
>
> Immediate Values patch series makes zero-overhead to tracepoint
> while it's not used.
>
> So, We have to implement to stop collect stastics way. it restore
> zero overhead world.
> We don't lose any performance by trace.

Sure but I meant the _enabled_ case here. kmalloc() (and the page
allocator to some extent) is very performance sensitive in many
workloads so you probably don't want to use tracepoints if you're
collecting some overall statistics (i.e. tracing all events) like we
do here.

                                      Pekka

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28 10:21                       ` Pekka Enberg
  0 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28 10:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ingo Molnar, Andi Kleen, Wu Fengguang, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm

Hi!

2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>> I guess the main question here is whether this approach will scale to
>> something like kmalloc() or the page allocator in production
>> environments. For any serious workload, the frequency of events is
>> going to be pretty high.
>
> Immediate Values patch series makes zero-overhead to tracepoint
> while it's not used.
>
> So, We have to implement to stop collect stastics way. it restore
> zero overhead world.
> We don't lose any performance by trace.

Sure but I meant the _enabled_ case here. kmalloc() (and the page
allocator to some extent) is very performance sensitive in many
workloads so you probably don't want to use tracepoints if you're
collecting some overall statistics (i.e. tracing all events) like we
do here.

                                      Pekka

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 10:21                       ` Pekka Enberg
@ 2009-04-28 10:56                         ` Ingo Molnar
  -1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 10:56 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: KOSAKI Motohiro, Andi Kleen, Wu Fengguang, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm


* Pekka Enberg <penberg@cs.helsinki.fi> wrote:

> Hi!
> 
> 2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> >> I guess the main question here is whether this approach will scale to
> >> something like kmalloc() or the page allocator in production
> >> environments. For any serious workload, the frequency of events is
> >> going to be pretty high.
> >
> > Immediate Values patch series makes zero-overhead to tracepoint
> > while it's not used.
> >
> > So, We have to implement to stop collect stastics way. it restore
> > zero overhead world.
> > We don't lose any performance by trace.
> 
> Sure but I meant the _enabled_ case here. kmalloc() (and the page 
> allocator to some extent) is very performance sensitive in many 
> workloads so you probably don't want to use tracepoints if you're 
> collecting some overall statistics (i.e. tracing all events) like 
> we do here.

That's where 'collect current state' kind of tracepoints would help 
- they could be used even without enabling any of the other 
tracepoints. And they'd still be in a coherent whole with the 
dynamic-events tracepoints.

So i'm not arguing against these techniques at all - and we can move 
on a wide scale from zero-overhead to lots-of-tracing-enabled models 
- what i'm arguing against is the splintering.

	Ingo

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28 10:56                         ` Ingo Molnar
  0 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 10:56 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: KOSAKI Motohiro, Andi Kleen, Wu Fengguang, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm


* Pekka Enberg <penberg@cs.helsinki.fi> wrote:

> Hi!
> 
> 2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> >> I guess the main question here is whether this approach will scale to
> >> something like kmalloc() or the page allocator in production
> >> environments. For any serious workload, the frequency of events is
> >> going to be pretty high.
> >
> > Immediate Values patch series makes zero-overhead to tracepoint
> > while it's not used.
> >
> > So, We have to implement to stop collect stastics way. it restore
> > zero overhead world.
> > We don't lose any performance by trace.
> 
> Sure but I meant the _enabled_ case here. kmalloc() (and the page 
> allocator to some extent) is very performance sensitive in many 
> workloads so you probably don't want to use tracepoints if you're 
> collecting some overall statistics (i.e. tracing all events) like 
> we do here.

That's where 'collect current state' kind of tracepoints would help 
- they could be used even without enabling any of the other 
tracepoints. And they'd still be in a coherent whole with the 
dynamic-events tracepoints.

So i'm not arguing against these techniques at all - and we can move 
on a wide scale from zero-overhead to lots-of-tracing-enabled models 
- what i'm arguing against is the splintering.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:57                   ` Pekka Enberg
@ 2009-04-28 11:03                     ` Ingo Molnar
  -1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 11:03 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Pekka Enberg <penberg@cs.helsinki.fi> wrote:

> Hi Ingo,
> 
> On Tue, Apr 28, 2009 at 12:36 PM, Ingo Molnar <mingo@elte.hu> wrote:
> > I 'integrate' traces all the time to get summary counts. This series
> > of dynamic events:
> >
> >  allocation
> >  page count up
> >  page count up
> >  page count down
> >  page count up
> >  page count up
> >  page count up
> >  page count up
> >
> > integrates into: "page count is 6".
> >
> > Note that "integration" can be done wholly in the kernel too, 
> > without going to the overhead of streaming all dynamic events to 
> > user-space, just to summarize data into counts, in-kernel. That 
> > is what the ftrace statistics framework and various ftrace 
> > plugins are about.
> >
> > Also, it might make sense to extend the framework with a series 
> > of 'get current object state' events when tracing is turned on. 
> > A special case of _that_ would in essence be what the /proc hack 
> > does now - just expressed in a much more generic, and a much 
> > more usable form.
> 
> I guess the main question here is whether this approach will scale 
> to something like kmalloc() or the page allocator in production 
> environments. For any serious workload, the frequency of events is 
> going to be pretty high.

it depends on the level of integration. If the integration is done 
right at the tracepoint callback, performance overhead will be very 
small. If everything is traced and then streamed to user-space then 
there is going to be noticeable overhead starting somewhere around a 
few hundred thousand events per second per cpu.

Note that the 'get object state' approach i outlined above in the 
final paragraph has no runtime overhead at all. As long as 'object 
state' only covers fields that we maintain already for normal kernel 
functionality, it costs nothing to allow the passive sampling of 
that state. The /proc patch is a subset of such a facility in 
essence.

	Ingo

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28 11:03                     ` Ingo Molnar
  0 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 11:03 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Matt Mackall, Alexey Dobriyan, linux-mm


* Pekka Enberg <penberg@cs.helsinki.fi> wrote:

> Hi Ingo,
> 
> On Tue, Apr 28, 2009 at 12:36 PM, Ingo Molnar <mingo@elte.hu> wrote:
> > I 'integrate' traces all the time to get summary counts. This series
> > of dynamic events:
> >
> >  allocation
> >  page count up
> >  page count up
> >  page count down
> >  page count up
> >  page count up
> >  page count up
> >  page count up
> >
> > integrates into: "page count is 6".
> >
> > Note that "integration" can be done wholly in the kernel too, 
> > without going to the overhead of streaming all dynamic events to 
> > user-space, just to summarize data into counts, in-kernel. That 
> > is what the ftrace statistics framework and various ftrace 
> > plugins are about.
> >
> > Also, it might make sense to extend the framework with a series 
> > of 'get current object state' events when tracing is turned on. 
> > A special case of _that_ would in essence be what the /proc hack 
> > does now - just expressed in a much more generic, and a much 
> > more usable form.
> 
> I guess the main question here is whether this approach will scale 
> to something like kmalloc() or the page allocator in production 
> environments. For any serious workload, the frequency of events is 
> going to be pretty high.

it depends on the level of integration. If the integration is done 
right at the tracepoint callback, performance overhead will be very 
small. If everything is traced and then streamed to user-space then 
there is going to be noticeable overhead starting somewhere around a 
few hundred thousand events per second per cpu.

Note that the 'get object state' approach i outlined above in the 
final paragraph has no runtime overhead at all. As long as 'object 
state' only covers fields that we maintain already for normal kernel 
functionality, it costs nothing to allow the passive sampling of 
that state. The /proc patch is a subset of such a facility in 
essence.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:55                     ` Wu Fengguang
@ 2009-04-28 11:05                       ` Ingo Molnar
  -1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 11:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> > See my other mail i just sent: it would be a natural extension 
> > of tracing to also dump all current object state when tracing is 
> > turned on. That way no drop_caches is needed at all.
> 
> I can understand the merits here - I also did readahead 
> tracing/accounting in _one_ piece of code. Very handy.
> 
> The readahead traces are now raw printks - converting to the 
> ftrace framework would be a big win.
> 
> But. It's still not a fit-all solution. Imagine when full data 
> _since_ booting is required, but the user cannot afford a reboot.

The above 'get object state' interface (which allows passive 
sampling) - integrated into the tracing framework - would serve that 
goal, agreed?

	Ingo

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28 11:05                       ` Ingo Molnar
  0 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 11:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> > See my other mail i just sent: it would be a natural extension 
> > of tracing to also dump all current object state when tracing is 
> > turned on. That way no drop_caches is needed at all.
> 
> I can understand the merits here - I also did readahead 
> tracing/accounting in _one_ piece of code. Very handy.
> 
> The readahead traces are now raw printks - converting to the 
> ftrace framework would be a big win.
> 
> But. It's still not a fit-all solution. Imagine when full data 
> _since_ booting is required, but the user cannot afford a reboot.

The above 'get object state' interface (which allows passive 
sampling) - integrated into the tracing framework - would serve that 
goal, agreed?

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 10:56                         ` Ingo Molnar
@ 2009-04-28 11:09                           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28 11:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Andi Kleen, Wu Fengguang, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm

2009/4/28 Ingo Molnar <mingo@elte.hu>:
>
> * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>
>> Hi!
>>
>> 2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>> >> I guess the main question here is whether this approach will scale to
>> >> something like kmalloc() or the page allocator in production
>> >> environments. For any serious workload, the frequency of events is
>> >> going to be pretty high.
>> >
>> > Immediate Values patch series makes zero-overhead to tracepoint
>> > while it's not used.
>> >
>> > So, We have to implement to stop collect stastics way. it restore
>> > zero overhead world.
>> > We don't lose any performance by trace.
>>
>> Sure but I meant the _enabled_ case here. kmalloc() (and the page
>> allocator to some extent) is very performance sensitive in many
>> workloads so you probably don't want to use tracepoints if you're
>> collecting some overall statistics (i.e. tracing all events) like
>> we do here.
>
> That's where 'collect current state' kind of tracepoints would help
> - they could be used even without enabling any of the other
> tracepoints. And they'd still be in a coherent whole with the
> dynamic-events tracepoints.
>
> So i'm not arguing against these techniques at all - and we can move
> on a wide scale from zero-overhead to lots-of-tracing-enabled models
> - what i'm arguing against is the splintering.

umm.
I guess Pekka and you talk about different thing.

if tracepoint is ON, tracepoint makes one function call. but few hot spot don't
have patience to one function call overhead.

scheduler stat and slab stat are one of good example, I think.

I really don't want convert slab_stat and sched_stat to ftrace base stastics.
currently it don't need extra function call and it only touch per-cpu variable.
So, a overhead is extream small.

Unfortunately, tracepoint still don't reach this extream performance.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28 11:09                           ` KOSAKI Motohiro
  0 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28 11:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Andi Kleen, Wu Fengguang, Steven Rostedt,
	Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm

2009/4/28 Ingo Molnar <mingo@elte.hu>:
>
> * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>
>> Hi!
>>
>> 2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>> >> I guess the main question here is whether this approach will scale to
>> >> something like kmalloc() or the page allocator in production
>> >> environments. For any serious workload, the frequency of events is
>> >> going to be pretty high.
>> >
>> > Immediate Values patch series makes zero-overhead to tracepoint
>> > while it's not used.
>> >
>> > So, We have to implement to stop collect stastics way. it restore
>> > zero overhead world.
>> > We don't lose any performance by trace.
>>
>> Sure but I meant the _enabled_ case here. kmalloc() (and the page
>> allocator to some extent) is very performance sensitive in many
>> workloads so you probably don't want to use tracepoints if you're
>> collecting some overall statistics (i.e. tracing all events) like
>> we do here.
>
> That's where 'collect current state' kind of tracepoints would help
> - they could be used even without enabling any of the other
> tracepoints. And they'd still be in a coherent whole with the
> dynamic-events tracepoints.
>
> So i'm not arguing against these techniques at all - and we can move
> on a wide scale from zero-overhead to lots-of-tracing-enabled models
> - what i'm arguing against is the splintering.

umm.
I guess Pekka and you talk about different thing.

if tracepoint is ON, tracepoint makes one function call. but few hot spot don't
have patience to one function call overhead.

scheduler stat and slab stat are one of good example, I think.

I really don't want convert slab_stat and sched_stat to ftrace base stastics.
currently it don't need extra function call and it only touch per-cpu variable.
So, a overhead is extream small.

Unfortunately, tracepoint still don't reach this extream performance.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 11:05                       ` Ingo Molnar
@ 2009-04-28 11:36                         ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28 11:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm

On Tue, Apr 28, 2009 at 07:05:53PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > See my other mail i just sent: it would be a natural extension 
> > > of tracing to also dump all current object state when tracing is 
> > > turned on. That way no drop_caches is needed at all.
> > 
> > I can understand the merits here - I also did readahead 
> > tracing/accounting in _one_ piece of code. Very handy.
> > 
> > The readahead traces are now raw printks - converting to the 
> > ftrace framework would be a big win.
> > 
> > But. It's still not a fit-all solution. Imagine when full data 
> > _since_ booting is required, but the user cannot afford a reboot.
> 
> The above 'get object state' interface (which allows passive 
> sampling) - integrated into the tracing framework - would serve that 
> goal, agreed?

Agreed. That could in theory be a good complement to dynamic tracing.

Then what will be the canonical form for all the 'get object state'
interfaces - "object.attr=value", or whatever? I'm afraid we will have
to sacrifice either efficiency or human readability to get a normalized
form. Or should we define two standard forms: one "key value" form and
one "value1 value2 value3 ..." form?
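
For example, with made-up values just to illustrate the two candidates:

  # "key value" form - self-describing and greppable, but verbose:
  page.pfn=0x1000 page.flags=0x868 page.count=2 page.mapcount=1

  # positional form - compact, but needs a separate format descriptor:
  0x1000 0x868 2 1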

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 137+ messages in thread

* [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
  2009-04-28 11:36                         ` Wu Fengguang
@ 2009-04-28 12:17                           ` Ingo Molnar
  -1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 12:17 UTC (permalink / raw)
  To: Wu Fengguang, Li Zefan, Tom Zanussi
  Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> > The above 'get object state' interface (which allows passive 
> > sampling) - integrated into the tracing framework - would serve 
> > that goal, agreed?
> 
> Agreed. That could in theory be a good complement to dynamic
> tracing.
> 
> Then what will be the canonical form for all the 'get object 
> state' interfaces - "object.attr=value", or whatever? [...]

Lemme outline what i'm thinking of.

I'd call the feature "object collection tracing", which would live 
in /debug/tracing, accessed via such files:

  /debug/tracing/objects/mm/pages/
  /debug/tracing/objects/mm/pages/format
  /debug/tracing/objects/mm/pages/filter
  /debug/tracing/objects/mm/pages/trace_pipe
  /debug/tracing/objects/mm/pages/stats
  /debug/tracing/objects/mm/pages/events/

here's the (proposed) semantics of those files:

1) /debug/tracing/objects/mm/pages/

There's a subsystem / object basic directory structure to make it 
easy and intuitive to find our way around there.

2) /debug/tracing/objects/mm/pages/format

the format file:

  /debug/tracing/objects/mm/pages/format

Would reuse the existing dynamic-tracepoint structured-logging 
descriptor format and code (this is upstream already):

 [root@phoenix sched_signal_send]# pwd
 /debug/tracing/events/sched/sched_signal_send

 [root@phoenix sched_signal_send]# cat format 
 name: sched_signal_send
 ID: 24
 format:
	field:unsigned short common_type;		offset:0;	size:2;
	field:unsigned char common_flags;		offset:2;	size:1;
	field:unsigned char common_preempt_count;	offset:3;	size:1;
	field:int common_pid;				offset:4;	size:4;
	field:int common_tgid;				offset:8;	size:4;

	field:int sig;					offset:12;	size:4;
	field:char comm[TASK_COMM_LEN];			offset:16;	size:16;
	field:pid_t pid;				offset:32;	size:4;

 print fmt: "sig: %d  task %s:%d", REC->sig, REC->comm, REC->pid

These format descriptors enumerate fields, types and sizes, in a 
structured way that user-space tools can parse easily. (The binary 
records that come from the trace_pipe file follow this format 
description.)

3) /debug/tracing/objects/mm/pages/filter

This is the tracing filter that can be set based on the 'format' 
descriptor. So with the above (signal-send tracepoint) you can 
define such filter expressions:

  echo "(sig == 10 && comm == bash) || sig == 13" > filter

To restrict the 'scope' of the object collection along pretty much 
any key or combination of keys. (Or you can leave it as it is and 
dump all objects and do keying in user-space.)

[ Using in-kernel filtering is obviously faster than streaming it
  out to user-space - but there might be details and types of
  visualization you want to do in user-space - so we don't want to
  restrict things here. ]

For the mm object collection tracepoint i could imagine such filter 
expressions:

  echo "type == shared && file == /sbin/init" > filter

To dump all shared pages that are mapped to /sbin/init.

4) /debug/tracing/objects/mm/pages/trace_pipe

The 'trace_pipe' file can be used to dump all objects in the
collection that match the filter ('all objects' by default). The
record format is described in 'format'.

trace_pipe would be a reuse of the existing trace_pipe code: it is a 
modern, poll()-able, read()-able, splice()-able pipe abstraction.
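
As a quick user-space sketch of draining that pipe (the path is the
proposed one from above - it does not exist in any kernel yet - and
the per-record parsing against 'format' is left out):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* proposed path - an assumption of this RFC, not a real file */
          int fd = open("/debug/tracing/objects/mm/pages/trace_pipe", O_RDONLY);
          char buf[4096];
          ssize_t n;

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* binary records; field offsets/sizes come from 'format' */
          while ((n = read(fd, buf, sizeof(buf))) > 0)
                  fwrite(buf, 1, n, stdout);
          close(fd);
          return 0;
  }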

5) /debug/tracing/objects/mm/pages/stats

The 'stats' file would be a reuse of the existing histogram code in
the tracing framework. We already make use of it for the branch
tracers and for the workqueue tracer - it could be extended to be
applicable to object collections as well.

The advantage there would be that there's no dumping at all - all
the integration is done straight in the kernel. (The 'filter'
condition is still honored - increasing flexibility. The filter file
could perhaps also act as a default histogram key.)

6) /debug/tracing/objects/mm/pages/events/

The 'events' directory offers links back to existing dynamic 
tracepoints that are under /debug/tracing/events/. This would serve 
as an additional coherent force that keeps dynamic tracepoints 
collected by subsystem and by object type as well. (Tools could make 
use of this information as well - without being aware of actual 
object semantics.)


There would be a number of other object collections we could 
enumerate:

 tasks:

  /debug/tracing/objects/sched/tasks/

 active inodes known to the kernel:

  /debug/tracing/objects/fs/inodes/

 interrupts:

  /debug/tracing/objects/hw/irqs/

etc.

These would use the same 'object collection' framework. Once done we
can use it for many other things too.

Note how organically integrated it all is with the tracing 
framework. You could start from an 'object view' to get an overview 
and then go towards a more dynamic view of specific object 
attributes (or specific objects), as you drill down on a specific 
problem you want to analyze.

How does this all sound to you?

Can you see any conceptual holes in the scheme, any use-case that 
/proc/kpageflags supports but the object collection approach does 
not?

Would you be interested in seeing something like this, if we tried 
to implement it in the tracing tree? The majority of the code 
already exists, we just need interest from the MM side and we have 
to hook it all up. (it is by no means trivial to do - but looks like 
a very exciting feature.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 11:09                           ` KOSAKI Motohiro
@ 2009-04-28 12:42                             ` Ingo Molnar
  -1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 12:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Pekka Enberg, Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
	Alexey Dobriyan, linux-mm


* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> 2009/4/28 Ingo Molnar <mingo@elte.hu>:
> >
> > * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> >
> >> Hi!
> >>
> >> 2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> >> >> I guess the main question here is whether this approach will scale to
> >> >> something like kmalloc() or the page allocator in production
> >> >> environments. For any serious workload, the frequency of events is
> >> >> going to be pretty high.
> >> >
> >> > The Immediate Values patch series makes tracepoints zero-overhead
> >> > while they're not used.
> >> >
> >> > So we have to implement a way to stop collecting statistics; that
> >> > restores the zero-overhead world.
> >> > We don't lose any performance by tracing.
> >>
> >> Sure but I meant the _enabled_ case here. kmalloc() (and the page
> >> allocator to some extent) is very performance sensitive in many
> >> workloads so you probably don't want to use tracepoints if you're
> >> collecting some overall statistics (i.e. tracing all events) like
> >> we do here.
> >
> > That's where 'collect current state' kind of tracepoints would help
> > - they could be used even without enabling any of the other
> > tracepoints. And they'd still be in a coherent whole with the
> > dynamic-events tracepoints.
> >
> > So i'm not arguing against these techniques at all - and we can move
> > on a wide scale from zero-overhead to lots-of-tracing-enabled models
> > - what i'm arguing against is the splintering.
> 
> Umm.
> I guess Pekka and you are talking about different things.
>
> If a tracepoint is ON, it makes one function call, but a few hot
> spots cannot tolerate even one function call of overhead.
>
> Scheduler stats and slab stats are good examples, I think.
>
> I really don't want to convert slab_stat and sched_stat to
> ftrace-based statistics. Currently they need no extra function call
> and only touch per-cpu variables, so the overhead is extremely small.
>
> Unfortunately, tracepoints still don't reach this extreme
> performance.

I understand that - please see my "[rfc] object collection tracing"
reply in this thread for a more detailed description of what i meant
by 'object state tracing'.

	Ingo

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
  2009-04-28 12:17                           ` Ingo Molnar
@ 2009-04-28 13:31                             ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28 13:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Li Zefan, Tom Zanussi, KOSAKI Motohiro, Pekka Enberg, Andi Kleen,
	Steven Rostedt, Frédéric Weisbecker, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> 
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > The above 'get object state' interface (which allows passive 
> > > sampling) - integrated into the tracing framework - would serve 
> > > that goal, agreed?
> > 
> > Agreed. That could in theory be a good complement to dynamic
> > tracing.
> > 
> > Then what will be the canonical form for all the 'get object 
> > state' interfaces - "object.attr=value", or whatever? [...]
> 
> Lemme outline what i'm thinking of.
> 
> I'd call the feature "object collection tracing", which would live 
> in /debug/tracing, accessed via such files:
> 
>   /debug/tracing/objects/mm/pages/
>   /debug/tracing/objects/mm/pages/format
>   /debug/tracing/objects/mm/pages/filter
>   /debug/tracing/objects/mm/pages/trace_pipe
>   /debug/tracing/objects/mm/pages/stats
>   /debug/tracing/objects/mm/pages/events/
> 
> here's the (proposed) semantics of those files:
> 
> 1) /debug/tracing/objects/mm/pages/
> 
> There's a subsystem / object basic directory structure to make it 
> easy and intuitive to find our way around there.
> 
> 2) /debug/tracing/objects/mm/pages/format
> 
> the format file:
> 
>   /debug/tracing/objects/mm/pages/format
> 
> Would reuse the existing dynamic-tracepoint structured-logging 
> descriptor format and code (this is upstream already):
> 
>  [root@phoenix sched_signal_send]# pwd
>  /debug/tracing/events/sched/sched_signal_send
> 
>  [root@phoenix sched_signal_send]# cat format 
>  name: sched_signal_send
>  ID: 24
>  format:
> 	field:unsigned short common_type;		offset:0;	size:2;
> 	field:unsigned char common_flags;		offset:2;	size:1;
> 	field:unsigned char common_preempt_count;	offset:3;	size:1;
> 	field:int common_pid;				offset:4;	size:4;
> 	field:int common_tgid;				offset:8;	size:4;
> 
> 	field:int sig;					offset:12;	size:4;
> 	field:char comm[TASK_COMM_LEN];			offset:16;	size:16;
> 	field:pid_t pid;				offset:32;	size:4;
> 
>  print fmt: "sig: %d  task %s:%d", REC->sig, REC->comm, REC->pid
> 
> These format descriptors enumerate fields, types and sizes, in a 
> structured way that user-space tools can parse easily. (The binary 
> records that come from the trace_pipe file follow this format 
> description.)
> 
> 3) /debug/tracing/objects/mm/pages/filter
> 
> This is the tracing filter that can be set based on the 'format' 
> descriptor. So with the above (signal-send tracepoint) you can 
> define such filter expressions:
> 
>   echo "(sig == 10 && comm == bash) || sig == 13" > filter
> 
> To restrict the 'scope' of the object collection along pretty much 
> any key or combination of keys. (Or you can leave it as it is and 
> dump all objects and do keying in user-space.)
> 
> [ Using in-kernel filtering is obviously faster than streaming it
>   out to user-space - but there might be details and types of
>   visualization you want to do in user-space - so we don't want to
>   restrict things here. ]
> 
> For the mm object collection tracepoint i could imagine such filter 
> expressions:
> 
>   echo "type == shared && file == /sbin/init" > filter
> 
> To dump all shared pages that are mapped to /sbin/init.
> 
> 4) /debug/tracing/objects/mm/pages/trace_pipe
> 
> The 'trace_pipe' file can be used to dump all objects in the
> collection that match the filter ('all objects' by default). The
> record format is described in 'format'.
> 
> trace_pipe would be a reuse of the existing trace_pipe code: it is a 
> modern, poll()-able, read()-able, splice()-able pipe abstraction.
> 
> 5) /debug/tracing/objects/mm/pages/stats
> 
> The 'stats' file would be a reuse of the existing histogram code in
> the tracing framework. We already make use of it for the branch
> tracers and for the workqueue tracer - it could be extended to be
> applicable to object collections as well.
> 
> The advantage there would be that there's no dumping at all - all
> the integration is done straight in the kernel. (The 'filter'
> condition is still honored - increasing flexibility. The filter file
> could perhaps also act as a default histogram key.)
> 
> 6) /debug/tracing/objects/mm/pages/events/
> 
> The 'events' directory offers links back to existing dynamic 
> tracepoints that are under /debug/tracing/events/. This would serve 
> as an additional coherent force that keeps dynamic tracepoints 
> collected by subsystem and by object type as well. (Tools could make 
> use of this information as well - without being aware of actual 
> object semantics.)
> 
> 
> There would be a number of other object collections we could 
> enumerate:
> 
>  tasks:
> 
>   /debug/tracing/objects/sched/tasks/
> 
>  active inodes known to the kernel:
> 
>   /debug/tracing/objects/fs/inodes/
> 
>  interrupts:
> 
>   /debug/tracing/objects/hw/irqs/
> 
> etc.
> 
> These would use the same 'object collection' framework. Once done we
> can use it for many other things too.
> 
> Note how organically integrated it all is with the tracing 
> framework. You could start from an 'object view' to get an overview 
> and then go towards a more dynamic view of specific object 
> attributes (or specific objects), as you drill down on a specific 
> problem you want to analyze.
> 
> How does this all sound to you?

Great! I see a lot of opportunity to adapt the not-yet-submitted
/proc/filecache interface to the proposed framework.

Its basic form is:

#      ino       size   cached cached% refcnt state       age accessed  process         dev             file
[snip]
       320          1        4     100      1    D-     50443     1085 udevd           00:11(tmpfs)     /.udev/uevent_seqnum
    460725        123      124     100     35    --     50444     6795 touch           08:02(sda2)      /lib/libpthread-2.9.so
    460727         31       32     100     14    --     50444     2007 touch           08:02(sda2)      /lib/librt-2.9.so
    458865         97       80      82      1    --     50444       49 mount           08:02(sda2)      /lib/libdevmapper.so.1.02.1
    460090         15       16     100      1    --     50444       48 mount           08:02(sda2)      /lib/libuuid.so.1.2
    458866         46       48     100      1    --     50444       47 mount           08:02(sda2)      /lib/libblkid.so.1.0
    460732         43       44     100     69    --     50444     3581 rcS             08:02(sda2)      /lib/libnss_nis-2.9.so
    460739         87       88     100     73    --     50444     3597 rcS             08:02(sda2)      /lib/libnsl-2.9.so
    460726         31       32     100     69    --     50444     3581 rcS             08:02(sda2)      /lib/libnss_compat-2.9.so
    458804        250      252     100     11    --     50445     8175 rcS             08:02(sda2)      /lib/libncurses.so.5.6
    229540        780      752      96      3    --     50445     7594 init            08:02(sda2)      /bin/bash
    460735         15       16     100     89    --     50445    17581 init            08:02(sda2)      /lib/libdl-2.9.so
    460721       1344     1340      99    117    --     50445    48732 init            08:02(sda2)      /lib/libc-2.9.so
    458801        107      104      97     24    --     50445     3586 init            08:02(sda2)      /lib/libselinux.so.1
    671870         37       24      65      1    --     50446        1 swapper         08:02(sda2)      /sbin/init
       175          1    24412     100      1    --     50446        0 swapper         00:01(rootfs)    /dev/root

The patch basically does a traversal through one or more of the inode
lists to produce the output:
        inode_in_use
        inode_unused
        sb->s_dirty
        sb->s_io
        sb->s_more_io
        sb->s_inodes

The filtering feature is a necessity for this interface - otherwise a
full listing takes considerable time. It supports the following
filters:
        { LS_OPT_DIRTY,         "dirty"         },
        { LS_OPT_CLEAN,         "clean"         },
        { LS_OPT_INUSE,         "inuse"         },
        { LS_OPT_EMPTY,         "empty"         },
        { LS_OPT_ALL,           "all"           },
        { LS_OPT_DEV,           "dev=%s"        },

There are two possible challenges for the conversion:

- One trick it does is to select different lists to traverse depending
  on the filter options. Will this be possible in the object tracing
  framework?
- The file name lookup (the last field) is the performance killer. Is
  it possible to skip the file name lookup when the filter has already
  failed on the leading fields?

Will the object tracing interface allow such flexibility?
(Sorry, I'm not yet familiar with the tracing framework.)
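
For the second point, the short-circuit I have in mind looks roughly
like this (a user-space sketch with made-up names - nothing here is
existing ftrace or kernel API):

  #include <stdbool.h>
  #include <stdio.h>

  struct obj {
          int state;               /* cheap, fixed-offset field */
          const char *path;        /* stands in for the costly name lookup */
  };

  static bool filter_cheap(const struct obj *o, int want_state)
  {
          return o->state == want_state;
  }

  static const char *lookup_name(const struct obj *o)
  {
          /* in the kernel this would be the expensive d_path()-style walk */
          return o->path;
  }

  static void emit(const struct obj *o, int want_state)
  {
          if (!filter_cheap(o, want_state))
                  return;          /* name lookup skipped entirely */
          printf("state=%d file=%s\n", o->state, lookup_name(o));
  }

  int main(void)
  {
          struct obj a = { 1, "/lib/libc-2.9.so" };
          struct obj b = { 0, "/bin/bash" };

          emit(&a, 1);
          emit(&b, 1);             /* filtered out before the lookup */
          return 0;
  }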

> Can you see any conceptual holes in the scheme, any use-case that 
> /proc/kpageflags supports but the object collection approach does 
> not?

kpageflags is simply a big (perhaps sparse) binary array.
I'd still prefer to retain its current form - the kernel patches and
user-space tools are all ready-made, and I see no benefit in
converting to the tracing framework.

> Would you be interested in seeing something like this, if we tried 
> to implement it in the tracing tree? The majority of the code 
> already exists, we just need interest from the MM side and we have 
> to hook it all up. (it is by no means trivial to do - but looks like
> a very exciting feature.)

Definitely! /proc/filecache has another 'page view':

        # head /proc/filecache
        # file /bin/bash
        # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback X:readahead P:private O:owner b:buffer d:dirty w:writeback
        # idx   len     state           refcnt
        0       1       RAMU________    4
        3       8       RAMU________    4
        12      1       RAMU________    4
        14      5       RAMU________    4
        20      7       RAMU________    4
        27      2       RAMU________    5
        29      1       RAMU________    4

This is also a good candidate. However, I still need to investigate
whether it offers a considerable margin over the mincore() syscall.
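
For reference, the mincore() side of that comparison is a small
program along these lines (a sketch; most error handling trimmed):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          long psize = sysconf(_SC_PAGESIZE);
          struct stat st;
          size_t pages, i, cached = 0;
          unsigned char *vec;
          void *map;
          int fd;

          if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                  return 1;
          if (fstat(fd, &st) < 0 || st.st_size == 0)
                  return 1;
          pages = (st.st_size + psize - 1) / psize;
          map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
          vec = malloc(pages);
          if (map == MAP_FAILED || !vec || mincore(map, st.st_size, vec) < 0)
                  return 1;
          for (i = 0; i < pages; i++)
                  cached += vec[i] & 1;       /* bit 0: page is resident */
          printf("%s: %zu/%zu pages in cache\n", argv[1], cached, pages);
          return 0;
  }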

Thanks and Regards,
Fengguang

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  9:36                 ` Ingo Molnar
@ 2009-04-28 17:42                   ` Matt Mackall
  -1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 17:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Andi Kleen, Wu Fengguang, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Alexey Dobriyan, linux-mm

On Tue, 2009-04-28 at 11:36 +0200, Ingo Molnar wrote:
> I 'integrate' traces all the time to get summary counts. This series 
> of dynamic events:
> 
>   allocation
>   page count up
>   page count up
>   page count down
>   page count up
>   page count up
>   page count up
>   page count up
> 
> integrates into: "page count is 6".

Perhaps you've failed calculus. The integral is 6 + C.

This is a critical distinction. Tracing is great for looking at changes,
but it completely falls down for static system-wide measurements because
it would require integrating from time=0 to get a meaningful summation.
That's completely useless for taking a measurement on a system that
already has an uptime of months. 

Never mind that summing up page flag changes for every page on the
system since boot time through the trace interface is incredibly
wasteful given that we're keeping a per-page integral in struct page
anyway.

Tracing is not the answer for everything.

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  1:09   ` Wu Fengguang
@ 2009-04-28 17:49     ` Matt Mackall
  -1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 17:49 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Alexey Dobriyan, linux-mm

On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> plain text document attachment (kpageflags-extending.patch)
> Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.

My only concern with this patch is that it knows a bit too much about
SLUB internals (and perhaps not enough about SLOB, which also overloads
flags). 

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  8:33       ` Wu Fengguang
@ 2009-04-28 18:11         ` Tony Luck
  -1 siblings, 0 replies; 137+ messages in thread
From: Tony Luck @ 2009-04-28 18:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Ingo Molnar, Steven Rostedt, Frédéric Weisbecker,
	Larry Woodman, Peter Zijlstra, Pekka Enberg,
	Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
	Andi Kleen, Matt Mackall, Alexey Dobriyan, linux-mm

On Tue, Apr 28, 2009 at 1:33 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> 1) FAST
>
> It takes merely 0.2s to scan 4GB pages:
>
>        ./page-types  0.02s user 0.20s system 99% cpu 0.216 total

OK on a tiny system ... but sounds painful on a big
server. 0.2s for 4GB scales up to 3 minutes 25 seconds
on a 4TB system (4TB systems were being sold two
years ago ... so by now the high end will have moved
up to 8TB or perhaps 16TB).

Would the resulting output be anything but noise on
a big system (a *lot* of pages can change state in
3 minutes)?

-Tony

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 18:11         ` Tony Luck
@ 2009-04-28 18:34           ` Matt Mackall
  -1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 18:34 UTC (permalink / raw)
  To: Tony Luck
  Cc: Wu Fengguang, Ingo Molnar, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Pekka Enberg, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	KOSAKI Motohiro, Andi Kleen, Alexey Dobriyan, linux-mm

On Tue, 2009-04-28 at 11:11 -0700, Tony Luck wrote:
> On Tue, Apr 28, 2009 at 1:33 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 1) FAST
> >
> > It takes merely 0.2s to scan 4GB pages:
> >
> >        ./page-types  0.02s user 0.20s system 99% cpu 0.216 total
> 
> OK on a tiny system ... but sounds painful on a big
> server. 0.2s for 4G scales up to 3 minutes 25 seconds
> on a 4TB system (4TB systems were being sold two
> years ago ... so by now the high end will have moved
> up to 8TB or perhaps 16TB).
> 
> Would the resulting output be anything but noise on
> a big system (a *lot* of pages can change state in
> 3 minutes)?

Bah. The rate of change is proportional to #cpus, not #pages. Assuming
you've got 1024 processors, you could run the scan in parallel in .2
seconds still.

It won't be an atomic snapshot, obviously. But stopping the whole
machine on a system that size is probably not what you want anyway.

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 18:34           ` Matt Mackall
@ 2009-04-28 20:47             ` Tony Luck
  -1 siblings, 0 replies; 137+ messages in thread
From: Tony Luck @ 2009-04-28 20:47 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Wu Fengguang, Ingo Molnar, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Pekka Enberg, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	KOSAKI Motohiro, Andi Kleen, Alexey Dobriyan, linux-mm

On Tue, Apr 28, 2009 at 11:34 AM, Matt Mackall <mpm@selenic.com> wrote:
> Bah. The rate of change is proportional to #cpus, not #pages. Assuming
> you've got 1024 processors, you could run the scan in parallel in .2
> seconds still.

That would help ... it would also make the patch to support this
functionality a lot more complex.

-Tony

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 20:47             ` Tony Luck
@ 2009-04-28 20:54               ` Andi Kleen
  -1 siblings, 0 replies; 137+ messages in thread
From: Andi Kleen @ 2009-04-28 20:54 UTC (permalink / raw)
  To: Tony Luck
  Cc: Matt Mackall, Wu Fengguang, Ingo Molnar, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Pekka Enberg, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	KOSAKI Motohiro, Andi Kleen, Alexey Dobriyan, linux-mm

On Tue, Apr 28, 2009 at 01:47:07PM -0700, Tony Luck wrote:
> On Tue, Apr 28, 2009 at 11:34 AM, Matt Mackall <mpm@selenic.com> wrote:
> > Bah. The rate of change is proportional to #cpus, not #pages. Assuming
> > you've got 1024 processors, you could run the scan in parallel in .2
> > seconds still.
> 
> That would help ... it would also make the patch to support this
> functionality a lot more complex.

I suspect 4TB memory users are used to some things running
a little slower. I'm not sure we really need to make every obscure
debugging functionality scale well to these systems too.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 20:47             ` Tony Luck
@ 2009-04-28 20:59               ` Matt Mackall
  -1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 20:59 UTC (permalink / raw)
  To: Tony Luck
  Cc: Wu Fengguang, Ingo Molnar, Steven Rostedt,
	Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
	Pekka Enberg, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	KOSAKI Motohiro, Andi Kleen, Alexey Dobriyan, linux-mm

On Tue, 2009-04-28 at 13:47 -0700, Tony Luck wrote:
> On Tue, Apr 28, 2009 at 11:34 AM, Matt Mackall <mpm@selenic.com> wrote:
> > Bah. The rate of change is proportional to #cpus, not #pages. Assuming
> > you've got 1024 processors, you could run the scan in parallel in .2
> > seconds still.
> 
> That would help ... it would also make the patch to support this
> functionality a lot more complex.

The kernel bits should handle this already today. You just need 1k
userspace threads to open /proc/kpageflags, seek() appropriately, and
read().
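
Something like this minimal sketch, say (untested; NTHREADS and
max_pfn are placeholders, and it uses pread() in place of
seek()+read()):

/* Split the PFN space across NTHREADS readers of /proc/kpageflags,
 * one u64 of flags per PFN.  A real tool would read large chunks
 * rather than one entry per syscall. */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

#define NTHREADS 16			/* scale with #cpus */

struct range { uint64_t start, end; };	/* PFNs [start, end) */

static void *scan(void *arg)
{
	struct range *r = arg;
	int fd = open("/proc/kpageflags", O_RDONLY);
	uint64_t flags, pfn;

	for (pfn = r->start; pfn < r->end; pfn++) {
		if (pread(fd, &flags, sizeof(flags),
			  pfn * sizeof(flags)) != sizeof(flags))
			break;
		/* ... accumulate per-flag counters here ... */
	}
	close(fd);
	return NULL;
}

int main(void)
{
	uint64_t max_pfn = 1 << 20;	/* 4GB of 4kB pages */
	pthread_t tid[NTHREADS];
	struct range r[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++) {
		r[i].start = max_pfn * i / NTHREADS;
		r[i].end = max_pfn * (i + 1) / NTHREADS;
		pthread_create(&tid[i], NULL, scan, &r[i]);
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}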

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 18:11         ` Tony Luck
@ 2009-04-28 21:17           ` Andrew Morton
  -1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-28 21:17 UTC (permalink / raw)
  To: Tony Luck
  Cc: fengguang.wu, mingo, rostedt, fweisbec, lwoodman, a.p.zijlstra,
	penberg, eduard.munteanu, linux-kernel, kosaki.motohiro, andi,
	mpm, adobriyan, linux-mm

On Tue, 28 Apr 2009 11:11:52 -0700
Tony Luck <tony.luck@gmail.com> wrote:

> On Tue, Apr 28, 2009 at 1:33 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 1) FAST
> >
> > It takes merely 0.2s to scan 4GB pages:
> >
> >         ./page-types  0.02s user 0.20s system 99% cpu 0.216 total
> 
> OK on a tiny system ... but sounds painful on a big
> server. 0.2s for 4G scales up to 3 minutes 25 seconds
> on a 4TB system (4TB systems were being sold two
> years ago ... so by now the high end will have moved
> up to 8TB or perhaps 16TB).
> 
> Would the resulting output be anything but noise on
> a big system (a *lot* of pages can change state in
> 3 minutes)?
> 

Reading the state of all of memory in this fashion would be a somewhat
peculiar thing to do.  Bear in mind that kpagemap and friends are also
designed to allow userspace to inspect the state of a particular
process's memory.

Documentation/vm/pagemap.txt describes it nicely:

: The general procedure for using pagemap to find out about a process' memory
: usage goes like this:
: 
:  1. Read /proc/pid/maps to determine which parts of the memory space are
:     mapped to what.
:  2. Select the maps you are interested in -- all of them, or a particular
:     library, or the stack or the heap, etc.
:  3. Open /proc/pid/pagemap and seek to the pages you would like to examine.
:  4. Read a u64 for each page from pagemap.
:  5. Open /proc/kpagecount and/or /proc/kpageflags.  For each PFN you just
:     read, seek to that entry in the file, and read the data you want.

although I expect that this is not the use case when the feature is
being used to debug/tune readahead.
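
(A bare-bones sketch of steps 3-5 for a single address, for the
record - untested, the address is a placeholder, and it assumes the
bit layout documented in pagemap.txt: bits 0-54 PFN, bit 63 present:)

/* Look up the PFN backing one virtual address of this process,
 * then fetch that page's flags from /proc/kpageflags. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PM_PFN_MASK	((1ULL << 55) - 1)	/* bits 0-54 */
#define PM_PRESENT	(1ULL << 63)

int main(void)
{
	unsigned long vaddr = 0x400000;		/* placeholder address */
	long psize = sysconf(_SC_PAGESIZE);
	uint64_t ent, flags;
	int pm, kf;

	pm = open("/proc/self/pagemap", O_RDONLY);
	pread(pm, &ent, sizeof(ent), vaddr / psize * sizeof(ent));

	if (ent & PM_PRESENT) {
		uint64_t pfn = ent & PM_PFN_MASK;

		kf = open("/proc/kpageflags", O_RDONLY);
		pread(kf, &flags, sizeof(flags), pfn * sizeof(flags));
		printf("pfn %llu flags %#llx\n",
		       (unsigned long long)pfn,
		       (unsigned long long)flags);
		close(kf);
	}
	close(pm);
	return 0;
}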

But yes, if you have huge amounts of memory and you decide to write an
application which inspects the state of every physical page in the
machine, you can expect it to take a long time!

Of course, the VM does also accumulate bulk aggregated page statistics
and presents them in /proc/meminfo, /proc/vmstat and probably other
places.  These numbers are maintained at runtime and the cost of doing
this is significant.

I don't _think_ there are presently any such counters which are
accumulated simply for instrumentation purposes - the kernel needs to
maintain them anyway for various reasons and it's a simple (and useful)
matter to make them available to userspace.


Generally, I think that pagemap is another of those things where we've
failed on the follow-through.  There's a nice and powerful interface
for inspecting the state of a process's VM, but nobody knows about it
and there are no tools for accessing it and nobody is using it.

(Or maybe I'm wrong about that - I expect I'd have bugged Matt about
this and I expect that he'd have done something.  Brain failed).

Either way, I think we'd serve the world better if we were to have some
nice little userspace tools which users could use to access this
information.  Documentation/vm already has a Makefile!

Fengguang, you mention an executable called "page-types".  Perhaps you
could "productise" that sometime?

A model here is Documentation/accounting/getdelays.c - that proved
quite useful and successful in the development of taskstats and I know
that several people are actually using getdelays.c as-is in serious
production environments.  If we hadn't provided and maintained that
code in the kernel tree, it's unlikely that taskstats would have proved
as useful to users.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28  1:09   ` Wu Fengguang
@ 2009-04-28 21:32     ` Andrew Morton
  -1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-28 21:32 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan,
	fengguang.wu, linux-mm

On Tue, 28 Apr 2009 09:09:12 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> +/*
> + * Kernel flags are exported faithfully to Linus and his fellow hackers.
> + * Otherwise some details are masked to avoid confusing the end user:
> + * - some kernel flags are completely invisible
> + * - some kernel flags are conditionally invisible on their odd usages
> + */
> +#ifdef CONFIG_DEBUG_KERNEL
> +static inline int genuine_linus(void) { return 1; }

Although he's a fine chap, the use of the "_linus" tag isn't terribly
clear (to me).  I think what you're saying here is that this enables
kernel-developer-only features, yes?

If so, perhaps we could come up with an identifier which expresses that
more clearly.

But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
for _some_ reason, so what's the point?

It is preferable that we always implement the same interface for all
Kconfig settings.  If this exposes information which is confusing or
not useful to end-users then so be it - we should be able to cover that
in supporting documentation.

Also, as mentioned in the other email, it would be good if we were to
publish a little userspace app which people can use to access this raw
data.  We could give that application an `--i-am-a-kernel-developer'
option!

> +#else
> +static inline int genuine_linus(void) { return 0; }
> +#endif

This isn't an appropriate use of CONFIG_DEBUG_KERNEL.

DEBUG_KERNEL is a Kconfig-only construct which is used to enable _other_
debugging features.  The way you've used it here, if the person who is
configuring the kernel wants to enable any other completely-unrelated
debug feature, they have to enable DEBUG_KERNEL first.  But when they
do that, they unexpectedly alter the behaviour of pagemap!

There are two other places where CONFIG_DEBUG_KERNEL affects code
generation in .c files: arch/parisc/mm/init.c and
arch/powerpc/kernel/sysfs.c.  These are both wrong, and need slapping ;)

> +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)		\
> +	do {								\
> +		if (visible || genuine_linus())				\
> +			uflags |= ((kflags >> kbit) & 1) << ubit;	\
> +	} while (0);

Did this have to be implemented as a macro?

It's bad, because it might or might not reference its argument, so if
someone passes it an expression-with-side-effects, the end result is
unpredictable.  A C function is almost always preferable if possible.
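
For illustration only (a sketch, not code from the patch), the
function form would move the |= to the caller:

/* Return the kbit of kflags repositioned at ubit, or 0 if the
 * flag may not be shown; caller does uflags |= kpf_copy_bit(...). */
static inline u64 kpf_copy_bit(u64 kflags, int visible, int ubit, int kbit)
{
	if (visible || genuine_linus())
		return ((kflags >> kbit) & 1) << ubit;
	return 0;
}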

> +/* a helper function _not_ intended for more general uses */
> +static inline int page_cap_writeback_dirty(struct page *page)
> +{
> +	struct address_space *mapping;
> +
> +	if (!PageSlab(page))
> +		mapping = page_mapping(page);
> +	else
> +		mapping = NULL;
> +
> +	return mapping && mapping_cap_writeback_dirty(mapping);
> +}

If the page isn't locked then page->mapping can be concurrently removed
and freed.  This actually happened to me in real-life testing several
years ago.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 21:17           ` Andrew Morton
@ 2009-04-28 21:49             ` Matt Mackall
  -1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 21:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tony Luck, fengguang.wu, mingo, rostedt, fweisbec, lwoodman,
	a.p.zijlstra, penberg, eduard.munteanu, linux-kernel,
	kosaki.motohiro, andi, adobriyan, linux-mm

On Tue, 2009-04-28 at 14:17 -0700, Andrew Morton wrote:
> On Tue, 28 Apr 2009 11:11:52 -0700
> Tony Luck <tony.luck@gmail.com> wrote:
> 
> > On Tue, Apr 28, 2009 at 1:33 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 1) FAST
> > >
> > > It takes merely 0.2s to scan 4GB pages:
> > >
> > >         ./page-types  0.02s user 0.20s system 99% cpu 0.216 total
> > 
> > OK on a tiny system ... but sounds painful on a big
> > server. 0.2s for 4G scales up to 3 minutes 25 seconds
> > on a 4TB system (4TB systems were being sold two
> > years ago ... so by now the high end will have moved
> > up to 8TB or perhaps 16TB).
> > 
> > Would the resulting output be anything but noise on
> > a big system (a *lot* of pages can change state in
> > 3 minutes)?
> > 
> 
> Reading the state of all of memory in this fashion would be a somewhat
> peculiar thing to do.

Not entirely. If you've got, say, a large NUMA box, it could be
incredibly illustrative to see that "oh, this node is entirely dominated
by SLAB allocations". Or on a smaller machine "oh, this is fragmented to
hell and there's no way I'm going to get a huge page". Things you're not
going to get from individual stats.

> Generally, I think that pagemap is another of those things where we've
> failed on the follow-through.  There's a nice and powerful interface
> for inspecting the state of a process's VM, but nobody knows about it
> and there are no tools for accessing it and nobody is using it.

People keep finding bugs in the thing by exercising it in new ways, so I
presume people are writing their own tools. My hope was that my original
tools would inspire someone to take it and run with it - I really have
no stomach for writing GUI tools.

However, I've recently gone and written a pretty generically useful
command-line tool that hopefully will get more traction:

http://www.selenic.com/smem/

I'm expecting it to get written up on LWN shortly, so I haven't spent
much time doing my own advertising.

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 21:32     ` Andrew Morton
@ 2009-04-28 22:46       ` Matt Mackall
  -1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 22:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm

On Tue, 2009-04-28 at 14:32 -0700, Andrew Morton wrote:

> > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)		\
> > +	do {								\
> > +		if (visible || genuine_linus())				\
> > +			uflags |= ((kflags >> kbit) & 1) << ubit;	\
> > +	} while (0);
> 
> Did this have to be implemented as a macro?

I'm mostly to blame for that. I seem to recall the optimizer doing a
better job on this as a macro.

> It's bad, because it might or might not reference its argument, so if
> someone passes it an expression-with-side-effects, the end result is
> unpredictable.  A C function is almost always preferable if possible.

I don't think there's any use case for it outside of its one user?

> > +/* a helper function _not_ intended for more general uses */
> > +static inline int page_cap_writeback_dirty(struct page *page)
> > +{
> > +	struct address_space *mapping;
> > +
> > +	if (!PageSlab(page))
> > +		mapping = page_mapping(page);
> > +	else
> > +		mapping = NULL;
> > +
> > +	return mapping && mapping_cap_writeback_dirty(mapping);
> > +}
> 
> If the page isn't locked then page->mapping can be concurrently removed
> and freed.  This actually happened to me in real-life testing several
> years ago.

We certainly don't want to be taking locks per page to build the flags
data here. As we don't have any pretense of being atomic, it's ok if we
can find a way to do the test that's inaccurate when a race occurs, so
long as it doesn't dereference null.

But if there's not an obvious way to do that, we should probably just
drop this flag bit for this iteration.

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 22:46       ` Matt Mackall
@ 2009-04-28 23:02         ` Andrew Morton
  -1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-28 23:02 UTC (permalink / raw)
  To: Matt Mackall
  Cc: fengguang.wu, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm

On Tue, 28 Apr 2009 17:46:34 -0500
Matt Mackall <mpm@selenic.com> wrote:

> > > +/* a helper function _not_ intended for more general uses */
> > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > +{
> > > +	struct address_space *mapping;
> > > +
> > > +	if (!PageSlab(page))
> > > +		mapping = page_mapping(page);
> > > +	else
> > > +		mapping = NULL;
> > > +
> > > +	return mapping && mapping_cap_writeback_dirty(mapping);
> > > +}
> > 
> > If the page isn't locked then page->mapping can be concurrently removed
> > and freed.  This actually happened to me in real-life testing several
> > years ago.
> 
> We certainly don't want to be taking locks per page to build the flags
> data here. As we don't have any pretense of being atomic, it's ok if we
> can find a way to do the test that's inaccurate when a race occurs, so
> long as it doesn't dereference null.
> 
> But if there's not an obvious way to do that, we should probably just
> drop this flag bit for this iteration.

trylock_page() could be used here, perhaps.

Then again, why _not_ just do lock_page()?  After all, few pages are
ever locked.  There will be latency if the caller stumbles across a
page which is under read I/O, but so be it?
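
Illustratively (a sketch, not code from the patch), the trylock
variant would just report 0 whenever the page lock is contended:

/* Look up the mapping under the page lock, but never wait for it -
 * a contended page is reported as "can't tell" (0). */
static int page_cap_writeback_dirty(struct page *page)
{
	struct address_space *mapping;
	int ret;

	if (PageSlab(page) || !trylock_page(page))
		return 0;

	mapping = page_mapping(page);
	ret = mapping && mapping_cap_writeback_dirty(mapping);
	unlock_page(page);

	return ret;
}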


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 23:02         ` Andrew Morton
@ 2009-04-28 23:31           ` Matt Mackall
  -1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: fengguang.wu, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm

On Tue, 2009-04-28 at 16:02 -0700, Andrew Morton wrote:
> On Tue, 28 Apr 2009 17:46:34 -0500
> Matt Mackall <mpm@selenic.com> wrote:
> 
> > > > +/* a helper function _not_ intended for more general uses */
> > > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > > +{
> > > > +	struct address_space *mapping;
> > > > +
> > > > +	if (!PageSlab(page))
> > > > +		mapping = page_mapping(page);
> > > > +	else
> > > > +		mapping = NULL;
> > > > +
> > > > +	return mapping && mapping_cap_writeback_dirty(mapping);
> > > > +}
> > > 
> > > If the page isn't locked then page->mapping can be concurrently removed
> > > and freed.  This actually happened to me in real-life testing several
> > > years ago.
> > 
> > We certainly don't want to be taking locks per page to build the flags
> > data here. As we don't have any pretense of being atomic, it's ok if we
> > can find a way to do the test that's inaccurate when a race occurs, so
> > long as it doesn't dereference null.
> > 
> > But if there's not an obvious way to do that, we should probably just
> > drop this flag bit for this iteration.
> 
> trylock_page() could be used here, perhaps.
> 
> Then again, why _not_ just do lock_page()?  After all, few pages are
> ever locked.  There will be latency if the caller stumbles across a
> page which is under read I/O, but so be it?

As I mentioned just a bit ago, it's really not an unreasonable use case
to want to do this on every page in the system back to back. So per page
overhead matters. And the odds of stalling on a locked page when
visiting 1M pages while under load are probably not negligible.

Our lock primitives are pretty low overhead in the fast path, but every
cycle counts. The new tests and branches this code already adds are a
bit worrisome, but on balance probably worth it.

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 23:31           ` Matt Mackall
@ 2009-04-28 23:42             ` Andrew Morton
  -1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-28 23:42 UTC (permalink / raw)
  To: Matt Mackall
  Cc: fengguang.wu, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm

On Tue, 28 Apr 2009 18:31:09 -0500
Matt Mackall <mpm@selenic.com> wrote:

> On Tue, 2009-04-28 at 16:02 -0700, Andrew Morton wrote:
> > On Tue, 28 Apr 2009 17:46:34 -0500
> > Matt Mackall <mpm@selenic.com> wrote:
> > 
> > > > > +/* a helper function _not_ intended for more general uses */
> > > > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > > > +{
> > > > > +	struct address_space *mapping;
> > > > > +
> > > > > +	if (!PageSlab(page))
> > > > > +		mapping = page_mapping(page);
> > > > > +	else
> > > > > +		mapping = NULL;
> > > > > +
> > > > > +	return mapping && mapping_cap_writeback_dirty(mapping);
> > > > > +}
> > > > 
> > > > If the page isn't locked then page->mapping can be concurrently removed
> > > > and freed.  This actually happened to me in real-life testing several
> > > > years ago.
> > > 
> > > We certainly don't want to be taking locks per page to build the flags
> > > data here. As we don't have any pretense of being atomic, it's ok if we
> > > can find a way to do the test that's inaccurate when a race occurs, so
> > > long as it doesn't dereference null.
> > > 
> > > But if there's not an obvious way to do that, we should probably just
> > > drop this flag bit for this iteration.
> > 
> > trylock_page() could be used here, perhaps.
> > 
> > Then again, why _not_ just do lock_page()?  After all, few pages are
> > ever locked.  There will be latency if the caller stumbles across a
> > page which is under read I/O, but so be it?
> 
> As I mentioned just a bit ago, it's really not an unreasonable use case
> to want to do this on every page in the system back to back. So per page
> overhead matters. And the odds of stalling on a locked page when
> visiting 1M pages while under load are probably not negligible.

The chances of stalling on a locked page are pretty good, and the
duration of the stall might be long indeed.  Perhaps a trylock is a
decent compromise - it depends on the value of this metric, and I've
forgotten what we're talking about ;)

umm, seems that this flag is needed to enable PG_error, PG_dirty,
PG_uptodate and PG_writeback reporting.  So simply removing this code
would put a huge hole in the patchset, no?

> Our lock primitives are pretty low overhead in the fast path, but every
> cycle counts. The new tests and branches this code already adds are a
> bit worrisome, but on balance probably worth it.

That should be easy to quantify (hint).

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 23:42             ` Andrew Morton
@ 2009-04-28 23:55               ` Matt Mackall
  -1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 23:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: fengguang.wu, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm

On Tue, 2009-04-28 at 16:42 -0700, Andrew Morton wrote:
> On Tue, 28 Apr 2009 18:31:09 -0500
> Matt Mackall <mpm@selenic.com> wrote:
> 
> > On Tue, 2009-04-28 at 16:02 -0700, Andrew Morton wrote:
> > > On Tue, 28 Apr 2009 17:46:34 -0500
> > > Matt Mackall <mpm@selenic.com> wrote:
> > > 
> > > > > > +/* a helper function _not_ intended for more general uses */
> > > > > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > > > > +{
> > > > > > +	struct address_space *mapping;
> > > > > > +
> > > > > > +	if (!PageSlab(page))
> > > > > > +		mapping = page_mapping(page);
> > > > > > +	else
> > > > > > +		mapping = NULL;
> > > > > > +
> > > > > > +	return mapping && mapping_cap_writeback_dirty(mapping);
> > > > > > +}
> > > > > 
> > > > > If the page isn't locked then page->mapping can be concurrently removed
> > > > > and freed.  This actually happened to me in real-life testing several
> > > > > years ago.
> > > > 
> > > > We certainly don't want to be taking locks per page to build the flags
> > > > data here. As we don't have any pretense of being atomic, it's ok if we
> > > > can find a way to do the test that's inaccurate when a race occurs, so
> > > > long as it doesn't dereference null.
> > > > 
> > > > But if there's not an obvious way to do that, we should probably just
> > > > drop this flag bit for this iteration.
> > > 
> > > trylock_page() could be used here, perhaps.
> > > 
> > > Then again, why _not_ just do lock_page()?  After all, few pages are
> > > ever locked.  There will be latency if the caller stumbles across a
> > > page which is under read I/O, but so be it?
> > 
> > As I mentioned just a bit ago, it's really not an unreasonable use case
> > to want to do this on every page in the system back to back. So per page
> > overhead matters. And the odds of stalling on a locked page when
> > visiting 1M pages while under load are probably not negligible.
> 
> The chances of stalling on a locked page are pretty good, and the
> duration of the stall might be long indeed.  Perhaps a trylock is a
> decent compromise - it depends on the value of this metric, and I've
> forgotten what we're talking about ;)
> 
> umm, seems that this flag is needed to enable PG_error, PG_dirty,
> PG_uptodate and PG_writeback reporting.  So simply removing this code
> would put a huge hole in the patchset, no?

We can report those bits anyway. But this patchset does something
clever: it filters irrelevant (and possibly overloaded) bits in various
contexts. 

> > Our lock primitives are pretty low overhead in the fast path, but every
> > cycle counts. The new tests and branches this code already adds are a
> > bit worrisome, but on balance probably worth it.
> 
> That should be easy to quantify (hint).

I'll let Fengguang address both these points.

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 21:49             ` Matt Mackall
@ 2009-04-29  0:02               ` Robin Holt
  -1 siblings, 0 replies; 137+ messages in thread
From: Robin Holt @ 2009-04-29  0:02 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andrew Morton, Tony Luck, fengguang.wu, mingo, rostedt, fweisbec,
	lwoodman, a.p.zijlstra, penberg, eduard.munteanu, linux-kernel,
	kosaki.motohiro, andi, adobriyan, linux-mm

On Tue, Apr 28, 2009 at 04:49:55PM -0500, Matt Mackall wrote:
> > Reading the state of all of memory in this fashion would be a somewhat
> > peculiar thing to do.
> 
> Not entirely. If you've got, say, a large NUMA box, it could be
> incredibly illustrative to see that "oh, this node is entirely dominated
> by SLAB allocations". Or on a smaller machine "oh, this is fragmented to
> hell and there's no way I'm going to get a huge page". Things you're not
> going to get from individual stats.

I have, in the past, simply used grep on
/sys/devices/system/node/node*/meminfo and gotten the individual stats
I was concerned about.  Not sure how much more detail would have been
needed or useful.  I don't think I can recall a time where I needed to
write another tool.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 21:32     ` Andrew Morton
@ 2009-04-29  2:38       ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29  2:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm,
	Stephen Rothwell, Chandra Seetharaman, Nathan Lynch,
	Olof Johansson, Helge Deller

On Wed, Apr 29, 2009 at 05:32:44AM +0800, Andrew Morton wrote:
> On Tue, 28 Apr 2009 09:09:12 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > +/*
> > + * Kernel flags are exported faithfully to Linus and his fellow hackers.
> > + * Otherwise some details are masked to avoid confusing the end user:
> > + * - some kernel flags are completely invisible
> > + * - some kernel flags are conditionally invisible on their odd usages
> > + */
> > +#ifdef CONFIG_DEBUG_KERNEL
> > +static inline int genuine_linus(void) { return 1; }
> 
> Although he's a fine chap, the use of the "_linus" tag isn't terribly
> clear (to me).  I think what you're saying here is that this enables
> kernel-developer-only features, yes?

Yes.

> If so, perhaps we could come up with an identifier which expresses that
> more clearly.
> 
> But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
> for _some_ reason, so what's the point?

Good point! I can confirm my Debian kernel has CONFIG_DEBUG_KERNEL=y!

> It is preferable that we always implement the same interface for all
> Kconfig settings.  If this exposes information which is confusing or
> not useful to end-users then so be it - we should be able to cover that
> in supporting documentation.

My original patch takes that straightforward approach - and I still like it.
I would be very glad to move the filtering code from kernel to user space.

The use of more obscure flags could be discouraged by _not_ documenting
them. A really curious user is encouraged to refer to the code for the
exact meaning (and perhaps become a kernel developer ;-)

> Also, as mentioned in the other email, it would be good if we were to
> publish a little userspace app which people can use to access this raw
> data.  We could give that application an `--i-am-a-kernel-developer'
> option!

OK. I'll include page-types.c in the next take.

> > +#else
> > +static inline int genuine_linus(void) { return 0; }
> > +#endif
> 
> This isn't an appropriate use of CONFIG_DEBUG_KERNEL.
> 
> DEBUG_KERNEL is a Kconfig-only construct which is used to enable _other_
> debugging features.  The way you've used it here, if the person who is
> configuring the kernel wants to enable any other completely-unrelated
> debug feature, they have to enable DEBUG_KERNEL first.  But when they
> do that, they unexpectedly alter the behaviour of pagemap!
> 
> There are two other places where CONFIG_DEBUG_KERNEL affects code
> generation in .c files: arch/parisc/mm/init.c and
> arch/powerpc/kernel/sysfs.c.  These are both wrong, and need slapping ;)

(add cc to related maintainers)

CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means 

        #ifdef CONFIG_DEBUG_KERNEL == #if 1

as the following patch demonstrates. Now it becomes obviously silly.

diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index 4356ceb..59fb910 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -368,19 +368,19 @@ static void __init setup_bootmem(void)
 	request_resource(&sysram_resources[0], &pdcdata_resource);
 }
 
 void free_initmem(void)
 {
 	unsigned long addr, init_begin, init_end;
 
 	printk(KERN_INFO "Freeing unused kernel memory: ");
 
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
 	/* Attempt to catch anyone trying to execute code here
 	 * by filling the page with BRK insns.
 	 * 
 	 * If we disable interrupts for all CPUs, then IPI stops working.
 	 * Kinda breaks the global cache flushing.
 	 */
 	local_irq_disable();
 
 	memset(__init_begin, 0x00,
@@ -519,19 +519,19 @@ void __init mem_init(void)
 	printk(KERN_INFO "Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data, %dk init)\n",
 		(unsigned long)nr_free_pages() << (PAGE_SHIFT-10),
 		num_physpages << (PAGE_SHIFT-10),
 		codesize >> 10,
 		reservedpages << (PAGE_SHIFT-10),
 		datasize >> 10,
 		initsize >> 10
 	);
 
-#ifdef CONFIG_DEBUG_KERNEL /* double-sanity-check paranoia */
+#if 1 /* double-sanity-check paranoia */
 	printk("virtual kernel memory layout:\n"
 	       "    vmalloc : 0x%p - 0x%p   (%4ld MB)\n"
 	       "    memory  : 0x%p - 0x%p   (%4ld MB)\n"
 	       "      .init : 0x%p - 0x%p   (%4ld kB)\n"
 	       "      .data : 0x%p - 0x%p   (%4ld kB)\n"
 	       "      .text : 0x%p - 0x%p   (%4ld kB)\n",
 
 	       (void*)VMALLOC_START, (void*)VMALLOC_END,
 	       (VMALLOC_END - VMALLOC_START) >> 20,
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index f41aec8..0d54c6b 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -212,19 +212,19 @@ static SYSDEV_ATTR(purr, 0600, show_purr, store_purr);
 #endif /* CONFIG_PPC64 */
 
 #ifdef HAS_PPC_PMC_PA6T
 SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
 SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
 SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
 SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
 SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
 SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
 SYSFS_PMCSETUP(hid0, SPRN_HID0);
 SYSFS_PMCSETUP(hid1, SPRN_HID1);
 SYSFS_PMCSETUP(hid4, SPRN_HID4);
 SYSFS_PMCSETUP(hid5, SPRN_HID5);
 SYSFS_PMCSETUP(ima0, SPRN_PA6T_IMA0);
 SYSFS_PMCSETUP(ima1, SPRN_PA6T_IMA1);
 SYSFS_PMCSETUP(ima2, SPRN_PA6T_IMA2);
 SYSFS_PMCSETUP(ima3, SPRN_PA6T_IMA3);
 SYSFS_PMCSETUP(ima4, SPRN_PA6T_IMA4);
@@ -282,19 +282,19 @@ static struct sysdev_attribute classic_pmc_attrs[] = {
 static struct sysdev_attribute pa6t_attrs[] = {
 	_SYSDEV_ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
 	_SYSDEV_ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
 	_SYSDEV_ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
 	_SYSDEV_ATTR(pmc1, 0600, show_pa6t_pmc1, store_pa6t_pmc1),
 	_SYSDEV_ATTR(pmc2, 0600, show_pa6t_pmc2, store_pa6t_pmc2),
 	_SYSDEV_ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
 	_SYSDEV_ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
 	_SYSDEV_ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
 	_SYSDEV_ATTR(hid0, 0600, show_hid0, store_hid0),
 	_SYSDEV_ATTR(hid1, 0600, show_hid1, store_hid1),
 	_SYSDEV_ATTR(hid4, 0600, show_hid4, store_hid4),
 	_SYSDEV_ATTR(hid5, 0600, show_hid5, store_hid5),
 	_SYSDEV_ATTR(ima0, 0600, show_ima0, store_ima0),
 	_SYSDEV_ATTR(ima1, 0600, show_ima1, store_ima1),
 	_SYSDEV_ATTR(ima2, 0600, show_ima2, store_ima2),
 	_SYSDEV_ATTR(ima3, 0600, show_ima3, store_ima3),
 	_SYSDEV_ATTR(ima4, 0600, show_ima4, store_ima4),

> > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)		\
> > +	do {								\
> > +		if (visible || genuine_linus())				\
> > +			uflags |= ((kflags >> kbit) & 1) << ubit;	\
> > +	} while (0);
> 
> Did this have to be implemented as a macro?
> 
> It's bad, because it might or might not reference its argument, so if
> someone passes it an expression-with-side-effects, the end result is
> unpredictable.  A C function is almost always preferable if possible.

Just tried inline function, the code size is increased slightly:

          text   data    bss     dec    hex   filename
macro     1804    128      0    1932    78c   fs/proc/page.o
inline    1828    128      0    1956    7a4   fs/proc/page.o

So I'll keep the macro, but add brackets to make it a bit safer.

Thanks,
Fengguang


^ permalink raw reply related	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-29  2:38       ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29  2:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm,
	Stephen Rothwell, Chandra Seetharaman, Nathan Lynch,
	Olof Johansson, Helge Deller

On Wed, Apr 29, 2009 at 05:32:44AM +0800, Andrew Morton wrote:
> On Tue, 28 Apr 2009 09:09:12 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > +/*
> > + * Kernel flags are exported faithfully to Linus and his fellow hackers.
> > + * Otherwise some details are masked to avoid confusing the end user:
> > + * - some kernel flags are completely invisible
> > + * - some kernel flags are conditionally invisible on their odd usages
> > + */
> > +#ifdef CONFIG_DEBUG_KERNEL
> > +static inline int genuine_linus(void) { return 1; }
> 
> Although he's a fine chap, the use of the "_linus" tag isn't terribly
> clear (to me).  I think what you're saying here is that this enables
> kernel-developer-only features, yes?

Yes.

> If so, perhaps we could come up with an identifier which expresses that
> more clearly.
> 
> But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
> for _some_ reason, so what's the point?

Good point! I can confirm my debian has CONFIG_DEBUG_KERNEL=Y!

> It is preferable that we always implement the same interface for all
> Kconfig settings.  If this exposes information which is confusing or
> not useful to end-users then so be it - we should be able to cover that
> in supporting documentation.

My original patch takes that straightforward manner - and I still like it.
I would be very glad to move the filtering code from kernel to user space.

The use of more obscure flags could be discouraged by _not_ documenting
them. A really curious user is encouraged to refer to the code for the
exact meaning (and perhaps become a kernel developer ;-)

> Also, as mentioned in the other email, it would be good if we were to
> publish a little userspace app which people can use to access this raw
> data.  We could give that application an `--i-am-a-kernel-developer'
> option!

OK. I'll include page-types.c in the next take.

> > +#else
> > +static inline int genuine_linus(void) { return 0; }
> > +#endif
> 
> This isn't an appropriate use of CONFIG_DEBUG_KERNEL.
> 
> DEBUG_KERNEL is a Kconfig-only construct which is use to enable _other_
> debugging features.  The way you've used it here, if the person who is
> configuring the kernel wants to enable any other completely-unrelated
> debug feature, they have to enable DEBUG_KERNEL first.  But when they
> do that, they unexpectedly alter the behaviour of pagemap!
> 
> There are two other places where CONFIG_DEBUG_KERNEL affects code
> generation in .c files: arch/parisc/mm/init.c and
> arch/powerpc/kernel/sysfs.c.  These are both wrong, and need slapping ;)

(add cc to related maintainers)

CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means 

        #ifdef CONFIG_DEBUG_KERNEL == #if 1

as the following patch demos. Now it becomes obviously silly.

diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index 4356ceb..59fb910 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -368,19 +368,19 @@ static void __init setup_bootmem(void)
 	request_resource(&sysram_resources[0], &pdcdata_resource);
 }
 
 void free_initmem(void)
 {
 	unsigned long addr, init_begin, init_end;
 
 	printk(KERN_INFO "Freeing unused kernel memory: ");
 
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
 	/* Attempt to catch anyone trying to execute code here
 	 * by filling the page with BRK insns.
 	 * 
 	 * If we disable interrupts for all CPUs, then IPI stops working.
 	 * Kinda breaks the global cache flushing.
 	 */
 	local_irq_disable();
 
 	memset(__init_begin, 0x00,
@@ -519,19 +519,19 @@ void __init mem_init(void)
 	printk(KERN_INFO "Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data, %dk init)\n",
 		(unsigned long)nr_free_pages() << (PAGE_SHIFT-10),
 		num_physpages << (PAGE_SHIFT-10),
 		codesize >> 10,
 		reservedpages << (PAGE_SHIFT-10),
 		datasize >> 10,
 		initsize >> 10
 	);
 
-#ifdef CONFIG_DEBUG_KERNEL /* double-sanity-check paranoia */
+#if 1 /* double-sanity-check paranoia */
 	printk("virtual kernel memory layout:\n"
 	       "    vmalloc : 0x%p - 0x%p   (%4ld MB)\n"
 	       "    memory  : 0x%p - 0x%p   (%4ld MB)\n"
 	       "      .init : 0x%p - 0x%p   (%4ld kB)\n"
 	       "      .data : 0x%p - 0x%p   (%4ld kB)\n"
 	       "      .text : 0x%p - 0x%p   (%4ld kB)\n",
 
 	       (void*)VMALLOC_START, (void*)VMALLOC_END,
 	       (VMALLOC_END - VMALLOC_START) >> 20,
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index f41aec8..0d54c6b 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -212,19 +212,19 @@ static SYSDEV_ATTR(purr, 0600, show_purr, store_purr);
 #endif /* CONFIG_PPC64 */
 
 #ifdef HAS_PPC_PMC_PA6T
 SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
 SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
 SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
 SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
 SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
 SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
 SYSFS_PMCSETUP(hid0, SPRN_HID0);
 SYSFS_PMCSETUP(hid1, SPRN_HID1);
 SYSFS_PMCSETUP(hid4, SPRN_HID4);
 SYSFS_PMCSETUP(hid5, SPRN_HID5);
 SYSFS_PMCSETUP(ima0, SPRN_PA6T_IMA0);
 SYSFS_PMCSETUP(ima1, SPRN_PA6T_IMA1);
 SYSFS_PMCSETUP(ima2, SPRN_PA6T_IMA2);
 SYSFS_PMCSETUP(ima3, SPRN_PA6T_IMA3);
 SYSFS_PMCSETUP(ima4, SPRN_PA6T_IMA4);
@@ -282,19 +282,19 @@ static struct sysdev_attribute classic_pmc_attrs[] = {
 static struct sysdev_attribute pa6t_attrs[] = {
 	_SYSDEV_ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
 	_SYSDEV_ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
 	_SYSDEV_ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
 	_SYSDEV_ATTR(pmc1, 0600, show_pa6t_pmc1, store_pa6t_pmc1),
 	_SYSDEV_ATTR(pmc2, 0600, show_pa6t_pmc2, store_pa6t_pmc2),
 	_SYSDEV_ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
 	_SYSDEV_ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
 	_SYSDEV_ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
 	_SYSDEV_ATTR(hid0, 0600, show_hid0, store_hid0),
 	_SYSDEV_ATTR(hid1, 0600, show_hid1, store_hid1),
 	_SYSDEV_ATTR(hid4, 0600, show_hid4, store_hid4),
 	_SYSDEV_ATTR(hid5, 0600, show_hid5, store_hid5),
 	_SYSDEV_ATTR(ima0, 0600, show_ima0, store_ima0),
 	_SYSDEV_ATTR(ima1, 0600, show_ima1, store_ima1),
 	_SYSDEV_ATTR(ima2, 0600, show_ima2, store_ima2),
 	_SYSDEV_ATTR(ima3, 0600, show_ima3, store_ima3),
 	_SYSDEV_ATTR(ima4, 0600, show_ima4, store_ima4),

> > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)		\
> > +	do {								\
> > +		if (visible || genuine_linus())				\
> > +			uflags |= ((kflags >> kbit) & 1) << ubit;	\
> > +	} while (0);
> 
> Did this have to be implemented as a macro?
> 
> It's bad, because it might or might not reference its argument, so if
> someone passes it an expression-with-side-effects, the end result is
> unpredictable.  A C function is almost always preferable if possible.

Just tried inline function, the code size is increased slightly:

          text   data    bss     dec    hex   filename
macro     1804    128      0    1932    78c   fs/proc/page.o
inline    1828    128      0    1956    7a4   fs/proc/page.o

So I'll keep the macro, but add brackets to make it a bit safer.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-29  2:38       ` Wu Fengguang
@ 2009-04-29  2:55         ` Andrew Morton
  -1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-29  2:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm,
	Stephen Rothwell, Chandra Seetharaman, Nathan Lynch,
	Olof Johansson, Helge Deller

On Wed, 29 Apr 2009 10:38:42 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> > > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)		\
> > > +	do {								\
> > > +		if (visible || genuine_linus())				\
> > > +			uflags |= ((kflags >> kbit) & 1) << ubit;	\
> > > +	} while (0);
> > 
> > Did this have to be implemented as a macro?
> > 
> > It's bad, because it might or might not reference its argument, so if
> > someone passes it an expression-with-side-effects, the end result is
> > unpredictable.  A C function is almost always preferable if possible.
> 
> Just tried inline function, the code size is increased slightly:
> 
>           text   data    bss     dec    hex   filename
> macro     1804    128      0    1932    78c   fs/proc/page.o
> inline    1828    128      0    1956    7a4   fs/proc/page.o
> 

hm, I wonder why.  Maybe it fixed a bug ;)

The code is effectively doing

	if (expr1)
		something();
	if (expr1)
		something_else();
	if (expr1)
		something_else2();

etc.  Obviously we _hope_ that the compiler turns that into

	if (expr1) {
		something();
		something_else();
		something_else2();
	}

for us, but it would be good to check...

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-29  2:55         ` Andrew Morton
  0 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-29  2:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm,
	Stephen Rothwell, Chandra Seetharaman, Nathan Lynch,
	Olof Johansson, Helge Deller

On Wed, 29 Apr 2009 10:38:42 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> > > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)		\
> > > +	do {								\
> > > +		if (visible || genuine_linus())				\
> > > +			uflags |= ((kflags >> kbit) & 1) << ubit;	\
> > > +	} while (0);
> > 
> > Did this have to be implemented as a macro?
> > 
> > It's bad, because it might or might not reference its argument, so if
> > someone passes it an expression-with-side-effects, the end result is
> > unpredictable.  A C function is almost always preferable if possible.
> 
> Just tried inline function, the code size is increased slightly:
> 
>           text   data    bss     dec    hex   filename
> macro     1804    128      0    1932    78c   fs/proc/page.o
> inline    1828    128      0    1956    7a4   fs/proc/page.o
> 

hm, I wonder why.  Maybe it fixed a bug ;)

The code is effectively doing

	if (expr1)
		something();
	if (expr1)
		something_else();
	if (expr1)
		something_else2();

etc.  Obviously we _hope_ that the compiler turns that into

	if (expr1) {
		something();
		something_else();
		something_else2();
	}

for us, but it would be good to check...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 23:55               ` Matt Mackall
@ 2009-04-29  3:33                 ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29  3:33 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andrew Morton, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm

On Wed, Apr 29, 2009 at 07:55:10AM +0800, Matt Mackall wrote:
> On Tue, 2009-04-28 at 16:42 -0700, Andrew Morton wrote:
> > On Tue, 28 Apr 2009 18:31:09 -0500
> > Matt Mackall <mpm@selenic.com> wrote:
> > 
> > > On Tue, 2009-04-28 at 16:02 -0700, Andrew Morton wrote:
> > > > On Tue, 28 Apr 2009 17:46:34 -0500
> > > > Matt Mackall <mpm@selenic.com> wrote:
> > > > 
> > > > > > > +/* a helper function _not_ intended for more general uses */
> > > > > > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > > > > > +{
> > > > > > > +	struct address_space *mapping;
> > > > > > > +
> > > > > > > +	if (!PageSlab(page))
> > > > > > > +		mapping = page_mapping(page);
> > > > > > > +	else
> > > > > > > +		mapping = NULL;
> > > > > > > +
> > > > > > > +	return mapping && mapping_cap_writeback_dirty(mapping);
> > > > > > > +}
> > > > > > 
> > > > > > If the page isn't locked then page->mapping can be concurrently removed
> > > > > > and freed.  This actually happened to me in real-life testing several
> > > > > > years ago.
> > > > > 
> > > > > We certainly don't want to be taking locks per page to build the flags
> > > > > data here. As we don't have any pretense of being atomic, it's ok if we
> > > > > can find a way to do the test that's inaccurate when a race occurs, so
> > > > > long as it doesn't dereference null.
> > > > > 
> > > > > But if there's not an obvious way to do that, we should probably just
> > > > > drop this flag bit for this iteration.
> > > > 
> > > > trylock_page() could be used here, perhaps.
> > > > 
> > > > Then again, why _not_ just do lock_page()?  After all, few pages are
> > > > ever locked.  There will be latency if the caller stumbles across a
> > > > page which is under read I/O, but so be it?
> > > 
> > > As I mentioned just a bit ago, it's really not an unreasonable use case
> > > to want to do this on every page in the system back to back. So per page
> > > overhead matters. And the odds of stalling on a locked page when
> > > visiting 1M pages while under load are probably not negligible.
> > 
> > The chances of stalling on a locked page are pretty good, and the
> > duration of the stall might be long indeed.  Perhaps a trylock is a
> > decent compromise - it depends on the value of this metric, and I've
> > forgotten what we're talking about ;)
> > 
> > umm, seems that this flag is needed to enable PG_error, PG_dirty,
> > PG_uptodate and PG_writeback reporting.  So simply removing this code
> > would put a huge hole in the patchset, no?
> 
> We can report those bits anyway. But this patchset does something
> clever: it filters irrelevant (and possibly overloaded) bits in various
> contexts. 
> 
> > > Our lock primitives are pretty low overhead in the fast path, but every
> > > cycle counts. The new tests and branches this code already adds are a
> > > bit worrisome, but on balance probably worth it.
> > 
> > That should be easy to quantify (hint).
> 
> I'll let Fengguang address both these points.

A quick micro bench: 100 runs on another T7300@2GHz 2GB laptop:

             user      system       total
no lock      0.270     22.850       23.607 
trylock      0.310     25.890       26.484 
                       +13.3%       +12.2%

But anyway, the plan is to move filtering to user space and eliminate
the complex kernel logics.

The IO filtering is no longer possible in user space, but I didn't see
the error/dirty/writeback bits on this testing system. So I guess it
won't be a big loss.

The huge/gigantic page filtering is also not possible in user space.
So I tend to add a KPF_HUGE flag to distinguish (hardware supported)
huge pages from normal (software) compound pages. Any objections?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-29  3:33                 ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29  3:33 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andrew Morton, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm

On Wed, Apr 29, 2009 at 07:55:10AM +0800, Matt Mackall wrote:
> On Tue, 2009-04-28 at 16:42 -0700, Andrew Morton wrote:
> > On Tue, 28 Apr 2009 18:31:09 -0500
> > Matt Mackall <mpm@selenic.com> wrote:
> > 
> > > On Tue, 2009-04-28 at 16:02 -0700, Andrew Morton wrote:
> > > > On Tue, 28 Apr 2009 17:46:34 -0500
> > > > Matt Mackall <mpm@selenic.com> wrote:
> > > > 
> > > > > > > +/* a helper function _not_ intended for more general uses */
> > > > > > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > > > > > +{
> > > > > > > +	struct address_space *mapping;
> > > > > > > +
> > > > > > > +	if (!PageSlab(page))
> > > > > > > +		mapping = page_mapping(page);
> > > > > > > +	else
> > > > > > > +		mapping = NULL;
> > > > > > > +
> > > > > > > +	return mapping && mapping_cap_writeback_dirty(mapping);
> > > > > > > +}
> > > > > > 
> > > > > > If the page isn't locked then page->mapping can be concurrently removed
> > > > > > and freed.  This actually happened to me in real-life testing several
> > > > > > years ago.
> > > > > 
> > > > > We certainly don't want to be taking locks per page to build the flags
> > > > > data here. As we don't have any pretense of being atomic, it's ok if we
> > > > > can find a way to do the test that's inaccurate when a race occurs, so
> > > > > long as it doesn't dereference null.
> > > > > 
> > > > > But if there's not an obvious way to do that, we should probably just
> > > > > drop this flag bit for this iteration.
> > > > 
> > > > trylock_page() could be used here, perhaps.
> > > > 
> > > > Then again, why _not_ just do lock_page()?  After all, few pages are
> > > > ever locked.  There will be latency if the caller stumbles across a
> > > > page which is under read I/O, but so be it?
> > > 
> > > As I mentioned just a bit ago, it's really not an unreasonable use case
> > > to want to do this on every page in the system back to back. So per page
> > > overhead matters. And the odds of stalling on a locked page when
> > > visiting 1M pages while under load are probably not negligible.
> > 
> > The chances of stalling on a locked page are pretty good, and the
> > duration of the stall might be long indeed.  Perhaps a trylock is a
> > decent compromise - it depends on the value of this metric, and I've
> > forgotten what we're talking about ;)
> > 
> > umm, seems that this flag is needed to enable PG_error, PG_dirty,
> > PG_uptodate and PG_writeback reporting.  So simply removing this code
> > would put a huge hole in the patchset, no?
> 
> We can report those bits anyway. But this patchset does something
> clever: it filters irrelevant (and possibly overloaded) bits in various
> contexts. 
> 
> > > Our lock primitives are pretty low overhead in the fast path, but every
> > > cycle counts. The new tests and branches this code already adds are a
> > > bit worrisome, but on balance probably worth it.
> > 
> > That should be easy to quantify (hint).
> 
> I'll let Fengguang address both these points.

A quick micro bench: 100 runs on another T7300@2GHz 2GB laptop:

             user      system       total
no lock      0.270     22.850       23.607 
trylock      0.310     25.890       26.484 
                       +13.3%       +12.2%

But anyway, the plan is to move filtering to user space and eliminate
the complex kernel logics.

The IO filtering is no longer possible in user space, but I didn't see
the error/dirty/writeback bits on this testing system. So I guess it
won't be a big loss.

The huge/gigantic page filtering is also not possible in user space.
So I tend to add a KPF_HUGE flag to distinguish (hardware supported)
huge pages from normal (software) compound pages. Any objections?

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-29  2:55         ` Andrew Morton
@ 2009-04-29  3:48           ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29  3:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm,
	Stephen Rothwell, Chandra Seetharaman, Nathan Lynch,
	Olof Johansson, Helge Deller

On Wed, Apr 29, 2009 at 10:55:27AM +0800, Andrew Morton wrote:
> On Wed, 29 Apr 2009 10:38:42 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)		\
> > > > +	do {								\
> > > > +		if (visible || genuine_linus())				\
> > > > +			uflags |= ((kflags >> kbit) & 1) << ubit;	\
> > > > +	} while (0);
> > > 
> > > Did this have to be implemented as a macro?
> > > 
> > > It's bad, because it might or might not reference its argument, so if
> > > someone passes it an expression-with-side-effects, the end result is
> > > unpredictable.  A C function is almost always preferable if possible.
> > 
> > Just tried inline function, the code size is increased slightly:
> > 
> >           text   data    bss     dec    hex   filename
> > macro     1804    128      0    1932    78c   fs/proc/page.o
> > inline    1828    128      0    1956    7a4   fs/proc/page.o
> > 
> 
> hm, I wonder why.  Maybe it fixed a bug ;)
> 
> The code is effectively doing
> 
> 	if (expr1)
> 		something();
> 	if (expr1)
> 		something_else();
> 	if (expr1)
> 		something_else2();
> 
> etc.  Obviously we _hope_ that the compiler turns that into
> 
> 	if (expr1) {
> 		something();
> 		something_else();
> 		something_else2();
> 	}
> 
> for us, but it would be good to check...

By 'expr1', you mean (visible || genuine_linus())?

No, I can confirm the inefficiency does not lie here.

I simplified the kpf_copy_bit() to

        #define kpf_copy_bit(uflags, kflags, ubit, kbit)                     \
                        uflags |= (((kflags) >> (kbit)) & 1) << (ubit);

or

        static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
        {       
                return (((kflags) >> (kbit)) & 1) << (ubit);
        }

and double checked the differences: the gap grows unexpectedly!

              text               data                bss                dec            hex filename
macro         1829                168                  0               1997            7cd fs/proc/page.o
inline        1893                168                  0               2061            80d fs/proc/page.o
              +3.5%

(note: the larger absolute text size is due to some experimental code elsewhere.)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-29  3:48           ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29  3:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm,
	Stephen Rothwell, Chandra Seetharaman, Nathan Lynch,
	Olof Johansson, Helge Deller

On Wed, Apr 29, 2009 at 10:55:27AM +0800, Andrew Morton wrote:
> On Wed, 29 Apr 2009 10:38:42 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)		\
> > > > +	do {								\
> > > > +		if (visible || genuine_linus())				\
> > > > +			uflags |= ((kflags >> kbit) & 1) << ubit;	\
> > > > +	} while (0);
> > > 
> > > Did this have to be implemented as a macro?
> > > 
> > > It's bad, because it might or might not reference its argument, so if
> > > someone passes it an expression-with-side-effects, the end result is
> > > unpredictable.  A C function is almost always preferable if possible.
> > 
> > Just tried inline function, the code size is increased slightly:
> > 
> >           text   data    bss     dec    hex   filename
> > macro     1804    128      0    1932    78c   fs/proc/page.o
> > inline    1828    128      0    1956    7a4   fs/proc/page.o
> > 
> 
> hm, I wonder why.  Maybe it fixed a bug ;)
> 
> The code is effectively doing
> 
> 	if (expr1)
> 		something();
> 	if (expr1)
> 		something_else();
> 	if (expr1)
> 		something_else2();
> 
> etc.  Obviously we _hope_ that the compiler turns that into
> 
> 	if (expr1) {
> 		something();
> 		something_else();
> 		something_else2();
> 	}
> 
> for us, but it would be good to check...

By 'expr1', you mean (visible || genuine_linus())?

No, I can confirm the inefficiency does not lie here.

I simplified the kpf_copy_bit() to

        #define kpf_copy_bit(uflags, kflags, ubit, kbit)                     \
                        uflags |= (((kflags) >> (kbit)) & 1) << (ubit);

or

        static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
        {       
                return (((kflags) >> (kbit)) & 1) << (ubit);
        }

and double checked the differences: the gap grows unexpectedly!

              text               data                bss                dec            hex filename
macro         1829                168                  0               1997            7cd fs/proc/page.o
inline        1893                168                  0               2061            80d fs/proc/page.o
              +3.5%

(note: the larger absolute text size is due to some experimental code elsewhere.)

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-29  2:38       ` Wu Fengguang
  (?)
@ 2009-04-29  4:41         ` Nathan Lynch
  -1 siblings, 0 replies; 137+ messages in thread
From: Nathan Lynch @ 2009-04-29  4:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, linux-kernel, kosaki.motohiro, andi, mpm,
	adobriyan, linux-mm, Stephen Rothwell, Chandra Seetharaman,
	Olof Johansson, Helge Deller, linuxppc-dev

Wu Fengguang <fengguang.wu@intel.com> writes:

> On Wed, Apr 29, 2009 at 05:32:44AM +0800, Andrew Morton wrote:
>> On Tue, 28 Apr 2009 09:09:12 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>> 
>> > +/*
>> > + * Kernel flags are exported faithfully to Linus and his fellow hackers.
>> > + * Otherwise some details are masked to avoid confusing the end user:
>> > + * - some kernel flags are completely invisible
>> > + * - some kernel flags are conditionally invisible on their odd usages
>> > + */
>> > +#ifdef CONFIG_DEBUG_KERNEL
>> > +static inline int genuine_linus(void) { return 1; }
>> 
>> Although he's a fine chap, the use of the "_linus" tag isn't terribly
>> clear (to me).  I think what you're saying here is that this enables
>> kernel-developer-only features, yes?
>
> Yes.
>
>> If so, perhaps we could come up with an identifier which expresses that
>> more clearly.
>> 
>> But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
>> for _some_ reason, so what's the point?

At the least, it has not always been so...

>
> Good point! I can confirm my debian has CONFIG_DEBUG_KERNEL=Y!

I can confirm mine does not.

etch-i386:~# uname -a
Linux etch-i386 2.6.18-6-686 #1 SMP Fri Dec 12 16:48:28 UTC 2008 i686 GNU/Linux
etch-i386:~# grep DEBUG_KERNEL /boot/config-2.6.18-6-686 
# CONFIG_DEBUG_KERNEL is not set

For what that's worth.


>> It is preferable that we always implement the same interface for all
>> Kconfig settings.  If this exposes information which is confusing or
>> not useful to end-users then so be it - we should be able to cover that
>> in supporting documentation.
>
> My original patch takes that straightforward manner - and I still like it.
> I would be very glad to move the filtering code from kernel to user space.
>
> The use of more obscure flags could be discouraged by _not_ documenting
> them. A really curious user is encouraged to refer to the code for the
> exact meaning (and perhaps become a kernel developer ;-)
>
>> Also, as mentioned in the other email, it would be good if we were to
>> publish a little userspace app which people can use to access this raw
>> data.  We could give that application an `--i-am-a-kernel-developer'
>> option!
>
> OK. I'll include page-types.c in the next take.
>
>> > +#else
>> > +static inline int genuine_linus(void) { return 0; }
>> > +#endif
>> 
>> This isn't an appropriate use of CONFIG_DEBUG_KERNEL.
>> 
>> DEBUG_KERNEL is a Kconfig-only construct which is use to enable _other_
>> debugging features.  The way you've used it here, if the person who is
>> configuring the kernel wants to enable any other completely-unrelated
>> debug feature, they have to enable DEBUG_KERNEL first.  But when they
>> do that, they unexpectedly alter the behaviour of pagemap!
>> 
>> There are two other places where CONFIG_DEBUG_KERNEL affects code
>> generation in .c files: arch/parisc/mm/init.c and
>> arch/powerpc/kernel/sysfs.c.  These are both wrong, and need slapping ;)
>
> (add cc to related maintainers)

I assume I was cc'd because I've changed arch/powerpc/kernel/sysfs.c a
couple of times in the last year, but I can't claim to maintain that
code.  I'm pretty sure I haven't touched the code in question in this
discussion.  I've cc'd linuxppc-dev.


> CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means 
>
>         #ifdef CONFIG_DEBUG_KERNEL == #if 1
>
> as the following patch demos. Now it becomes obviously silly.

Sure, #if 1 is usually silly.  But if the point is that DEBUG_KERNEL is
not supposed to directly affect code generation, then I see two options
for powerpc:

- remove the #ifdef CONFIG_DEBUG_KERNEL guards from
  arch/powerpc/kernel/sysfs.c, unconditionally enabling the hid/ima
  sysfs attributes, or

- define a new config symbol which governs whether those attributes are
  enabled, and make it depend on DEBUG_KERNEL


> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -212,19 +212,19 @@ static SYSDEV_ATTR(purr, 0600, show_purr, store_purr);
>  #endif /* CONFIG_PPC64 */
>  
>  #ifdef HAS_PPC_PMC_PA6T
>  SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
>  SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
>  SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
>  SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
>  SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
>  SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
> -#ifdef CONFIG_DEBUG_KERNEL
> +#if 1
>  SYSFS_PMCSETUP(hid0, SPRN_HID0);
>  SYSFS_PMCSETUP(hid1, SPRN_HID1);
>  SYSFS_PMCSETUP(hid4, SPRN_HID4);
>  SYSFS_PMCSETUP(hid5, SPRN_HID5);
>  SYSFS_PMCSETUP(ima0, SPRN_PA6T_IMA0);
>  SYSFS_PMCSETUP(ima1, SPRN_PA6T_IMA1);
>  SYSFS_PMCSETUP(ima2, SPRN_PA6T_IMA2);
>  SYSFS_PMCSETUP(ima3, SPRN_PA6T_IMA3);
>  SYSFS_PMCSETUP(ima4, SPRN_PA6T_IMA4);
> @@ -282,19 +282,19 @@ static struct sysdev_attribute classic_pmc_attrs[] = {
>  static struct sysdev_attribute pa6t_attrs[] = {
>  	_SYSDEV_ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
>  	_SYSDEV_ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
>  	_SYSDEV_ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
>  	_SYSDEV_ATTR(pmc1, 0600, show_pa6t_pmc1, store_pa6t_pmc1),
>  	_SYSDEV_ATTR(pmc2, 0600, show_pa6t_pmc2, store_pa6t_pmc2),
>  	_SYSDEV_ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
>  	_SYSDEV_ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
>  	_SYSDEV_ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
> -#ifdef CONFIG_DEBUG_KERNEL
> +#if 1
>  	_SYSDEV_ATTR(hid0, 0600, show_hid0, store_hid0),
>  	_SYSDEV_ATTR(hid1, 0600, show_hid1, store_hid1),
>  	_SYSDEV_ATTR(hid4, 0600, show_hid4, store_hid4),
>  	_SYSDEV_ATTR(hid5, 0600, show_hid5, store_hid5),
>  	_SYSDEV_ATTR(ima0, 0600, show_ima0, store_ima0),
>  	_SYSDEV_ATTR(ima1, 0600, show_ima1, store_ima1),
>  	_SYSDEV_ATTR(ima2, 0600, show_ima2, store_ima2),
>  	_SYSDEV_ATTR(ima3, 0600, show_ima3, store_ima3),
>  	_SYSDEV_ATTR(ima4, 0600, show_ima4, store_ima4),
>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-29  4:41         ` Nathan Lynch
  0 siblings, 0 replies; 137+ messages in thread
From: Nathan Lynch @ 2009-04-29  4:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, linux-kernel, kosaki.motohiro, andi, mpm,
	adobriyan, linux-mm, Stephen Rothwell, Chandra Seetharaman,
	Olof Johansson, Helge Deller, linuxppc-dev

Wu Fengguang <fengguang.wu@intel.com> writes:

> On Wed, Apr 29, 2009 at 05:32:44AM +0800, Andrew Morton wrote:
>> On Tue, 28 Apr 2009 09:09:12 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>> 
>> > +/*
>> > + * Kernel flags are exported faithfully to Linus and his fellow hackers.
>> > + * Otherwise some details are masked to avoid confusing the end user:
>> > + * - some kernel flags are completely invisible
>> > + * - some kernel flags are conditionally invisible on their odd usages
>> > + */
>> > +#ifdef CONFIG_DEBUG_KERNEL
>> > +static inline int genuine_linus(void) { return 1; }
>> 
>> Although he's a fine chap, the use of the "_linus" tag isn't terribly
>> clear (to me).  I think what you're saying here is that this enables
>> kernel-developer-only features, yes?
>
> Yes.
>
>> If so, perhaps we could come up with an identifier which expresses that
>> more clearly.
>> 
>> But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
>> for _some_ reason, so what's the point?

At the least, it has not always been so...

>
> Good point! I can confirm my debian has CONFIG_DEBUG_KERNEL=Y!

I can confirm mine does not.

etch-i386:~# uname -a
Linux etch-i386 2.6.18-6-686 #1 SMP Fri Dec 12 16:48:28 UTC 2008 i686 GNU/Linux
etch-i386:~# grep DEBUG_KERNEL /boot/config-2.6.18-6-686 
# CONFIG_DEBUG_KERNEL is not set

For what that's worth.


>> It is preferable that we always implement the same interface for all
>> Kconfig settings.  If this exposes information which is confusing or
>> not useful to end-users then so be it - we should be able to cover that
>> in supporting documentation.
>
> My original patch takes that straightforward manner - and I still like it.
> I would be very glad to move the filtering code from kernel to user space.
>
> The use of more obscure flags could be discouraged by _not_ documenting
> them. A really curious user is encouraged to refer to the code for the
> exact meaning (and perhaps become a kernel developer ;-)
>
>> Also, as mentioned in the other email, it would be good if we were to
>> publish a little userspace app which people can use to access this raw
>> data.  We could give that application an `--i-am-a-kernel-developer'
>> option!
>
> OK. I'll include page-types.c in the next take.
>
>> > +#else
>> > +static inline int genuine_linus(void) { return 0; }
>> > +#endif
>> 
>> This isn't an appropriate use of CONFIG_DEBUG_KERNEL.
>> 
>> DEBUG_KERNEL is a Kconfig-only construct which is use to enable _other_
>> debugging features.  The way you've used it here, if the person who is
>> configuring the kernel wants to enable any other completely-unrelated
>> debug feature, they have to enable DEBUG_KERNEL first.  But when they
>> do that, they unexpectedly alter the behaviour of pagemap!
>> 
>> There are two other places where CONFIG_DEBUG_KERNEL affects code
>> generation in .c files: arch/parisc/mm/init.c and
>> arch/powerpc/kernel/sysfs.c.  These are both wrong, and need slapping ;)
>
> (add cc to related maintainers)

I assume I was cc'd because I've changed arch/powerpc/kernel/sysfs.c a
couple of times in the last year, but I can't claim to maintain that
code.  I'm pretty sure I haven't touched the code in question in this
discussion.  I've cc'd linuxppc-dev.


> CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means 
>
>         #ifdef CONFIG_DEBUG_KERNEL == #if 1
>
> as the following patch demos. Now it becomes obviously silly.

Sure, #if 1 is usually silly.  But if the point is that DEBUG_KERNEL is
not supposed to directly affect code generation, then I see two options
for powerpc:

- remove the #ifdef CONFIG_DEBUG_KERNEL guards from
  arch/powerpc/kernel/sysfs.c, unconditionally enabling the hid/ima
  sysfs attributes, or

- define a new config symbol which governs whether those attributes are
  enabled, and make it depend on DEBUG_KERNEL


> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -212,19 +212,19 @@ static SYSDEV_ATTR(purr, 0600, show_purr, store_purr);
>  #endif /* CONFIG_PPC64 */
>  
>  #ifdef HAS_PPC_PMC_PA6T
>  SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
>  SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
>  SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
>  SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
>  SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
>  SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
> -#ifdef CONFIG_DEBUG_KERNEL
> +#if 1
>  SYSFS_PMCSETUP(hid0, SPRN_HID0);
>  SYSFS_PMCSETUP(hid1, SPRN_HID1);
>  SYSFS_PMCSETUP(hid4, SPRN_HID4);
>  SYSFS_PMCSETUP(hid5, SPRN_HID5);
>  SYSFS_PMCSETUP(ima0, SPRN_PA6T_IMA0);
>  SYSFS_PMCSETUP(ima1, SPRN_PA6T_IMA1);
>  SYSFS_PMCSETUP(ima2, SPRN_PA6T_IMA2);
>  SYSFS_PMCSETUP(ima3, SPRN_PA6T_IMA3);
>  SYSFS_PMCSETUP(ima4, SPRN_PA6T_IMA4);
> @@ -282,19 +282,19 @@ static struct sysdev_attribute classic_pmc_attrs[] = {
>  static struct sysdev_attribute pa6t_attrs[] = {
>  	_SYSDEV_ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
>  	_SYSDEV_ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
>  	_SYSDEV_ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
>  	_SYSDEV_ATTR(pmc1, 0600, show_pa6t_pmc1, store_pa6t_pmc1),
>  	_SYSDEV_ATTR(pmc2, 0600, show_pa6t_pmc2, store_pa6t_pmc2),
>  	_SYSDEV_ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
>  	_SYSDEV_ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
>  	_SYSDEV_ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
> -#ifdef CONFIG_DEBUG_KERNEL
> +#if 1
>  	_SYSDEV_ATTR(hid0, 0600, show_hid0, store_hid0),
>  	_SYSDEV_ATTR(hid1, 0600, show_hid1, store_hid1),
>  	_SYSDEV_ATTR(hid4, 0600, show_hid4, store_hid4),
>  	_SYSDEV_ATTR(hid5, 0600, show_hid5, store_hid5),
>  	_SYSDEV_ATTR(ima0, 0600, show_ima0, store_ima0),
>  	_SYSDEV_ATTR(ima1, 0600, show_ima1, store_ima1),
>  	_SYSDEV_ATTR(ima2, 0600, show_ima2, store_ima2),
>  	_SYSDEV_ATTR(ima3, 0600, show_ima3, store_ima3),
>  	_SYSDEV_ATTR(ima4, 0600, show_ima4, store_ima4),
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-29  4:41         ` Nathan Lynch
  0 siblings, 0 replies; 137+ messages in thread
From: Nathan Lynch @ 2009-04-29  4:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Stephen Rothwell, Chandra Seetharaman, Olof, linuxppc-dev,
	linux-kernel, Helge Deller, linux-mm, andi, kosaki.motohiro, mpm,
	Johansson, Andrew Morton, adobriyan

Wu Fengguang <fengguang.wu@intel.com> writes:

> On Wed, Apr 29, 2009 at 05:32:44AM +0800, Andrew Morton wrote:
>> On Tue, 28 Apr 2009 09:09:12 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>> 
>> > +/*
>> > + * Kernel flags are exported faithfully to Linus and his fellow hackers.
>> > + * Otherwise some details are masked to avoid confusing the end user:
>> > + * - some kernel flags are completely invisible
>> > + * - some kernel flags are conditionally invisible on their odd usages
>> > + */
>> > +#ifdef CONFIG_DEBUG_KERNEL
>> > +static inline int genuine_linus(void) { return 1; }
>> 
>> Although he's a fine chap, the use of the "_linus" tag isn't terribly
>> clear (to me).  I think what you're saying here is that this enables
>> kernel-developer-only features, yes?
>
> Yes.
>
>> If so, perhaps we could come up with an identifier which expresses that
>> more clearly.
>> 
>> But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
>> for _some_ reason, so what's the point?

At the least, it has not always been so...

>
> Good point! I can confirm my debian has CONFIG_DEBUG_KERNEL=Y!

I can confirm mine does not.

etch-i386:~# uname -a
Linux etch-i386 2.6.18-6-686 #1 SMP Fri Dec 12 16:48:28 UTC 2008 i686 GNU/Linux
etch-i386:~# grep DEBUG_KERNEL /boot/config-2.6.18-6-686 
# CONFIG_DEBUG_KERNEL is not set

For what that's worth.


>> It is preferable that we always implement the same interface for all
>> Kconfig settings.  If this exposes information which is confusing or
>> not useful to end-users then so be it - we should be able to cover that
>> in supporting documentation.
>
> My original patch takes that straightforward manner - and I still like it.
> I would be very glad to move the filtering code from kernel to user space.
>
> The use of more obscure flags could be discouraged by _not_ documenting
> them. A really curious user is encouraged to refer to the code for the
> exact meaning (and perhaps become a kernel developer ;-)
>
>> Also, as mentioned in the other email, it would be good if we were to
>> publish a little userspace app which people can use to access this raw
>> data.  We could give that application an `--i-am-a-kernel-developer'
>> option!
>
> OK. I'll include page-types.c in the next take.
>
>> > +#else
>> > +static inline int genuine_linus(void) { return 0; }
>> > +#endif
>> 
>> This isn't an appropriate use of CONFIG_DEBUG_KERNEL.
>> 
>> DEBUG_KERNEL is a Kconfig-only construct which is use to enable _other_
>> debugging features.  The way you've used it here, if the person who is
>> configuring the kernel wants to enable any other completely-unrelated
>> debug feature, they have to enable DEBUG_KERNEL first.  But when they
>> do that, they unexpectedly alter the behaviour of pagemap!
>> 
>> There are two other places where CONFIG_DEBUG_KERNEL affects code
>> generation in .c files: arch/parisc/mm/init.c and
>> arch/powerpc/kernel/sysfs.c.  These are both wrong, and need slapping ;)
>
> (add cc to related maintainers)

I assume I was cc'd because I've changed arch/powerpc/kernel/sysfs.c a
couple of times in the last year, but I can't claim to maintain that
code.  I'm pretty sure I haven't touched the code in question in this
discussion.  I've cc'd linuxppc-dev.


> CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means 
>
>         #ifdef CONFIG_DEBUG_KERNEL == #if 1
>
> as the following patch demos. Now it becomes obviously silly.

Sure, #if 1 is usually silly.  But if the point is that DEBUG_KERNEL is
not supposed to directly affect code generation, then I see two options
for powerpc:

- remove the #ifdef CONFIG_DEBUG_KERNEL guards from
  arch/powerpc/kernel/sysfs.c, unconditionally enabling the hid/ima
  sysfs attributes, or

- define a new config symbol which governs whether those attributes are
  enabled, and make it depend on DEBUG_KERNEL


> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -212,19 +212,19 @@ static SYSDEV_ATTR(purr, 0600, show_purr, store_purr);
>  #endif /* CONFIG_PPC64 */
>  
>  #ifdef HAS_PPC_PMC_PA6T
>  SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
>  SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
>  SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
>  SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
>  SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
>  SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
> -#ifdef CONFIG_DEBUG_KERNEL
> +#if 1
>  SYSFS_PMCSETUP(hid0, SPRN_HID0);
>  SYSFS_PMCSETUP(hid1, SPRN_HID1);
>  SYSFS_PMCSETUP(hid4, SPRN_HID4);
>  SYSFS_PMCSETUP(hid5, SPRN_HID5);
>  SYSFS_PMCSETUP(ima0, SPRN_PA6T_IMA0);
>  SYSFS_PMCSETUP(ima1, SPRN_PA6T_IMA1);
>  SYSFS_PMCSETUP(ima2, SPRN_PA6T_IMA2);
>  SYSFS_PMCSETUP(ima3, SPRN_PA6T_IMA3);
>  SYSFS_PMCSETUP(ima4, SPRN_PA6T_IMA4);
> @@ -282,19 +282,19 @@ static struct sysdev_attribute classic_pmc_attrs[] = {
>  static struct sysdev_attribute pa6t_attrs[] = {
>  	_SYSDEV_ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
>  	_SYSDEV_ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
>  	_SYSDEV_ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
>  	_SYSDEV_ATTR(pmc1, 0600, show_pa6t_pmc1, store_pa6t_pmc1),
>  	_SYSDEV_ATTR(pmc2, 0600, show_pa6t_pmc2, store_pa6t_pmc2),
>  	_SYSDEV_ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
>  	_SYSDEV_ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
>  	_SYSDEV_ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
> -#ifdef CONFIG_DEBUG_KERNEL
> +#if 1
>  	_SYSDEV_ATTR(hid0, 0600, show_hid0, store_hid0),
>  	_SYSDEV_ATTR(hid1, 0600, show_hid1, store_hid1),
>  	_SYSDEV_ATTR(hid4, 0600, show_hid4, store_hid4),
>  	_SYSDEV_ATTR(hid5, 0600, show_hid5, store_hid5),
>  	_SYSDEV_ATTR(ima0, 0600, show_ima0, store_ima0),
>  	_SYSDEV_ATTR(ima1, 0600, show_ima1, store_ima1),
>  	_SYSDEV_ATTR(ima2, 0600, show_ima2, store_ima2),
>  	_SYSDEV_ATTR(ima3, 0600, show_ima3, store_ima3),
>  	_SYSDEV_ATTR(ima4, 0600, show_ima4, store_ima4),
>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-29  4:41         ` Nathan Lynch
  (?)
@ 2009-04-29  4:50           ` Andrew Morton
  -1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-29  4:50 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Wu Fengguang, linux-kernel, kosaki.motohiro, andi, mpm,
	adobriyan, linux-mm, Stephen Rothwell, Chandra Seetharaman,
	Olof Johansson, Helge Deller, linuxppc-dev

On Tue, 28 Apr 2009 23:41:52 -0500 Nathan Lynch <ntl@pobox.com> wrote:

> > CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means 
> >
> >         #ifdef CONFIG_DEBUG_KERNEL == #if 1
> >
> > as the following patch demos. Now it becomes obviously silly.
> 
> Sure, #if 1 is usually silly.  But if the point is that DEBUG_KERNEL is
> not supposed to directly affect code generation, then I see two options
> for powerpc:
> 
> - remove the #ifdef CONFIG_DEBUG_KERNEL guards from
>   arch/powerpc/kernel/sysfs.c, unconditionally enabling the hid/ima
>   sysfs attributes, or
> 
> - define a new config symbol which governs whether those attributes are
>   enabled, and make it depend on DEBUG_KERNEL

yup.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-29  4:50           ` Andrew Morton
  0 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-29  4:50 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Wu Fengguang, linux-kernel, kosaki.motohiro, andi, mpm,
	adobriyan, linux-mm, Stephen Rothwell, Chandra Seetharaman,
	Olof Johansson, Helge Deller, linuxppc-dev

On Tue, 28 Apr 2009 23:41:52 -0500 Nathan Lynch <ntl@pobox.com> wrote:

> > CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means 
> >
> >         #ifdef CONFIG_DEBUG_KERNEL == #if 1
> >
> > as the following patch demos. Now it becomes obviously silly.
> 
> Sure, #if 1 is usually silly.  But if the point is that DEBUG_KERNEL is
> not supposed to directly affect code generation, then I see two options
> for powerpc:
> 
> - remove the #ifdef CONFIG_DEBUG_KERNEL guards from
>   arch/powerpc/kernel/sysfs.c, unconditionally enabling the hid/ima
>   sysfs attributes, or
> 
> - define a new config symbol which governs whether those attributes are
>   enabled, and make it depend on DEBUG_KERNEL

yup.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-29  4:50           ` Andrew Morton
  0 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-29  4:50 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Stephen Rothwell, Helge, Seetharaman, Deller, linuxppc-dev,
	linux-kernel, linux-mm, andi, Chandra, kosaki.motohiro, mpm,
	Olof Johansson, Wu Fengguang, adobriyan

On Tue, 28 Apr 2009 23:41:52 -0500 Nathan Lynch <ntl@pobox.com> wrote:

> > CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means 
> >
> >         #ifdef CONFIG_DEBUG_KERNEL == #if 1
> >
> > as the following patch demos. Now it becomes obviously silly.
> 
> Sure, #if 1 is usually silly.  But if the point is that DEBUG_KERNEL is
> not supposed to directly affect code generation, then I see two options
> for powerpc:
> 
> - remove the #ifdef CONFIG_DEBUG_KERNEL guards from
>   arch/powerpc/kernel/sysfs.c, unconditionally enabling the hid/ima
>   sysfs attributes, or
> 
> - define a new config symbol which governs whether those attributes are
>   enabled, and make it depend on DEBUG_KERNEL

yup.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-29  3:48           ` Wu Fengguang
@ 2009-04-29  5:09             ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm,
	Stephen Rothwell, Chandra Seetharaman, Nathan Lynch,
	Olof Johansson, Helge Deller

On Wed, Apr 29, 2009 at 11:48:29AM +0800, Wu Fengguang wrote:
> On Wed, Apr 29, 2009 at 10:55:27AM +0800, Andrew Morton wrote:
> > On Wed, 29 Apr 2009 10:38:42 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > > > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)		\
> > > > > +	do {								\
> > > > > +		if (visible || genuine_linus())				\
> > > > > +			uflags |= ((kflags >> kbit) & 1) << ubit;	\
> > > > > +	} while (0);
> > > > 
> > > > Did this have to be implemented as a macro?
> > > > 
> > > > It's bad, because it might or might not reference its argument, so if
> > > > someone passes it an expression-with-side-effects, the end result is
> > > > unpredictable.  A C function is almost always preferable if possible.
> > > 
> > > Just tried inline function, the code size is increased slightly:
> > > 
> > >           text   data    bss     dec    hex   filename
> > > macro     1804    128      0    1932    78c   fs/proc/page.o
> > > inline    1828    128      0    1956    7a4   fs/proc/page.o
> > > 
> > 
> > hm, I wonder why.  Maybe it fixed a bug ;)
> > 
> > The code is effectively doing
> > 
> > 	if (expr1)
> > 		something();
> > 	if (expr1)
> > 		something_else();
> > 	if (expr1)
> > 		something_else2();
> > 
> > etc.  Obviously we _hope_ that the compiler turns that into
> > 
> > 	if (expr1) {
> > 		something();
> > 		something_else();
> > 		something_else2();
> > 	}
> > 
> > for us, but it would be good to check...
> 
> By 'expr1', you mean (visible || genuine_linus())?
> 
> No, I can confirm the inefficiency does not lie here.
> 
> I simplified the kpf_copy_bit() to
> 
>         #define kpf_copy_bit(uflags, kflags, ubit, kbit)                     \
>                         uflags |= (((kflags) >> (kbit)) & 1) << (ubit);
> 
> or
> 
>         static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
>         {       
>                 return (((kflags) >> (kbit)) & 1) << (ubit);
>         }
> 
> and double checked the differences: the gap grows unexpectedly!
> 
>               text               data                bss                dec            hex filename
> macro         1829                168                  0               1997            7cd fs/proc/page.o
> inline        1893                168                  0               2061            80d fs/proc/page.o
>               +3.5%
> 
> (note: the larger absolute text size is due to some experimental code elsewhere.)

Wow, after simplifications the text size goes down by -13.2%:

              text               data                bss                dec            hex filename
macro         1644                  8                  0               1652            674 fs/proc/page.o
inline        1644                  8                  0               1652            674 fs/proc/page.o

Amazingly we can now use inline function without performance penalty!

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 137+ messages in thread


* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-28 17:49     ` Matt Mackall
@ 2009-04-29  8:05       ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29  8:05 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Alexey Dobriyan, linux-mm

On Wed, Apr 29, 2009 at 01:49:21AM +0800, Matt Mackall wrote:
> On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> > plain text document attachment (kpageflags-extending.patch)
> > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> 
> My only concern with this patch is it knows a bit too much about SLUB
> internals (and perhaps not enough about SLOB, which also overloads
> flags). 

Yup. PG_private=PG_slob_free is not masked because SLOB actually does
not set PG_slab at all. I wonder if it's safe to do this change:

        /* SLOB */
-       PG_slob_page = PG_active,
+       PG_slob_page = PG_slab,
        PG_slob_free = PG_private,
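
(For reference, a trimmed sketch of how this aliasing lives in
include/linux/page-flags.h - several flags are omitted, so the exact
bit numbers differ from the real enum:)

        enum pageflags_sketch {
                PG_locked,
                PG_error,
                PG_referenced,
                PG_uptodate,
                PG_dirty,
                PG_lru,
                PG_active,
                PG_slab,
                /* many flags omitted here */
                PG_private,
                /* SLOB overloads existing bits rather than taking new ones: */
                PG_slob_page = PG_active,   /* -> PG_slab with the change above */
                PG_slob_free = PG_private,
        };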


In the page-types output:

         flags  page-count       MB  symbolic-flags                     long-symbolic-flags
0x000800000040        7113       27  ______A_________________P____      active,private
0x000000000040          66        0  ______A______________________      active

The above two lines are obviously for SLOB pages.  They indicate lots of
free SLOB pages. So my question is:

- Do you have other means to get the nr_free_slobs info? (I found none in the code)
or
- Is exporting the SL*B overloaded flags going to help?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 137+ messages in thread


* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-29  8:05       ` Wu Fengguang
@ 2009-04-29 19:13         ` Matt Mackall
  -1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-29 19:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Alexey Dobriyan, linux-mm

On Wed, 2009-04-29 at 16:05 +0800, Wu Fengguang wrote:
> On Wed, Apr 29, 2009 at 01:49:21AM +0800, Matt Mackall wrote:
> > On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> > > plain text document attachment (kpageflags-extending.patch)
> > > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> > 
> > My only concern with this patch is it knows a bit too much about SLUB
> > internals (and perhaps not enough about SLOB, which also overloads
> > flags). 
> 
> Yup. PG_private=PG_slob_free is not masked because SLOB actually does
> not set PG_slab at all. I wonder if it's safe to do this change:
> 
>         /* SLOB */
> -       PG_slob_page = PG_active,
> +       PG_slob_page = PG_slab,
>         PG_slob_free = PG_private,

Yep.

> In the page-types output:
> 
>          flags  page-count       MB  symbolic-flags                     long-symbolic-flags
> 0x000800000040        7113       27  ______A_________________P____      active,private
> 0x000000000040          66        0  ______A______________________      active
> 
> > The above two lines are obviously for SLOB pages.  They indicate lots of
> > free SLOB pages. So my question is:

Free here just means partially allocated.

> - Do you have other means to get the nr_free_slobs info? (I found none in the code)
> or
> > - Is exporting the SL*B overloaded flags going to help?

Yes, it's useful.

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 137+ messages in thread


* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
  2009-04-29 19:13         ` Matt Mackall
@ 2009-04-30  1:00           ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-30  1:00 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
	Alexey Dobriyan, linux-mm

On Thu, Apr 30, 2009 at 03:13:56AM +0800, Matt Mackall wrote:
> On Wed, 2009-04-29 at 16:05 +0800, Wu Fengguang wrote:
> > On Wed, Apr 29, 2009 at 01:49:21AM +0800, Matt Mackall wrote:
> > > On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> > > > plain text document attachment (kpageflags-extending.patch)
> > > > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> > > 
> > > My only concern with this patch is it knows a bit too much about SLUB
> > > internals (and perhaps not enough about SLOB, which also overloads
> > > flags). 
> > 
> > Yup. PG_private=PG_slob_free is not masked because SLOB actually does
> > not set PG_slab at all. I wonder if it's safe to do this change:
> > 
> >         /* SLOB */
> > -       PG_slob_page = PG_active,
> > +       PG_slob_page = PG_slab,
> >         PG_slob_free = PG_private,
> 
> Yep.

OK. I'll do it - for consistency.

> > In the page-types output:
> > 
> >          flags  page-count       MB  symbolic-flags                     long-symbolic-flags
> > 0x000800000040        7113       27  ______A_________________P____      active,private
> > 0x000000000040          66        0  ______A______________________      active
> > 
> > The above two lines are obviously for SLOB pages.  They indicate lots of
> > free SLOB pages. So my question is:
> 
> Free here just means partially allocated.

Yes, I realized this when lying in bed ;-)

> > - Do you have other means to get the nr_free_slobs info? (I found none in the code)
> > or
> > - Is exporting the SL*B overloaded flags going to help?
> 
> Yes, it's useful.

Thank you. SLUB/SLOB overload different page flags, so it's possible
for user space tools to restore their real meanings - ugly but useful.
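
For example, a user space tool could classify SLOB pages roughly like
this (a sketch only: the bit positions are read off the page-types
output above - active is bit 6, the kernel-only private flag is bit 35
in my tree - and PG_slob_page == PG_active before the change above):

        #include <stdint.h>

        #define KPF_ACTIVE      6       /* PG_slob_page == PG_active */
        #define KPF_PRIVATE     35      /* PG_slob_free == PG_private */

        /* only meaningful on a CONFIG_SLOB kernel: on SLAB/SLUB the
         * active bit marks ordinary LRU pages instead */
        static int is_slob_page(uint64_t kpf)
        {
                return (kpf >> KPF_ACTIVE) & 1;
        }

        static int is_slob_free(uint64_t kpf)
        {
                /* "free" means the page still has unallocated space */
                return is_slob_page(kpf) && ((kpf >> KPF_PRIVATE) & 1);
        }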

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 137+ messages in thread


* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
  2009-04-28 13:31                             ` Wu Fengguang
@ 2009-05-12 13:01                               ` Frederic Weisbecker
  -1 siblings, 0 replies; 137+ messages in thread
From: Frederic Weisbecker @ 2009-05-12 13:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Ingo Molnar, Li Zefan, Tom Zanussi, KOSAKI Motohiro,
	Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> > 
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > > The above 'get object state' interface (which allows passive 
> > > > sampling) - integrated into the tracing framework - would serve 
> > > > that goal, agreed?
> > > 
> > > Agreed. That could in theory be a good complement to dynamic 
> > > tracing.
> > > 
> > > Then what will be the canonical form for all the 'get object 
> > > state' interfaces - "object.attr=value", or whatever? [...]
> > 
> > Lemme outline what i'm thinking of.
> > 
> > I'd call the feature "object collection tracing", which would live 
> > in /debug/tracing, accessed via such files:
> > 
> >   /debug/tracing/objects/mm/pages/
> >   /debug/tracing/objects/mm/pages/format
> >   /debug/tracing/objects/mm/pages/filter
> >   /debug/tracing/objects/mm/pages/trace_pipe
> >   /debug/tracing/objects/mm/pages/stats
> >   /debug/tracing/objects/mm/pages/events/
> > 
> > here's the (proposed) semantics of those files:
> > 
> > 1) /debug/tracing/objects/mm/pages/
> > 
> > There's a subsystem / object basic directory structure to make it 
> > easy and intuitive to find our way around there.
> > 
> > 2) /debug/tracing/objects/mm/pages/format
> > 
> > the format file:
> > 
> >   /debug/tracing/objects/mm/pages/format
> > 
> > Would reuse the existing dynamic-tracepoint structured-logging 
> > descriptor format and code (this is upstream already):
> > 
> >  [root@phoenix sched_signal_send]# pwd
> >  /debug/tracing/events/sched/sched_signal_send
> > 
> >  [root@phoenix sched_signal_send]# cat format 
> >  name: sched_signal_send
> >  ID: 24
> >  format:
> > 	field:unsigned short common_type;		offset:0;	size:2;
> > 	field:unsigned char common_flags;		offset:2;	size:1;
> > 	field:unsigned char common_preempt_count;	offset:3;	size:1;
> > 	field:int common_pid;				offset:4;	size:4;
> > 	field:int common_tgid;				offset:8;	size:4;
> > 
> > 	field:int sig;					offset:12;	size:4;
> > 	field:char comm[TASK_COMM_LEN];			offset:16;	size:16;
> > 	field:pid_t pid;				offset:32;	size:4;
> > 
> >  print fmt: "sig: %d  task %s:%d", REC->sig, REC->comm, REC->pid
> > 
> > These format descriptors enumerate fields, types and sizes, in a 
> > structured way that user-space tools can parse easily. (The binary 
> > records that come from the trace_pipe file follow this format 
> > description.)
> > 
> > 3) /debug/tracing/objects/mm/pages/filter
> > 
> > This is the tracing filter that can be set based on the 'format' 
> > descriptor. So with the above (signal-send tracepoint) you can 
> > define such filter expressions:
> > 
> >   echo "(sig == 10 && comm == bash) || sig == 13" > filter
> > 
> > To restrict the 'scope' of the object collection along pretty much 
> > any key or combination of keys. (Or you can leave it as it is and 
> > dump all objects and do keying in user-space.)
> > 
> > [ Using in-kernel filtering is obviously faster than streaming it 
> >   out to user-space - but there might be details and types of 
> >   visualization you want to do in user-space - so we don't want to 
> >   restrict things here. ]
> > 
> > For the mm object collection tracepoint i could imagine such filter 
> > expressions:
> > 
> >   echo "type == shared && file == /sbin/init" > filter
> > 
> > To dump all shared pages that are mapped to /sbin/init.
> > 
> > 4) /debug/tracing/objects/mm/pages/trace_pipe
> > 
> > The 'trace_pipe' file can be used to dump all objects in the 
> > collection, which match the filter ('all objects' by default). The 
> > record format is described in 'format'.
> > 
> > trace_pipe would be a reuse of the existing trace_pipe code: it is a 
> > modern, poll()-able, read()-able, splice()-able pipe abstraction.
> > 
> > 5) /debug/tracing/objects/mm/pages/stats
> > 
> > The 'stats' file would be a reuse of the existing histogram code of 
> > the tracing code. We already make use of it for the branch tracers 
> > and for the workqueue tracer - it could be extended to be applicable 
> > to object collections as well.
> > 
> > The advantage there would be that there's no dumping at all - all 
> > the integration is done straight in the kernel. ( The 'filter' 
> > condition is listened to - increasing flexibility. The filter file 
> > could perhaps also act as a default histogram key. )
> > 
> > 6) /debug/tracing/objects/mm/pages/events/
> > 
> > The 'events' directory offers links back to existing dynamic 
> > tracepoints that are under /debug/tracing/events/. This would serve 
> > as an additional coherent force that keeps dynamic tracepoints 
> > collected by subsystem and by object type as well. (Tools could make 
> > use of this information as well - without being aware of actual 
> > object semantics.)
> > 
> > 
> > There would be a number of other object collections we could 
> > enumerate:
> > 
> >  tasks:
> > 
> >   /debug/tracing/objects/sched/tasks/
> > 
> >  active inodes known to the kernel:
> > 
> >   /debug/tracing/objects/fs/inodes/
> > 
> >  interrupts:
> > 
> >   /debug/tracing/objects/hw/irqs/
> > 
> > etc.
> > 
> > These would use the same 'object collection' framework. Once done we 
> > can use it for many other things too.
> > 
> > Note how organically integrated it all is with the tracing 
> > framework. You could start from an 'object view' to get an overview 
> > and then go towards a more dynamic view of specific object 
> > attributes (or specific objects), as you drill down on a specific 
> > problem you want to analyze.
> > 
> > How does this all sound to you?
> 
> Great! I saw much opportunity to adapt the not yet submitted
> /proc/filecache interface to the proposed framework.
> 
> Its basic form is:
> 
> #      ino       size   cached cached% refcnt state       age accessed  process         dev             file
> [snip]
>        320          1        4     100      1    D-     50443     1085 udevd           00:11(tmpfs)     /.udev/uevent_seqnum
>     460725        123      124     100     35    --     50444     6795 touch           08:02(sda2)      /lib/libpthread-2.9.so
>     460727         31       32     100     14    --     50444     2007 touch           08:02(sda2)      /lib/librt-2.9.so
>     458865         97       80      82      1    --     50444       49 mount           08:02(sda2)      /lib/libdevmapper.so.1.02.1
>     460090         15       16     100      1    --     50444       48 mount           08:02(sda2)      /lib/libuuid.so.1.2
>     458866         46       48     100      1    --     50444       47 mount           08:02(sda2)      /lib/libblkid.so.1.0
>     460732         43       44     100     69    --     50444     3581 rcS             08:02(sda2)      /lib/libnss_nis-2.9.so
>     460739         87       88     100     73    --     50444     3597 rcS             08:02(sda2)      /lib/libnsl-2.9.so
>     460726         31       32     100     69    --     50444     3581 rcS             08:02(sda2)      /lib/libnss_compat-2.9.so
>     458804        250      252     100     11    --     50445     8175 rcS             08:02(sda2)      /lib/libncurses.so.5.6
>     229540        780      752      96      3    --     50445     7594 init            08:02(sda2)      /bin/bash
>     460735         15       16     100     89    --     50445    17581 init            08:02(sda2)      /lib/libdl-2.9.so
>     460721       1344     1340      99    117    --     50445    48732 init            08:02(sda2)      /lib/libc-2.9.so
>     458801        107      104      97     24    --     50445     3586 init            08:02(sda2)      /lib/libselinux.so.1
>     671870         37       24      65      1    --     50446        1 swapper         08:02(sda2)      /sbin/init
>        175          1    24412     100      1    --     50446        0 swapper         00:01(rootfs)    /dev/root
> 
> The patch basically does a traversal through one or more of the inode
> lists to produce the output:
>         inode_in_use
>         inode_unused
>         sb->s_dirty
>         sb->s_io
>         sb->s_more_io
>         sb->s_inodes
> 
> The filtering feature is a necessity for this interface - or it will
> take considerable time to do a full listing. It supports the following
> filters:
>         { LS_OPT_DIRTY,         "dirty"         },
>         { LS_OPT_CLEAN,         "clean"         },
>         { LS_OPT_INUSE,         "inuse"         },
>         { LS_OPT_EMPTY,         "empty"         },
>         { LS_OPT_ALL,           "all"           },
>         { LS_OPT_DEV,           "dev=%s"        },
> 
> There are two possible challenges for the conversion:
> 
> - One trick it does is to select different lists to traverse on
>   different filter options. Will this be possible in the object
>   tracing framework?



Yeah, I guess.



> - The file name lookup (last field) is the performance killer. Is it
>   possible to skip the file name lookup when the filter fails on the
>   leading fields?


Object collection is built on trace events, whose filters basically ignore
a whole entry when it doesn't match. I'm not sure we can easily
skip just one field.
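
Roughly, the situation is like this (a toy model, not the actual
ftrace filter code):

        /* the predicate runs on a complete record, so an expensive
         * field like the file name has already been filled in by the
         * time the filter can reject the entry */
        struct filecache_rec {
                unsigned long   ino;
                unsigned long   cached_kb;
                char            file[256];      /* costly d_path() lookup */
        };

        static int filter_match(const struct filecache_rec *r)
        {
                return r->cached_kb > 0;        /* whole-entry match/drop */
        }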

But I guess we can do something about the performance...

Could you send us the (sob'ed) patch you made which implements this?
I could try to adapt it to object collection.

Thanks,
Frederic.


> Will the object tracing interface allow such flexibilities?
> (Sorry I'm not yet familiar with the tracing framework.)
> 
> > Can you see any conceptual holes in the scheme, any use-case that 
> > /proc/kpageflags supports but the object collection approach does 
> > not?
> 
> kpageflags is simply a big (perhaps sparse) binary array.
> I'd still prefer to retain its current form - the kernel patches and
> user space tools are all ready made, and I see no benefits in
> converting to the tracing framework.
> 
> > Would you be interested in seeing something like this, if we tried 
> > to implement it in the tracing tree? The majority of the code 
> > already exists, we just need interest from the MM side and we have 
> > to hook it all up. (it is by no means trivial to do - but looks like
> > a very exciting feature.)
> 
> Definitely! /proc/filecache has another 'page view':
> 
>         # head /proc/filecache
>         # file /bin/bash
>         # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback X:readahead P:private O:owner b:buffer d:dirty w:writeback
>         # idx   len     state           refcnt
>         0       1       RAMU________    4
>         3       8       RAMU________    4
>         12      1       RAMU________    4
>         14      5       RAMU________    4
>         20      7       RAMU________    4
>         27      2       RAMU________    5
>         29      1       RAMU________    4
> 
> Which is also a good candidate. However I still need to investigate
> whether it offers considerable margins over the mincore() syscall.
> 
> Thanks and Regards,
> Fengguang


^ permalink raw reply	[flat|nested] 137+ messages in thread


* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
  2009-05-12 13:01                               ` Frederic Weisbecker
  (?)
@ 2009-05-17 13:36                               ` Wu Fengguang
  2009-05-17 13:55                                   ` Frederic Weisbecker
  2009-05-18 11:44                                   ` KOSAKI Motohiro
  -1 siblings, 2 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-05-17 13:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Li Zefan, Tom Zanussi, KOSAKI Motohiro,
	Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3473 bytes --]

On Tue, May 12, 2009 at 09:01:12PM +0800, Frederic Weisbecker wrote:
> On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> > On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> >
> > There are two possible challenges for the conversion:
> >
> > - One trick it does is to select different lists to traverse on
> >   different filter options. Will this be possible in the object
> >   tracing framework?
> 
> Yeah, I guess.

Great.

> 
> > - The file name lookup (last field) is the performance killer. Is it
> >   possible to skip the file name lookup when the filter fails on the
> >   leading fields?
> 
> Object collection is built on trace events, whose filters basically ignore
> a whole entry when it doesn't match. I'm not sure we can easily
> skip just one field.
> 
> But I guess we can do something about the performance...

OK, but it's not as important as the previous requirement, so it could
be the last thing to work on :)

> Could you send us the (sob'ed) patch you made which implements this?
> I could try to adapt it to object collection.

Attached for your reference. Be aware that I still have plans to
change it in non-trivial ways, and there is ongoing work by Nick (on
inode_lock) and Jens (on s_dirty) that can create merge conflicts.
So basically now is not the right time to do the adaptation.

However we can still do something to polish up the page object
collection under /debug/tracing/objects/mm/pages/. For example,
the timestamps and function name could be removed from the following
list :)

# tracer: nop                                                                                                                        
#                                                                                                                                    
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION                                                                                  
#              | |       |          |         |                                                                                      
           <...>-3743  [001]  3035.649769: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0                           
           <...>-3743  [001]  3044.176403: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0                           
           <...>-3743  [001]  3044.176407: dump_pages: pfn=2 flags=400 count=1 mapcount=0 index=0                           
           <...>-3743  [001]  3044.176408: dump_pages: pfn=3 flags=400 count=1 mapcount=0 index=0                           
           <...>-3743  [001]  3044.176409: dump_pages: pfn=4 flags=400 count=1 mapcount=0 index=0                           
           <...>-3743  [001]  3044.176409: dump_pages: pfn=5 flags=400 count=1 mapcount=0 index=0                           
           <...>-3743  [001]  3044.176410: dump_pages: pfn=6 flags=400 count=1 mapcount=0 index=0                           
           <...>-3743  [001]  3044.176410: dump_pages: pfn=7 flags=400 count=1 mapcount=0 index=0                           
           <...>-3743  [001]  3044.176411: dump_pages: pfn=8 flags=400 count=1 mapcount=0 index=0                           
           <...>-3743  [001]  3044.176411: dump_pages: pfn=9 flags=400 count=1 mapcount=0 index=0                           
           <...>-3743  [001]  3044.176412: dump_pages: pfn=10 flags=400 count=1 mapcount=0 index=0                          
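
(For comparison, a trivial user space reader already gets the same
per-pfn information straight from /proc/kpageflags, which is just an
array of u64 flag words, one per page frame - a minimal sketch:)

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                FILE *f = fopen("/proc/kpageflags", "rb");
                uint64_t flags;
                unsigned long pfn = 0;

                if (!f)
                        return 1;
                while (fread(&flags, sizeof(flags), 1, f) == 1)
                        printf("pfn=%lu flags=%llx\n", pfn++,
                               (unsigned long long)flags);
                fclose(f);
                return 0;
        }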

Thanks,
Fengguang

[-- Attachment #2: filecache-2.6.30.patch --]
[-- Type: text/x-diff, Size: 33820 bytes --]

--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -27,6 +27,7 @@ extern unsigned long max_mapnr;
 extern unsigned long num_physpages;
 extern void * high_memory;
 extern int page_cluster;
+extern char * const zone_names[];
 
 #ifdef CONFIG_SYSCTL
 extern int sysctl_legacy_va_layout;
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -104,7 +104,7 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
 
 EXPORT_SYMBOL(totalram_pages);
 
-static char * const zone_names[MAX_NR_ZONES] = {
+char * const zone_names[MAX_NR_ZONES] = {
 #ifdef CONFIG_ZONE_DMA
 	 "DMA",
 #endif
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1925,7 +1925,10 @@ char *__d_path(const struct path *path, 
 
 		if (dentry == root->dentry && vfsmnt == root->mnt)
 			break;
-		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
+		if (unlikely(!vfsmnt)) {
+			if (IS_ROOT(dentry))
+				break;
+		} else if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
 			/* Global root? */
 			if (vfsmnt->mnt_parent == vfsmnt) {
 				goto global_root;
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -564,7 +564,6 @@ out:
 }
 EXPORT_SYMBOL(radix_tree_tag_clear);
 
-#ifndef __KERNEL__	/* Only the test harness uses this at present */
 /**
  * radix_tree_tag_get - get a tag on a radix tree node
  * @root:		radix tree root
@@ -627,7 +626,6 @@ int radix_tree_tag_get(struct radix_tree
 	}
 }
 EXPORT_SYMBOL(radix_tree_tag_get);
-#endif
 
 /**
  *	radix_tree_next_hole    -    find the next hole (not-present entry)
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -84,6 +84,10 @@ static struct hlist_head *inode_hashtabl
  */
 DEFINE_SPINLOCK(inode_lock);
 
+EXPORT_SYMBOL(inode_in_use);
+EXPORT_SYMBOL(inode_unused);
+EXPORT_SYMBOL(inode_lock);
+
 /*
  * iprune_mutex provides exclusion between the kswapd or try_to_free_pages
  * icache shrinking path, and the umount path.  Without this exclusion,
@@ -110,6 +114,13 @@ static void wake_up_inode(struct inode *
 	wake_up_bit(&inode->i_state, __I_LOCK);
 }
 
+static inline void inode_created_by(struct inode *inode, struct task_struct *task)
+{
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	memcpy(inode->i_comm, task->comm, sizeof(task->comm));
+#endif
+}
+
 /**
  * inode_init_always - perform inode structure intialisation
  * @sb: superblock inode belongs to
@@ -147,7 +158,7 @@ struct inode *inode_init_always(struct s
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
 	inode->i_rdev = 0;
-	inode->dirtied_when = 0;
+	inode->dirtied_when = jiffies;
 
 	if (security_inode_alloc(inode))
 		goto out_free_inode;
@@ -188,6 +199,7 @@ struct inode *inode_init_always(struct s
 	}
 	inode->i_private = NULL;
 	inode->i_mapping = mapping;
+	inode_created_by(inode, current);
 
 	return inode;
 
@@ -276,6 +288,8 @@ void __iget(struct inode *inode)
 	inodes_stat.nr_unused--;
 }
 
+EXPORT_SYMBOL(__iget);
+
 /**
  * clear_inode - clear an inode
  * @inode: inode to clear
@@ -1459,6 +1473,16 @@ static void __wait_on_freeing_inode(stru
 	spin_lock(&inode_lock);
 }
 
+
+struct hlist_head *get_inode_hash_budget(unsigned long index)
+{
+	if (index >= (1 << i_hash_shift))
+		return NULL;
+
+	return inode_hashtable + index;
+}
+EXPORT_SYMBOL_GPL(get_inode_hash_budget);
+
 static __initdata unsigned long ihash_entries;
 static int __init set_ihash_entries(char *str)
 {
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -46,6 +46,9 @@
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
+EXPORT_SYMBOL(super_blocks);
+EXPORT_SYMBOL(sb_lock);
+
 /**
  *	alloc_super	-	create new superblock
  *	@type:	filesystem type superblock should belong to
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -262,6 +262,7 @@ unsigned long shrink_slab(unsigned long 
 	up_read(&shrinker_rwsem);
 	return ret;
 }
+EXPORT_SYMBOL(shrink_slab);
 
 /* Called without lock on whether page is mapped, so answer is unstable */
 static inline int page_mapping_inuse(struct page *page)
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -45,6 +45,7 @@ struct address_space swapper_space = {
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
 };
+EXPORT_SYMBOL_GPL(swapper_space);
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
 
--- linux-2.6.orig/Documentation/filesystems/proc.txt
+++ linux-2.6/Documentation/filesystems/proc.txt
@@ -260,6 +260,7 @@ Table 1-4: Kernel info in /proc
  driver	     Various drivers grouped here, currently rtc (2.4)
  execdomains Execdomains, related to security			(2.4)
  fb	     Frame Buffer devices				(2.4)
+ filecache   Query/drop in-memory file cache
  fs	     File system parameters, currently nfs/exports	(2.4)
  ide         Directory containing info about the IDE subsystem 
  interrupts  Interrupt usage                                   
@@ -450,6 +451,88 @@ varies by architecture and compile optio
 
 > cat /proc/meminfo
 
+..............................................................................
+
+filecache:
+
+Provides access to the in-memory file cache.
+
+To list an index of all cached files:
+
+    echo ls > /proc/filecache
+    cat /proc/filecache
+
+The output looks like:
+
+    # filecache 1.0
+    #      ino       size   cached cached%  state   refcnt  dev             file
+       1026334         91       92    100   --      66      03:02(hda2)     /lib/ld-2.3.6.so
+        233608       1242      972     78   --      66      03:02(hda2)     /lib/tls/libc-2.3.6.so
+         65203        651      476     73   --      1       03:02(hda2)     /bin/bash
+       1026445        261      160     61   --      10      03:02(hda2)     /lib/libncurses.so.5.5
+        235427         10       12    100   --      44      03:02(hda2)     /lib/tls/libdl-2.3.6.so
+
+FIELD	INTRO
+---------------------------------------------------------------------------
+ino	inode number
+size	inode size in KB
+cached	cached size in KB
+cached%	percent of file data cached
+state1	'-' clean; 'd' metadata dirty; 'D' data dirty
+state2	'-' unlocked; 'L' locked, normally indicates file being written out
+refcnt	file reference count, it's an in-kernel one, not exactly open count
+dev	major:minor numbers in hex, followed by a descriptive device name
+file	file path _inside_ the filesystem. There are several special names:
+	'(noname)':	the file name is not available
+	'(03:02)':	the file is a block device file of major:minor
+	'...(deleted)': the named file has been deleted from the disk
+
+To list the cached pages of a particular file:
+
+    echo /bin/bash > /proc/filecache
+    cat /proc/filecache
+
+    # file /bin/bash
+    # flags R:referenced A:active U:uptodate D:dirty W:writeback M:mmap
+    # idx   len     state   refcnt
+    0       36      RAU__M  3
+    36      1       RAU__M  2
+    37      8       RAU__M  3
+    45      2       RAU___  1
+    47      6       RAU__M  3
+    53      3       RAU__M  2
+    56      2       RAU__M  3
+
+FIELD	INTRO
+----------------------------------------------------------------------------
+idx	page index
+len	number of pages which are cached and share the same state
+state	page state of the flags listed in line two
+refcnt	page reference count
+
+Careful users may notice that the file name to be queried is remembered between
+commands. Internally, the module has a global variable to store the file name
+parameter, so that it can be inherited by newly opened /proc/filecache files.
+However, this can lead to interference between multiple queriers. The solution
+here is to obey a rule: only root may interactively change the file name
+parameter; normal users should go through scripts to access the interface.
+Scripts should do it by following the code example below:
+
+    filecache = open("/proc/filecache", "rw");
+    # avoid polluting the global parameter filename
+    filecache.write("set private");
+
+To instruct the kernel to drop clean caches, dentries and inodes from memory,
+causing that memory to become free:
+
+    # drop clean file data cache (i.e. file backed pagecache)
+    echo drop pagecache > /proc/filecache
+
+    # drop clean file metadata cache (i.e. dentries and inodes)
+    echo drop slabcache > /proc/filecache
+
+Note that the drop commands are non-destructive operations and dirty objects
+are not freeable; the user should run `sync' first.
 
 MemTotal:     16344972 kB
 MemFree:      13634064 kB
--- /dev/null
+++ linux-2.6/fs/proc/filecache.c
@@ -0,0 +1,1045 @@
+/*
+ * fs/proc/filecache.c
+ *
+ * Copyright (C) 2006, 2007 Fengguang Wu <wfg@mail.ustc.edu.cn>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/radix-tree.h>
+#include <linux/page-flags.h>
+#include <linux/pagevec.h>
+#include <linux/pagemap.h>
+#include <linux/vmalloc.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+#include <linux/parser.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <asm/uaccess.h>
+
+/*
+ * Increase minor version when new columns are added;
+ * Increase major version when existing columns are changed.
+ */
+#define FILECACHE_VERSION	"1.0"
+
+/* Internal buffer sizes. The larger, the more efficient. */
+#define SBUF_SIZE	(128<<10)
+#define IWIN_PAGE_ORDER	3
+#define IWIN_SIZE	((PAGE_SIZE<<IWIN_PAGE_ORDER) / sizeof(struct inode *))
+
+/*
+ * Session management.
+ *
+ * Each opened /proc/filecache file is associated with a session object.
+ * Also there is a global_session that maintains status across open()/close()
+ * (i.e. the lifetime of an opened file), so that a casual user can query the
+ * filecache via _multiple_ simple shell commands like
+ * 'echo cat /bin/bash > /proc/filecache; cat /proc/filecache'.
+ *
+ * session.query_file is the file whose cache info is to be queried.
+ * Its value determines what we get on read():
+ * 	- NULL: ii_*() called to show the inode index
+ * 	- filp: pg_*() called to show the page groups of a filp
+ *
+ * session.query_file is
+ * 	- cloned from global_session.query_file on open();
+ * 	- updated on write("cat filename");
+ * 	  note that the new file will also be saved in global_session.query_file if
+ * 	  session.private_session is false.
+ */
+
+struct session {
+	/* options */
+	int		private_session;
+	unsigned long	ls_options;
+	dev_t		ls_dev;
+
+	/* parameters */
+	struct file	*query_file;
+
+	/* seqfile pos */
+	pgoff_t		start_offset;
+	pgoff_t		next_offset;
+
+	/* inode at last pos */
+	struct {
+		unsigned long pos;
+		unsigned long state;
+		struct inode *inode;
+		struct inode *pinned_inode;
+	} ipos;
+
+	/* inode window */
+	struct {
+		unsigned long cursor;
+		unsigned long origin;
+		unsigned long size;
+		struct inode **inodes;
+	} iwin;
+};
+
+static struct session global_session;
+
+/*
+ * Session address is stored in proc_file->f_ra.start:
+ * we assume that there will be no readahead for proc_file.
+ */
+static struct session *get_session(struct file *proc_file)
+{
+	return (struct session *)proc_file->f_ra.start;
+}
+
+static void set_session(struct file *proc_file, struct session *s)
+{
+	BUG_ON(proc_file->f_ra.start);
+	proc_file->f_ra.start = (unsigned long)s;
+}
+
+static void update_global_file(struct session *s)
+{
+	if (s->private_session)
+		return;
+
+	if (global_session.query_file)
+		fput(global_session.query_file);
+
+	global_session.query_file = s->query_file;
+
+	if (global_session.query_file)
+		get_file(global_session.query_file);
+}
+
+/*
+ * Cases of the name:
+ * 1) NULL                (new session)
+ * 	s->query_file = global_session.query_file = 0;
+ * 2) ""                  (ls/la)
+ * 	s->query_file = global_session.query_file;
+ * 3) a regular file name (cat newfile)
+ * 	s->query_file = global_session.query_file = newfile;
+ */
+static int session_update_file(struct session *s, char *name)
+{
+	static DEFINE_MUTEX(mutex); /* protects global_session.query_file */
+	int err = 0;
+
+	mutex_lock(&mutex);
+
+	/*
+	 * We are to quit, or to list the cached files.
+	 * Reset *.query_file.
+	 */
+	if (!name) {
+		if (s->query_file) {
+			fput(s->query_file);
+			s->query_file = NULL;
+		}
+		update_global_file(s);
+		goto out;
+	}
+
+	/*
+	 * This is a new session.
+	 * Inherit options/parameters from global ones.
+	 */
+	if (name[0] == '\0') {
+		*s = global_session;
+		if (s->query_file)
+			get_file(s->query_file);
+		goto out;
+	}
+
+	/*
+	 * Open the named file.
+	 */
+	if (s->query_file)
+		fput(s->query_file);
+	s->query_file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
+	if (IS_ERR(s->query_file)) {
+		err = PTR_ERR(s->query_file);
+		s->query_file = NULL;
+	} else
+		update_global_file(s);
+
+out:
+	mutex_unlock(&mutex);
+
+	return err;
+}
+
+static struct session *session_create(void)
+{
+	struct session *s;
+	int err = 0;
+
+	s = kmalloc(sizeof(*s), GFP_KERNEL);
+	if (s)
+		err = session_update_file(s, "");
+	else
+		err = -ENOMEM;
+
+	return err ? ERR_PTR(err) : s;
+}
+
+static void session_release(struct session *s)
+{
+	if (s->ipos.pinned_inode)
+		iput(s->ipos.pinned_inode);
+	if (s->query_file)
+		fput(s->query_file);
+	kfree(s);
+}
+
+
+/*
+ * Listing of cached files.
+ *
+ * Usage:
+ * 		echo > /proc/filecache  # enter listing mode
+ * 		cat /proc/filecache     # get the file listing
+ */
+
+/* code style borrowed from ib_srp.c */
+enum {
+	LS_OPT_ERR	=	0,
+	LS_OPT_DIRTY	=	1 << 0,
+	LS_OPT_CLEAN	=	1 << 1,
+	LS_OPT_INUSE	=	1 << 2,
+	LS_OPT_EMPTY	=	1 << 3,
+	LS_OPT_ALL	=	1 << 4,
+	LS_OPT_DEV	=	1 << 5,
+};
+
+static match_table_t ls_opt_tokens = {
+	{ LS_OPT_DIRTY,		"dirty" 	},
+	{ LS_OPT_CLEAN,		"clean" 	},
+	{ LS_OPT_INUSE,		"inuse" 	},
+	{ LS_OPT_EMPTY,		"empty"		},
+	{ LS_OPT_ALL,		"all" 		},
+	{ LS_OPT_DEV,		"dev=%s"	},
+	{ LS_OPT_ERR,		NULL 		}
+};
+
+static int ls_parse_options(const char *buf, struct session *s)
+{
+	substring_t args[MAX_OPT_ARGS];
+	char *options, *sep_opt;
+	char *p;
+	int token;
+	int ret = 0;
+
+	if (!buf)
+		return 0;
+	options = kstrdup(buf, GFP_KERNEL);
+	if (!options)
+		return -ENOMEM;
+
+	s->ls_options = 0;
+	sep_opt = options;
+	while ((p = strsep(&sep_opt, " ")) != NULL) {
+		if (!*p)
+			continue;
+
+		token = match_token(p, ls_opt_tokens, args);
+
+		switch (token) {
+		case LS_OPT_DIRTY:
+		case LS_OPT_CLEAN:
+		case LS_OPT_INUSE:
+		case LS_OPT_EMPTY:
+		case LS_OPT_ALL:
+			s->ls_options |= token;
+			break;
+		case LS_OPT_DEV:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (*p == '/') {
+				struct kstat stat;
+				struct nameidata nd;
+				ret = path_lookup(p, LOOKUP_FOLLOW, &nd);
+				if (!ret) {
+					ret = vfs_getattr(nd.path.mnt,
+							  nd.path.dentry, &stat);
+					path_put(&nd.path);
+				}
+				if (!ret)
+					s->ls_dev = stat.rdev;
+			} else
+				s->ls_dev = simple_strtoul(p, NULL, 0);
+			kfree(p);
+			break;
+
+		default:
+			printk(KERN_WARNING "unknown parameter or missing value "
+			       "'%s' in ls command\n", p);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+out:
+	kfree(options);
+	return ret;
+}
+
+/*
+ * Add possible filters here.
+ * No permission check: we cannot verify the path's permissions anyway.
+ * We simply demand root privilege for accessing /proc/filecache.
+ */
+static int may_show_inode(struct session *s, struct inode *inode)
+{
+	if (!atomic_read(&inode->i_count))
+		return 0;
+	if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+		return 0;
+	if (!inode->i_mapping)
+		return 0;
+
+	if (s->ls_dev && s->ls_dev != inode->i_sb->s_dev)
+		return 0;
+
+	if (s->ls_options & LS_OPT_ALL)
+		return 1;
+
+	if (!(s->ls_options & LS_OPT_EMPTY) && !inode->i_mapping->nrpages)
+		return 0;
+
+	if ((s->ls_options & LS_OPT_DIRTY) && !(inode->i_state & I_DIRTY))
+		return 0;
+
+	if ((s->ls_options & LS_OPT_CLEAN) && (inode->i_state & I_DIRTY))
+		return 0;
+
+	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+	      S_ISLNK(inode->i_mode) || S_ISBLK(inode->i_mode)))
+		return 0;
+
+	return 1;
+}
+
+/*
+ * Full: there is (possibly) more data following the window.
+ */
+static int iwin_full(struct session *s)
+{
+	return !s->iwin.cursor ||
+		s->iwin.cursor > s->iwin.origin + s->iwin.size;
+}
+
+static int iwin_push(struct session *s, struct inode *inode)
+{
+	if (!may_show_inode(s, inode))
+		return 0;
+
+	s->iwin.cursor++;
+
+	if (s->iwin.size >= IWIN_SIZE)
+		return 1;
+
+	if (s->iwin.cursor > s->iwin.origin)
+		s->iwin.inodes[s->iwin.size++] = inode;
+	return 0;
+}
+
+/*
+ * Traverse the inode lists in order - newest first - and fill
+ * @s->iwin.inodes with the inodes positioned in [@pos, @pos+IWIN_SIZE).
+ * Called with inode_lock held.
+ */
+static int iwin_fill(struct session *s, unsigned long pos)
+{
+	struct inode *inode;
+	struct super_block *sb;
+
+	s->iwin.origin = pos;
+	s->iwin.cursor = 0;
+	s->iwin.size = 0;
+
+	/*
+	 * We have a cursor inode, clean and expected to be unchanged.
+	 */
+	if (s->ipos.inode && pos >= s->ipos.pos &&
+			!(s->ipos.state & I_DIRTY) &&
+			s->ipos.state == s->ipos.inode->i_state) {
+		inode = s->ipos.inode;
+		s->iwin.cursor = s->ipos.pos;
+		goto continue_from_saved;
+	}
+
+	if (s->ls_options & LS_OPT_CLEAN)
+		goto clean_inodes;
+
+	spin_lock(&sb_lock);
+	list_for_each_entry(sb, &super_blocks, s_list) {
+		if (s->ls_dev && s->ls_dev != sb->s_dev)
+			continue;
+
+		list_for_each_entry(inode, &sb->s_dirty, i_list) {
+			if (iwin_push(s, inode))
+				goto out_full_unlock;
+		}
+		list_for_each_entry(inode, &sb->s_io, i_list) {
+			if (iwin_push(s, inode))
+				goto out_full_unlock;
+		}
+	}
+	spin_unlock(&sb_lock);
+
+clean_inodes:
+	list_for_each_entry(inode, &inode_in_use, i_list) {
+		if (iwin_push(s, inode))
+			goto out_full;
+continue_from_saved:
+		;
+	}
+
+	if (s->ls_options & LS_OPT_INUSE)
+		return 0;
+
+	list_for_each_entry(inode, &inode_unused, i_list) {
+		if (iwin_push(s, inode))
+			goto out_full;
+	}
+
+	return 0;
+
+out_full_unlock:
+	spin_unlock(&sb_lock);
+out_full:
+	return 1;
+}
+
+static struct inode *iwin_inode(struct session *s, unsigned long pos)
+{
+	if ((iwin_full(s) && pos >= s->iwin.origin + s->iwin.size)
+			  || pos < s->iwin.origin)
+		iwin_fill(s, pos);
+
+	if (pos >= s->iwin.cursor)
+		return NULL;
+
+	s->ipos.pos = pos;
+	s->ipos.inode = s->iwin.inodes[pos - s->iwin.origin];
+	BUG_ON(!s->ipos.inode);
+	return s->ipos.inode;
+}
+
+static void show_inode(struct seq_file *m, struct inode *inode)
+{
+	char state[] = "--"; /* dirty, locked */
+	struct dentry *dentry;
+	loff_t size = i_size_read(inode);
+	unsigned long nrpages;
+	int percent;
+	int refcnt;
+	int shift;
+
+	if (!size)
+		size++;
+
+	if (inode->i_mapping)
+		nrpages = inode->i_mapping->nrpages;
+	else {
+		nrpages = 0;
+		WARN_ON(1);
+	}
+
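+	/*
+	 * Compute the cached percentage with overflow-safe arithmetic:
+	 * scale @size down in factors of 4096 until it is small enough,
+	 * and scale the cached page count by the same amount.
+	 */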
+	for (shift = 0; (size >> shift) > ULONG_MAX / 128; shift += 12)
+		;
+	percent = min(100UL, (((100 * nrpages) >> shift) << PAGE_CACHE_SHIFT) /
+						(unsigned long)(size >> shift));
+
+	if (inode->i_state & (I_DIRTY_DATASYNC|I_DIRTY_PAGES))
+		state[0] = 'D';
+	else if (inode->i_state & I_DIRTY_SYNC)
+		state[0] = 'd';
+
+	if (inode->i_state & I_LOCK)
+		state[1] = 'L';
+
+	refcnt = 0;
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+		refcnt += atomic_read(&dentry->d_count);
+	}
+
+	seq_printf(m, "%10lu %10llu %8lu %7d ",
+			inode->i_ino,
+			DIV_ROUND_UP(size, 1024),
+			nrpages << (PAGE_CACHE_SHIFT - 10),
+			percent);
+
+	seq_printf(m, "%6d %5s %9lu ",
+			refcnt,
+			state,
+			(jiffies - inode->dirtied_when) / HZ);
+
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	seq_printf(m, "%8u %-16s",
+			inode->i_access_count,
+			inode->i_comm);
+#endif
+
+	seq_printf(m, "%02x:%02x(%s)\t",
+			MAJOR(inode->i_sb->s_dev),
+			MINOR(inode->i_sb->s_dev),
+			inode->i_sb->s_id);
+
+	if (list_empty(&inode->i_dentry)) {
+		if (!atomic_read(&inode->i_count))
+			seq_puts(m, "(noname)\n");
+		else
+			seq_printf(m, "(%02x:%02x)\n",
+					imajor(inode), iminor(inode));
+	} else {
+		struct path path = {
+			.mnt = NULL,
+			.dentry = list_entry(inode->i_dentry.next,
+					     struct dentry, d_alias)
+		};
+
+		seq_path(m, &path, " \t\n\\");
+		seq_putc(m, '\n');
+	}
+}
+
+static int ii_show(struct seq_file *m, void *v)
+{
+	unsigned long index = *(loff_t *) v;
+	struct session *s = m->private;
+	struct inode *inode;
+
+	if (index == 0) {
+		seq_puts(m, "# filecache " FILECACHE_VERSION "\n");
+		seq_puts(m, "#      ino       size   cached cached% "
+				"refcnt state       age "
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+				"accessed  process         "
+#endif
+				"dev\t\tfile\n");
+	}
+
+	inode = iwin_inode(s, index);
+	show_inode(m, inode);
+
+	return 0;
+}
+
+static void *ii_start(struct seq_file *m, loff_t *pos)
+{
+	struct session *s = m->private;
+
+	s->iwin.size = 0;
+	s->iwin.inodes = (struct inode **)
+				__get_free_pages(GFP_KERNEL, IWIN_PAGE_ORDER);
+	if (!s->iwin.inodes)
+		return NULL;
+
+	spin_lock(&inode_lock);
+
+	return iwin_inode(s, *pos) ? pos : NULL;
+}
+
+static void *ii_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct session *s = m->private;
+
+	(*pos)++;
+	return iwin_inode(s, *pos) ? pos : NULL;
+}
+
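+/*
+ * On stop, pin the inode at the current position (and remember its
+ * i_state) so that the next iwin_fill() can resume the list walk from
+ * it instead of rescanning from the head, as long as it stays clean
+ * and unchanged.
+ */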
+static void ii_stop(struct seq_file *m, void *v)
+{
+	struct session *s = m->private;
+	struct inode *inode = s->ipos.inode;
+
+	if (!s->iwin.inodes)
+		return;
+
+	if (inode) {
+		__iget(inode);
+		s->ipos.state = inode->i_state;
+	}
+	spin_unlock(&inode_lock);
+
+	free_pages((unsigned long) s->iwin.inodes, IWIN_PAGE_ORDER);
+	s->iwin.inodes = NULL;
+	if (s->ipos.pinned_inode)
+		iput(s->ipos.pinned_inode);
+	s->ipos.pinned_inode = inode;
+}
+
+/*
+ * Listing of cached page ranges of a file.
+ *
+ * Usage:
+ * 		echo 'file name' > /proc/filecache
+ * 		cat /proc/filecache
+ */
+
+static unsigned long page_mask;
+#define PG_MMAP		PG_lru		/* reuse any non-relevant flag */
+#define PG_BUFFER	PG_swapcache	/* ditto */
+#define PG_DIRTY	PG_error	/* ditto */
+#define PG_WRITEBACK	PG_buddy	/* ditto */
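+/*
+ * The bits above are "faked": they are never set in page->flags itself,
+ * only in the snapshot returned by page_flags() below, so any page flag
+ * that is irrelevant to page cache pages can be reused safely.
+ */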
+
+/*
+ * Page state names, prefixed by their abbreviations.
+ */
+static struct {
+	unsigned long	mask;
+	const char     *name;
+	int		faked;
+} page_flag[] = {
+	{1 << PG_referenced,	"R:referenced",	0},
+	{1 << PG_active,	"A:active",	0},
+	{1 << PG_MMAP,		"M:mmap",	1},
+
+	{1 << PG_uptodate,	"U:uptodate",	0},
+	{1 << PG_dirty,		"D:dirty",	0},
+	{1 << PG_writeback,	"W:writeback",	0},
+	{1 << PG_reclaim,	"X:readahead",	0},
+
+	{1 << PG_private,	"P:private",	0},
+	{1 << PG_owner_priv_1,	"O:owner",	0},
+
+	{1 << PG_BUFFER,	"b:buffer",	1},
+	{1 << PG_DIRTY,		"d:dirty",	1},
+	{1 << PG_WRITEBACK,	"w:writeback",	1},
+};
+
+static unsigned long page_flags(struct page *page)
+{
+	unsigned long flags;
+	struct address_space *mapping = page_mapping(page);
+
+	flags = page->flags & page_mask;
+
+	if (page_mapped(page))
+		flags |= (1 << PG_MMAP);
+
+	if (page_has_buffers(page))
+		flags |= (1 << PG_BUFFER);
+
+	if (mapping) {
+		if (radix_tree_tag_get(&mapping->page_tree,
+					page_index(page),
+					PAGECACHE_TAG_WRITEBACK))
+			flags |= (1 << PG_WRITEBACK);
+
+		if (radix_tree_tag_get(&mapping->page_tree,
+					page_index(page),
+					PAGECACHE_TAG_DIRTY))
+			flags |= (1 << PG_DIRTY);
+	}
+
+	return flags;
+}
+
+static int pages_similar(struct page *page0, struct page *page)
+{
+	if (page_count(page0) != page_count(page))
+		return 0;
+
+	if (page_flags(page0) != page_flags(page))
+		return 0;
+
+	return 1;
+}
+
+static void show_range(struct seq_file *m, struct page *page, unsigned long len)
+{
+	int i;
+	unsigned long flags;
+
+	if (!m || !page)
+		return;
+
+	seq_printf(m, "%lu\t%lu\t", page->index, len);
+
+	flags = page_flags(page);
+	for (i = 0; i < ARRAY_SIZE(page_flag); i++)
+		seq_putc(m, (flags & page_flag[i].mask) ?
+					page_flag[i].name[0] : '_');
+
+	seq_printf(m, "\t%d\n", page_count(page));
+}
+
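+/*
+ * Show the cached page ranges of @mapping, starting at page index
+ * @start and emitting at most BATCH_LINES output lines per call.
+ * Returns the index to continue from (ULONG_MAX when the mapping is
+ * exhausted); pg_show() records it in session.next_offset.
+ */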
+#define BATCH_LINES	100
+static pgoff_t show_file_cache(struct seq_file *m,
+				struct address_space *mapping, pgoff_t start)
+{
+	int i;
+	int lines = 0;
+	pgoff_t len = 0;
+	struct pagevec pvec;
+	struct page *page;
+	struct page *page0 = NULL;
+
+	for (;;) {
+		pagevec_init(&pvec, 0);
+		pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
+				(void **)pvec.pages, start + len, PAGEVEC_SIZE);
+
+		if (pvec.nr == 0) {
+			show_range(m, page0, len);
+			start = ULONG_MAX;
+			goto out;
+		}
+
+		if (!page0)
+			page0 = pvec.pages[0];
+
+		for (i = 0; i < pvec.nr; i++) {
+			page = pvec.pages[i];
+
+			if (page->index == start + len &&
+					pages_similar(page0, page))
+				len++;
+			else {
+				show_range(m, page0, len);
+				page0 = page;
+				start = page->index;
+				len = 1;
+				if (++lines > BATCH_LINES)
+					goto out;
+			}
+		}
+	}
+
+out:
+	return start;
+}
+
+static int pg_show(struct seq_file *m, void *v)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+	pgoff_t offset;
+
+	if (!file)
+		return ii_show(m, v);
+
+	offset = *(loff_t *) v;
+
+	if (!offset) { /* print header */
+		int i;
+
+		seq_puts(m, "# file ");
+		seq_path(m, &file->f_path, " \t\n\\");
+
+		seq_puts(m, "\n# flags");
+		for (i = 0; i < ARRAY_SIZE(page_flag); i++)
+			seq_printf(m, " %s", page_flag[i].name);
+
+		seq_puts(m, "\n# idx\tlen\tstate\t\trefcnt\n");
+	}
+
+	s->start_offset = offset;
+	s->next_offset = show_file_cache(m, file->f_mapping, offset);
+
+	return 0;
+}
+
+static void *file_pos(struct file *file, loff_t *pos)
+{
+	loff_t size = i_size_read(file->f_mapping->host);
+	pgoff_t end = DIV_ROUND_UP(size, PAGE_CACHE_SIZE);
+	pgoff_t offset = *pos;
+
+	return offset < end ? pos : NULL;
+}
+
+static void *pg_start(struct seq_file *m, loff_t *pos)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+	pgoff_t offset = *pos;
+
+	if (!file)
+		return ii_start(m, pos);
+
+	rcu_read_lock();
+
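+	/*
+	 * seq_read() may advance *pos by one on its own, without going
+	 * through pg_next(); detect that and continue from the offset
+	 * where show_file_cache() stopped last time.
+	 */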
+	if (offset - s->start_offset == 1)
+		*pos = s->next_offset;
+	return file_pos(file, pos);
+}
+
+static void *pg_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+
+	if (!file)
+		return ii_next(m, v, pos);
+
+	*pos = s->next_offset;
+	return file_pos(file, pos);
+}
+
+static void pg_stop(struct seq_file *m, void *v)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+
+	if (!file)
+		return ii_stop(m, v);
+
+	rcu_read_unlock();
+}
+
+static struct seq_operations seq_filecache_op = {
+	.start	= pg_start,
+	.next	= pg_next,
+	.stop	= pg_stop,
+	.show	= pg_show,
+};
+
+/*
+ * Implement the manual drop-all-pagecache function: walk the inode hash
+ * table bucket by bucket, grab the in-use inodes that have cached pages,
+ * then invalidate their clean pages with inode_lock dropped.
+ */
+
+#define MAX_INODES	IWIN_SIZE	/* match the IWIN_PAGE_ORDER allocation below */
+static int drop_pagecache(void)
+{
+	struct hlist_head *head;
+	struct hlist_node *node;
+	struct inode *inode;
+	struct inode **inodes;
+	unsigned long i, j, k;
+	int err = 0;
+
+	inodes = (struct inode **)__get_free_pages(GFP_KERNEL, IWIN_PAGE_ORDER);
+	if (!inodes)
+		return -ENOMEM;
+
+	for (i = 0; (head = get_inode_hash_budget(i)); i++) {
+		if (hlist_empty(head))
+			continue;
+
+		j = 0;
+		cond_resched();
+
+		/*
+		 * Grab some inodes.
+		 */
+		spin_lock(&inode_lock);
+		hlist_for_each(node, head) {
+			inode = hlist_entry(node, struct inode, i_hash);
+			if (!atomic_read(&inode->i_count))
+				continue;
+			if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+				continue;
+			if (!inode->i_mapping || !inode->i_mapping->nrpages)
+				continue;
+			__iget(inode);
+			inodes[j++] = inode;
+			if (j >= MAX_INODES)
+				break;
+		}
+		spin_unlock(&inode_lock);
+
+		/*
+		 * Free clean pages.
+		 */
+		for (k = 0; k < j; k++) {
+			inode = inodes[k];
+			invalidate_mapping_pages(inode->i_mapping, 0, ~1);
+			iput(inode);
+		}
+
+		/*
+		 * Simply ignore the remaining inodes.
+		 */
+		if (j >= MAX_INODES && !err) {
+			printk(KERN_WARNING
+				"Too many collisions in the inode hash table.\n"
+				"Please boot with a larger ihash_entries=XXX.\n");
+			err = -EAGAIN;
+		}
+	}
+
+	free_pages((unsigned long) inodes, IWIN_PAGE_ORDER);
+	return err;
+}
+
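+/*
+ * Press the slab shrinkers repeatedly until they report (almost)
+ * nothing left to reclaim; the constants are simply "scan a lot"
+ * values.
+ */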
+static void drop_slabcache(void)
+{
+	int nr_objects;
+
+	do {
+		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+	} while (nr_objects > 10);
+}
+
+/*
+ * Proc file operations.
+ */
+
+static int filecache_open(struct inode *inode, struct file *proc_file)
+{
+	struct seq_file *m;
+	struct session *s;
+	unsigned size;
+	char *buf = NULL;
+	int ret;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENOENT;
+
+	s = session_create();
+	if (IS_ERR(s)) {
+		ret = PTR_ERR(s);
+		s = NULL;
+		goto out;
+	}
+	set_session(proc_file, s);
+
+	size = SBUF_SIZE;
+	buf = kmalloc(size, GFP_KERNEL);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = seq_open(proc_file, &seq_filecache_op);
+	if (!ret) {
+		m = proc_file->private_data;
+		m->private = s;
+		m->buf = buf;
+		m->size = size;
+	}
+
+out:
+	if (ret) {
+		if (s)
+			session_release(s);
+		kfree(buf);
+		module_put(THIS_MODULE);
+	}
+	return ret;
+}
+
+static int filecache_release(struct inode *inode, struct file *proc_file)
+{
+	struct session *s = get_session(proc_file);
+	int ret;
+
+	session_release(s);
+	ret = seq_release(inode, proc_file);
+	module_put(THIS_MODULE);
+	return ret;
+}
+
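+/*
+ * Commands accepted by write():
+ *	"set private"		detach this session from the global one
+ *	"cat <filename>"	query the cached pages of <filename>
+ *	"ls [options]"		list cached files (see ls_opt_tokens[])
+ *	"drop pagecache"	drop clean page cache
+ *	"drop slabcache"	drop clean dentries and inodes
+ *	"<filename>"		shorthand for "cat <filename>"
+ */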
+static ssize_t filecache_write(struct file *proc_file, const char __user *buffer,
+			size_t count, loff_t *ppos)
+{
+	struct session *s;
+	char *name;
+	int err = 0;
+
+	if (count >= PATH_MAX + 5)
+		return -ENAMETOOLONG;
+
+	name = kmalloc(count+1, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	if (copy_from_user(name, buffer, count)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	/* strip the optional newline */
+	if (count && name[count-1] == '\n')
+		name[count-1] = '\0';
+	else
+		name[count] = '\0';
+
+	s = get_session(proc_file);
+	if (!strcmp(name, "set private")) {
+		s->private_session = 1;
+		goto out;
+	}
+
+	if (!strncmp(name, "cat ", 4)) {
+		err = session_update_file(s, name+4);
+		goto out;
+	}
+
+	if (!strncmp(name, "ls", 2)) {
+		err = session_update_file(s, NULL);
+		if (!err)
+			err = ls_parse_options(name+2, s);
+		if (!err && !s->private_session) {
+			global_session.ls_dev = s->ls_dev;
+			global_session.ls_options = s->ls_options;
+		}
+		goto out;
+	}
+
+	if (!strncmp(name, "drop pagecache", 14)) {
+		err = drop_pagecache();
+		goto out;
+	}
+
+	if (!strncmp(name, "drop slabcache", 14)) {
+		drop_slabcache();
+		goto out;
+	}
+
+	/* Anything else is taken as the name of a file to query. */
+	err = session_update_file(s, name);
+
+out:
+	kfree(name);
+
+	return err ? err : count;
+}
+
+static struct file_operations proc_filecache_fops = {
+	.owner		= THIS_MODULE,
+	.open		= filecache_open,
+	.release	= filecache_release,
+	.write		= filecache_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+};
+
+
+static int __init filecache_init(void)
+{
+	int i;
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("filecache", 0600, NULL);
+	if (!entry)
+		return -ENOMEM;
+	entry->proc_fops = &proc_filecache_fops;
+
+	for (page_mask = i = 0; i < ARRAY_SIZE(page_flag); i++)
+		if (!page_flag[i].faked)
+			page_mask |= page_flag[i].mask;
+
+	return 0;
+}
+
+static void __exit filecache_exit(void)
+{
+	remove_proc_entry("filecache", NULL);
+	if (global_session.query_file)
+		fput(global_session.query_file);
+}
+
+MODULE_AUTHOR("Fengguang Wu <wfg@mail.ustc.edu.cn>");
+MODULE_LICENSE("GPL");
+
+module_init(filecache_init);
+module_exit(filecache_exit);
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -775,6 +775,11 @@ struct inode {
 	void			*i_security;
 #endif
 	void			*i_private; /* fs or device private pointer */
+
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	unsigned int		i_access_count;	/* opened how many times? */
+	char			i_comm[16];	/* opened first by which app? */
+#endif
 };
 
 /*
@@ -860,6 +865,13 @@ static inline unsigned imajor(const stru
 	return MAJOR(inode->i_rdev);
 }
 
+static inline void inode_accessed(struct inode *inode)
+{
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	inode->i_access_count++;
+#endif
+}
+
 extern struct block_device *I_BDEV(struct inode *inode);
 
 struct fown_struct {
@@ -2171,6 +2183,7 @@ extern void remove_inode_hash(struct ino
 static inline void insert_inode_hash(struct inode *inode) {
 	__insert_inode_hash(inode, inode->i_ino);
 }
+struct hlist_head *get_inode_hash_budget(unsigned long index);
 
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -842,6 +842,7 @@ static struct file *__dentry_open(struct
 			goto cleanup_all;
 	}
 
+	inode_accessed(inode);
 	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
 
 	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
--- linux-2.6.orig/fs/Kconfig
+++ linux-2.6/fs/Kconfig
@@ -265,4 +265,34 @@ endif
 source "fs/nls/Kconfig"
 source "fs/dlm/Kconfig"
 
+config PROC_FILECACHE
+	tristate "/proc/filecache support"
+	default m
+	depends on PROC_FS
+	help
+	  This option creates a file /proc/filecache which enables one to
+	  query/drop the cached files in memory.
+
+	  A quick start guide:
+
+	  # echo 'ls' > /proc/filecache
+	  # head /proc/filecache
+
+	  # echo 'cat /bin/bash' > /proc/filecache
+	  # head /proc/filecache
+
+	  # echo 'drop pagecache' > /proc/filecache
+	  # echo 'drop slabcache' > /proc/filecache
+
+	  For more details, please check Documentation/filesystems/proc.txt .
+
+	  It can be a handy tool for sysadmins and desktop users.
+
+config PROC_FILECACHE_EXTRAS
+	bool "track extra states"
+	default y
+	depends on PROC_FILECACHE
+	help
+	  Track extra states that cost a little more time/space.
+
 endmenu
--- linux-2.6.orig/fs/proc/Makefile
+++ linux-2.6/fs/proc/Makefile
@@ -2,7 +2,8 @@
 # Makefile for the Linux proc filesystem routines.
 #
 
-obj-$(CONFIG_PROC_FS) += proc.o
+obj-$(CONFIG_PROC_FS)		+= proc.o
+obj-$(CONFIG_PROC_FILECACHE)	+= filecache.o
 
 proc-y			:= nommu.o task_nommu.o
 proc-$(CONFIG_MMU)	:= mmu.o task_mmu.o

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
  2009-05-17 13:36                               ` Wu Fengguang
@ 2009-05-17 13:55                                   ` Frederic Weisbecker
  2009-05-18 11:44                                   ` KOSAKI Motohiro
  1 sibling, 0 replies; 137+ messages in thread
From: Frederic Weisbecker @ 2009-05-17 13:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Ingo Molnar, Li Zefan, Tom Zanussi, KOSAKI Motohiro,
	Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Sun, May 17, 2009 at 09:36:59PM +0800, Wu Fengguang wrote:
> On Tue, May 12, 2009 at 09:01:12PM +0800, Frederic Weisbecker wrote:
> > On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> > > On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> > >
> > > There are two possible challenges for the conversion:
> > >
> > > - One trick it does is to select different lists to traverse
> > >   depending on the filter options. Will this be possible in the
> > >   object tracing framework?
> > 
> > Yeah, I guess.
> 
> Great.
> 
> > 
> > > - The file name lookup (last field) is the performance killer. Is it
> > >   possible to skip the file name lookup when the filter fails on the
> > >   leading fields?
> > 
> > object collection relies on trace events, where filters basically
> > ignore a whole entry in case of non-matching. Not sure if we can
> > easily ignore only one field.
> > 
> > But I guess we can do something about the performance...
> 
> OK, but it's not as important as the previous requirement, so it could
> be the last thing to work on :)
> 
> > Could you send us the (sob'ed) patch you made which implements this.
> > I could try to adapt it to object collection.
> 
> Attached for your reference. Be aware that I still have plans to
> change it in non-trivial ways, and there is ongoing work by Nick (on
> inode_lock) and Jens (on s_dirty) that can create merge conflicts.
> So basically it is not the right time to do the adaptation.


Ah ok, so I will wait a bit :-)

 
> However we can still do something to polish up the page object
> collection under /debug/tracing/objects/mm/pages/. For example,
> the timestamps and function name could be removed from the following
> list :)
> 
> # tracer: nop                                                                                                                        
> #                                                                                                                                    
> #           TASK-PID    CPU#    TIMESTAMP  FUNCTION                                                                                  
> #              | |       |          |         |                                                                                      
>            <...>-3743  [001]  3035.649769: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0                           
>            <...>-3743  [001]  3044.176403: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0                           
>            <...>-3743  [001]  3044.176407: dump_pages: pfn=2 flags=400 count=1 mapcount=0 index=0                           
>            <...>-3743  [001]  3044.176408: dump_pages: pfn=3 flags=400 count=1 mapcount=0 index=0                           
>            <...>-3743  [001]  3044.176409: dump_pages: pfn=4 flags=400 count=1 mapcount=0 index=0                           
>            <...>-3743  [001]  3044.176409: dump_pages: pfn=5 flags=400 count=1 mapcount=0 index=0                           
>            <...>-3743  [001]  3044.176410: dump_pages: pfn=6 flags=400 count=1 mapcount=0 index=0                           
>            <...>-3743  [001]  3044.176410: dump_pages: pfn=7 flags=400 count=1 mapcount=0 index=0                           
>            <...>-3743  [001]  3044.176411: dump_pages: pfn=8 flags=400 count=1 mapcount=0 index=0                           
>            <...>-3743  [001]  3044.176411: dump_pages: pfn=9 flags=400 count=1 mapcount=0 index=0                           
>            <...>-3743  [001]  3044.176412: dump_pages: pfn=10 flags=400 count=1 mapcount=0 index=0                          


echo nocontext-info > /debug/tracing/trace_options :-)
But then you'll have only the event name and the page specifics. It's not
really the function but rather the name of the event, which is useful for
distinguishing multiple events in a trace.

Hmm, maybe it's not that useful in an object dump...

Thanks.



> Thanks,
> Fengguang

> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -27,6 +27,7 @@ extern unsigned long max_mapnr;
>  extern unsigned long num_physpages;
>  extern void * high_memory;
>  extern int page_cluster;
> +extern char * const zone_names[];
>  
>  #ifdef CONFIG_SYSCTL
>  extern int sysctl_legacy_va_layout;
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -104,7 +104,7 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
>  
>  EXPORT_SYMBOL(totalram_pages);
>  
> -static char * const zone_names[MAX_NR_ZONES] = {
> +char * const zone_names[MAX_NR_ZONES] = {
>  #ifdef CONFIG_ZONE_DMA
>  	 "DMA",
>  #endif
> --- linux-2.6.orig/fs/dcache.c
> +++ linux-2.6/fs/dcache.c
> @@ -1925,7 +1925,10 @@ char *__d_path(const struct path *path, 
>  
>  		if (dentry == root->dentry && vfsmnt == root->mnt)
>  			break;
> -		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
> +		if (unlikely(!vfsmnt)) {
> +			if (IS_ROOT(dentry))
> +				break;
> +		} else if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
>  			/* Global root? */
>  			if (vfsmnt->mnt_parent == vfsmnt) {
>  				goto global_root;
> --- linux-2.6.orig/lib/radix-tree.c
> +++ linux-2.6/lib/radix-tree.c
> @@ -564,7 +564,6 @@ out:
>  }
>  EXPORT_SYMBOL(radix_tree_tag_clear);
>  
> -#ifndef __KERNEL__	/* Only the test harness uses this at present */
>  /**
>   * radix_tree_tag_get - get a tag on a radix tree node
>   * @root:		radix tree root
> @@ -627,7 +626,6 @@ int radix_tree_tag_get(struct radix_tree
>  	}
>  }
>  EXPORT_SYMBOL(radix_tree_tag_get);
> -#endif
>  
>  /**
>   *	radix_tree_next_hole    -    find the next hole (not-present entry)
> --- linux-2.6.orig/fs/inode.c
> +++ linux-2.6/fs/inode.c
> @@ -84,6 +84,10 @@ static struct hlist_head *inode_hashtabl
>   */
>  DEFINE_SPINLOCK(inode_lock);
>  
> +EXPORT_SYMBOL(inode_in_use);
> +EXPORT_SYMBOL(inode_unused);
> +EXPORT_SYMBOL(inode_lock);
> +
>  /*
>   * iprune_mutex provides exclusion between the kswapd or try_to_free_pages
>   * icache shrinking path, and the umount path.  Without this exclusion,
> @@ -110,6 +114,13 @@ static void wake_up_inode(struct inode *
>  	wake_up_bit(&inode->i_state, __I_LOCK);
>  }
>  
> +static inline void inode_created_by(struct inode *inode, struct task_struct *task)
> +{
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> +	memcpy(inode->i_comm, task->comm, sizeof(task->comm));
> +#endif
> +}
> +
>  /**
>   * inode_init_always - perform inode structure intialisation
>   * @sb: superblock inode belongs to
> @@ -147,7 +158,7 @@ struct inode *inode_init_always(struct s
>  	inode->i_bdev = NULL;
>  	inode->i_cdev = NULL;
>  	inode->i_rdev = 0;
> -	inode->dirtied_when = 0;
> +	inode->dirtied_when = jiffies;
>  
>  	if (security_inode_alloc(inode))
>  		goto out_free_inode;
> @@ -188,6 +199,7 @@ struct inode *inode_init_always(struct s
>  	}
>  	inode->i_private = NULL;
>  	inode->i_mapping = mapping;
> +	inode_created_by(inode, current);
>  
>  	return inode;
>  
> @@ -276,6 +288,8 @@ void __iget(struct inode *inode)
>  	inodes_stat.nr_unused--;
>  }
>  
> +EXPORT_SYMBOL(__iget);
> +
>  /**
>   * clear_inode - clear an inode
>   * @inode: inode to clear
> @@ -1459,6 +1473,16 @@ static void __wait_on_freeing_inode(stru
>  	spin_lock(&inode_lock);
>  }
>  
> +
> +struct hlist_head * get_inode_hash_budget(unsigned long index)
> +{
> +       if (index >= (1 << i_hash_shift))
> +               return NULL;
> +
> +       return inode_hashtable + index;
> +}
> +EXPORT_SYMBOL_GPL(get_inode_hash_budget);
> +
>  static __initdata unsigned long ihash_entries;
>  static int __init set_ihash_entries(char *str)
>  {
> --- linux-2.6.orig/fs/super.c
> +++ linux-2.6/fs/super.c
> @@ -46,6 +46,9 @@
>  LIST_HEAD(super_blocks);
>  DEFINE_SPINLOCK(sb_lock);
>  
> +EXPORT_SYMBOL(super_blocks);
> +EXPORT_SYMBOL(sb_lock);
> +
>  /**
>   *	alloc_super	-	create new superblock
>   *	@type:	filesystem type superblock should belong to
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -262,6 +262,7 @@ unsigned long shrink_slab(unsigned long 
>  	up_read(&shrinker_rwsem);
>  	return ret;
>  }
> +EXPORT_SYMBOL(shrink_slab);
>  
>  /* Called without lock on whether page is mapped, so answer is unstable */
>  static inline int page_mapping_inuse(struct page *page)
> --- linux-2.6.orig/mm/swap_state.c
> +++ linux-2.6/mm/swap_state.c
> @@ -45,6 +45,7 @@ struct address_space swapper_space = {
>  	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
>  	.backing_dev_info = &swap_backing_dev_info,
>  };
> +EXPORT_SYMBOL_GPL(swapper_space);
>  
>  #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
>  
> --- linux-2.6.orig/Documentation/filesystems/proc.txt
> +++ linux-2.6/Documentation/filesystems/proc.txt
> @@ -260,6 +260,7 @@ Table 1-4: Kernel info in /proc
>   driver	     Various drivers grouped here, currently rtc (2.4)
>   execdomains Execdomains, related to security			(2.4)
>   fb	     Frame Buffer devices				(2.4)
> + filecache   Query/drop in-memory file cache
>   fs	     File system parameters, currently nfs/exports	(2.4)
>   ide         Directory containing info about the IDE subsystem 
>   interrupts  Interrupt usage                                   
> @@ -450,6 +451,88 @@ varies by architecture and compile optio
>  
>  > cat /proc/meminfo
>  
> +..............................................................................
> +
> +filecache:
> +
> +Provides access to the in-memory file cache.
> +
> +To list an index of all cached files:
> +
> +    echo ls > /proc/filecache
> +    cat /proc/filecache
> +
> +The output looks like:
> +
> +    # filecache 1.0
> +    #      ino       size   cached cached%  state   refcnt  dev             file
> +       1026334         91       92    100   --      66      03:02(hda2)     /lib/ld-2.3.6.so
> +        233608       1242      972     78   --      66      03:02(hda2)     /lib/tls/libc-2.3.6.so
> +         65203        651      476     73   --      1       03:02(hda2)     /bin/bash
> +       1026445        261      160     61   --      10      03:02(hda2)     /lib/libncurses.so.5.5
> +        235427         10       12    100   --      44      03:02(hda2)     /lib/tls/libdl-2.3.6.so
> +
> +FIELD	INTRO
> +---------------------------------------------------------------------------
> +ino	inode number
> +size	inode size in KB
> +cached	cached size in KB
> +cached%	percent of file data cached
> +state1	'-' clean; 'd' metadata dirty; 'D' data dirty
> +state2	'-' unlocked; 'L' locked, normally indicates file being written out
> +refcnt	file reference count, it's an in-kernel one, not exactly open count
> +dev	major:minor numbers in hex, followed by a descriptive device name
> +file	file path _inside_ the filesystem. There are several special names:
> +	'(noname)':	the file name is not available
> +	'(03:02)':	the file is a block device file of major:minor
> +	'...(deleted)': the named file has been deleted from the disk
> +
> +To list the cached pages of a perticular file:
> +
> +    echo /bin/bash > /proc/filecache
> +    cat /proc/filecache
> +
> +    # file /bin/bash
> +    # flags R:referenced A:active U:uptodate D:dirty W:writeback M:mmap
> +    # idx   len     state   refcnt
> +    0       36      RAU__M  3
> +    36      1       RAU__M  2
> +    37      8       RAU__M  3
> +    45      2       RAU___  1
> +    47      6       RAU__M  3
> +    53      3       RAU__M  2
> +    56      2       RAU__M  3
> +
> +FIELD	INTRO
> +----------------------------------------------------------------------------
> +idx	page index
> +len	number of pages which are cached and share the same state
> +state	page state of the flags listed in line two
> +refcnt	page reference count
> +
> +Careful users may notice that the file name to be queried is remembered between
> +commands. Internally, the module has a global variable to store the file name
> +parameter, so that it can be inherited by newly opened /proc/filecache file.
> +However it can lead to interference for multiple queriers. The solution here
> +is to obey a rule: only root can interactively change the file name parameter;
> +normal users must go for scripts to access the interface. Scripts should do it
> +by following the code example below:
> +
> +    filecache = open("/proc/filecache", "rw");
> +    # avoid polluting the global parameter filename
> +    filecache.write("set private");
> +
> +To instruct the kernel to drop clean caches, dentries and inodes from memory,
> +causing that memory to become free:
> +
> +    # drop clean file data cache (i.e. file backed pagecache)
> +    echo drop pagecache > /proc/filecache
> +
> +    # drop clean file metadata cache (i.e. dentries and inodes)
> +    echo drop slabcache > /proc/filecache
> +
> +Note that the drop commands are non-destructive operations and dirty objects
> +are not freeable, the user should run `sync' first.
>  
>  MemTotal:     16344972 kB
>  MemFree:      13634064 kB
> --- /dev/null
> +++ linux-2.6/fs/proc/filecache.c
> @@ -0,0 +1,1045 @@
> +/*
> + * fs/proc/filecache.c
> + *
> + * Copyright (C) 2006, 2007 Fengguang Wu <wfg@mail.ustc.edu.cn>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/radix-tree.h>
> +#include <linux/page-flags.h>
> +#include <linux/pagevec.h>
> +#include <linux/pagemap.h>
> +#include <linux/vmalloc.h>
> +#include <linux/writeback.h>
> +#include <linux/buffer_head.h>
> +#include <linux/parser.h>
> +#include <linux/proc_fs.h>
> +#include <linux/seq_file.h>
> +#include <linux/file.h>
> +#include <linux/namei.h>
> +#include <linux/module.h>
> +#include <asm/uaccess.h>
> +
> +/*
> + * Increase minor version when new columns are added;
> + * Increase major version when existing columns are changed.
> + */
> +#define FILECACHE_VERSION	"1.0"
> +
> +/* Internal buffer sizes. The larger the more effcient. */
> +#define SBUF_SIZE	(128<<10)
> +#define IWIN_PAGE_ORDER	3
> +#define IWIN_SIZE	((PAGE_SIZE<<IWIN_PAGE_ORDER) / sizeof(struct inode *))
> +
> +/*
> + * Session management.
> + *
> + * Each opened /proc/filecache file is assiocated with a session object.
> + * Also there is a global_session that maintains status across open()/close()
> + * (i.e. the lifetime of an opened file), so that a casual user can query the
> + * filecache via _multiple_ simple shell commands like
> + * 'echo cat /bin/bash > /proc/filecache; cat /proc/filecache'.
> + *
> + * session.query_file is the file whose cache info is to be queried.
> + * Its value determines what we get on read():
> + * 	- NULL: ii_*() called to show the inode index
> + * 	- filp: pg_*() called to show the page groups of a filp
> + *
> + * session.query_file is
> + * 	- cloned from global_session.query_file on open();
> + * 	- updated on write("cat filename");
> + * 	  note that the new file will also be saved in global_session.query_file if
> + * 	  session.private_session is false.
> + */
> +
> +struct session {
> +	/* options */
> +	int		private_session;
> +	unsigned long	ls_options;
> +	dev_t		ls_dev;
> +
> +	/* parameters */
> +	struct file	*query_file;
> +
> +	/* seqfile pos */
> +	pgoff_t		start_offset;
> +	pgoff_t		next_offset;
> +
> +	/* inode at last pos */
> +	struct {
> +		unsigned long pos;
> +		unsigned long state;
> +		struct inode *inode;
> +		struct inode *pinned_inode;
> +	} ipos;
> +
> +	/* inode window */
> +	struct {
> +		unsigned long cursor;
> +		unsigned long origin;
> +		unsigned long size;
> +		struct inode **inodes;
> +	} iwin;
> +};
> +
> +static struct session global_session;
> +
> +/*
> + * Session address is stored in proc_file->f_ra.start:
> + * we assume that there will be no readahead for proc_file.
> + */
> +static struct session *get_session(struct file *proc_file)
> +{
> +	return (struct session *)proc_file->f_ra.start;
> +}
> +
> +static void set_session(struct file *proc_file, struct session *s)
> +{
> +	BUG_ON(proc_file->f_ra.start);
> +	proc_file->f_ra.start = (unsigned long)s;
> +}
> +
> +static void update_global_file(struct session *s)
> +{
> +	if (s->private_session)
> +		return;
> +
> +	if (global_session.query_file)
> +		fput(global_session.query_file);
> +
> +	global_session.query_file = s->query_file;
> +
> +	if (global_session.query_file)
> +		get_file(global_session.query_file);
> +}
> +
> +/*
> + * Cases of the name:
> + * 1) NULL                (new session)
> + * 	s->query_file = global_session.query_file = 0;
> + * 2) ""                  (ls/la)
> + * 	s->query_file = global_session.query_file;
> + * 3) a regular file name (cat newfile)
> + * 	s->query_file = global_session.query_file = newfile;
> + */
> +static int session_update_file(struct session *s, char *name)
> +{
> +	static DEFINE_MUTEX(mutex); /* protects global_session.query_file */
> +	int err = 0;
> +
> +	mutex_lock(&mutex);
> +
> +	/*
> +	 * We are to quit, or to list the cached files.
> +	 * Reset *.query_file.
> +	 */
> +	if (!name) {
> +		if (s->query_file) {
> +			fput(s->query_file);
> +			s->query_file = NULL;
> +		}
> +		update_global_file(s);
> +		goto out;
> +	}
> +
> +	/*
> +	 * This is a new session.
> +	 * Inherit options/parameters from global ones.
> +	 */
> +	if (name[0] == '\0') {
> +		*s = global_session;
> +		if (s->query_file)
> +			get_file(s->query_file);
> +		goto out;
> +	}
> +
> +	/*
> +	 * Open the named file.
> +	 */
> +	if (s->query_file)
> +		fput(s->query_file);
> +	s->query_file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
> +	if (IS_ERR(s->query_file)) {
> +		err = PTR_ERR(s->query_file);
> +		s->query_file = NULL;
> +	} else
> +		update_global_file(s);
> +
> +out:
> +	mutex_unlock(&mutex);
> +
> +	return err;
> +}
> +
> +static struct session *session_create(void)
> +{
> +	struct session *s;
> +	int err = 0;
> +
> +	s = kmalloc(sizeof(*s), GFP_KERNEL);
> +	if (s)
> +		err = session_update_file(s, "");
> +	else
> +		err = -ENOMEM;
> +
> +	return err ? ERR_PTR(err) : s;
> +}
> +
> +static void session_release(struct session *s)
> +{
> +	if (s->ipos.pinned_inode)
> +		iput(s->ipos.pinned_inode);
> +	if (s->query_file)
> +		fput(s->query_file);
> +	kfree(s);
> +}
> +
> +
> +/*
> + * Listing of cached files.
> + *
> + * Usage:
> + * 		echo > /proc/filecache  # enter listing mode
> + * 		cat /proc/filecache     # get the file listing
> + */
> +
> +/* code style borrowed from ib_srp.c */
> +enum {
> +	LS_OPT_ERR	=	0,
> +	LS_OPT_DIRTY	=	1 << 0,
> +	LS_OPT_CLEAN	=	1 << 1,
> +	LS_OPT_INUSE	=	1 << 2,
> +	LS_OPT_EMPTY	=	1 << 3,
> +	LS_OPT_ALL	=	1 << 4,
> +	LS_OPT_DEV	=	1 << 5,
> +};
> +
> +static match_table_t ls_opt_tokens = {
> +	{ LS_OPT_DIRTY,		"dirty" 	},
> +	{ LS_OPT_CLEAN,		"clean" 	},
> +	{ LS_OPT_INUSE,		"inuse" 	},
> +	{ LS_OPT_EMPTY,		"empty"		},
> +	{ LS_OPT_ALL,		"all" 		},
> +	{ LS_OPT_DEV,		"dev=%s"	},
> +	{ LS_OPT_ERR,		NULL 		}
> +};
> +
> +static int ls_parse_options(const char *buf, struct session *s)
> +{
> +	substring_t args[MAX_OPT_ARGS];
> +	char *options, *sep_opt;
> +	char *p;
> +	int token;
> +	int ret = 0;
> +
> +	if (!buf)
> +		return 0;
> +	options = kstrdup(buf, GFP_KERNEL);
> +	if (!options)
> +		return -ENOMEM;
> +
> +	s->ls_options = 0;
> +	sep_opt = options;
> +	while ((p = strsep(&sep_opt, " ")) != NULL) {
> +		if (!*p)
> +			continue;
> +
> +		token = match_token(p, ls_opt_tokens, args);
> +
> +		switch (token) {
> +		case LS_OPT_DIRTY:
> +		case LS_OPT_CLEAN:
> +		case LS_OPT_INUSE:
> +		case LS_OPT_EMPTY:
> +		case LS_OPT_ALL:
> +			s->ls_options |= token;
> +			break;
> +		case LS_OPT_DEV:
> +			p = match_strdup(args);
> +			if (!p) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
> +			if (*p == '/') {
> +				struct kstat stat;
> +				struct nameidata nd;
> +				ret = path_lookup(p, LOOKUP_FOLLOW, &nd);
> +				if (!ret)
> +					ret = vfs_getattr(nd.path.mnt,
> +							  nd.path.dentry, &stat);
> +				if (!ret)
> +					s->ls_dev = stat.rdev;
> +			} else
> +				s->ls_dev = simple_strtoul(p, NULL, 0);
> +			/* printk("%lx %s\n", (long)s->ls_dev, p); */
> +			kfree(p);
> +			break;
> +
> +		default:
> +			printk(KERN_WARNING "unknown parameter or missing value "
> +			       "'%s' in ls command\n", p);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	}
> +
> +out:
> +	kfree(options);
> +	return ret;
> +}
> +
> +/*
> + * Add possible filters here.
> + * No permission check: we cannot verify the path's permission anyway.
> + * We simply demand root previledge for accessing /proc/filecache.
> + */
> +static int may_show_inode(struct session *s, struct inode *inode)
> +{
> +	if (!atomic_read(&inode->i_count))
> +		return 0;
> +	if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> +		return 0;
> +	if (!inode->i_mapping)
> +		return 0;
> +
> +	if (s->ls_dev && s->ls_dev != inode->i_sb->s_dev)
> +		return 0;
> +
> +	if (s->ls_options & LS_OPT_ALL)
> +		return 1;
> +
> +	if (!(s->ls_options & LS_OPT_EMPTY) && !inode->i_mapping->nrpages)
> +		return 0;
> +
> +	if ((s->ls_options & LS_OPT_DIRTY) && !(inode->i_state & I_DIRTY))
> +		return 0;
> +
> +	if ((s->ls_options & LS_OPT_CLEAN) && (inode->i_state & I_DIRTY))
> +		return 0;
> +
> +	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> +	      S_ISLNK(inode->i_mode) || S_ISBLK(inode->i_mode)))
> +		return 0;
> +
> +	return 1;
> +}
> +
> +/*
> + * Full: there are more data following.
> + */
> +static int iwin_full(struct session *s)
> +{
> +	return !s->iwin.cursor ||
> +		s->iwin.cursor > s->iwin.origin + s->iwin.size;
> +}
> +
> +static int iwin_push(struct session *s, struct inode *inode)
> +{
> +	if (!may_show_inode(s, inode))
> +		return 0;
> +
> +	s->iwin.cursor++;
> +
> +	if (s->iwin.size >= IWIN_SIZE)
> +		return 1;
> +
> +	if (s->iwin.cursor > s->iwin.origin)
> +		s->iwin.inodes[s->iwin.size++] = inode;
> +	return 0;
> +}
> +
> +/*
> + * Travease the inode lists in order - newest first.
> + * And fill @s->iwin.inodes with inodes positioned in [@pos, @pos+IWIN_SIZE).
> + */
> +static int iwin_fill(struct session *s, unsigned long pos)
> +{
> +	struct inode *inode;
> +	struct super_block *sb;
> +
> +	s->iwin.origin = pos;
> +	s->iwin.cursor = 0;
> +	s->iwin.size = 0;
> +
> +	/*
> +	 * We have a cursor inode, clean and expected to be unchanged.
> +	 */
> +	if (s->ipos.inode && pos >= s->ipos.pos &&
> +			!(s->ipos.state & I_DIRTY) &&
> +			s->ipos.state == s->ipos.inode->i_state) {
> +		inode = s->ipos.inode;
> +		s->iwin.cursor = s->ipos.pos;
> +		goto continue_from_saved;
> +	}
> +
> +	if (s->ls_options & LS_OPT_CLEAN)
> +		goto clean_inodes;
> +
> +	spin_lock(&sb_lock);
> +	list_for_each_entry(sb, &super_blocks, s_list) {
> +		if (s->ls_dev && s->ls_dev != sb->s_dev)
> +			continue;
> +
> +		list_for_each_entry(inode, &sb->s_dirty, i_list) {
> +			if (iwin_push(s, inode))
> +				goto out_full_unlock;
> +		}
> +		list_for_each_entry(inode, &sb->s_io, i_list) {
> +			if (iwin_push(s, inode))
> +				goto out_full_unlock;
> +		}
> +	}
> +	spin_unlock(&sb_lock);
> +
> +clean_inodes:
> +	list_for_each_entry(inode, &inode_in_use, i_list) {
> +		if (iwin_push(s, inode))
> +			goto out_full;
> +continue_from_saved:
> +		;
> +	}
> +
> +	if (s->ls_options & LS_OPT_INUSE)
> +		return 0;
> +
> +	list_for_each_entry(inode, &inode_unused, i_list) {
> +		if (iwin_push(s, inode))
> +			goto out_full;
> +	}
> +
> +	return 0;
> +
> +out_full_unlock:
> +	spin_unlock(&sb_lock);
> +out_full:
> +	return 1;
> +}
> +
> +static struct inode *iwin_inode(struct session *s, unsigned long pos)
> +{
> +	if ((iwin_full(s) && pos >= s->iwin.origin + s->iwin.size)
> +			  || pos < s->iwin.origin)
> +		iwin_fill(s, pos);
> +
> +	if (pos >= s->iwin.cursor)
> +		return NULL;
> +
> +	s->ipos.pos = pos;
> +	s->ipos.inode = s->iwin.inodes[pos - s->iwin.origin];
> +	BUG_ON(!s->ipos.inode);
> +	return s->ipos.inode;
> +}
> +
> +static void show_inode(struct seq_file *m, struct inode *inode)
> +{
> +	char state[] = "--"; /* dirty, locked */
> +	struct dentry *dentry;
> +	loff_t size = i_size_read(inode);
> +	unsigned long nrpages;
> +	int percent;
> +	int refcnt;
> +	int shift;
> +
> +	if (!size)
> +		size++;
> +
> +	if (inode->i_mapping)
> +		nrpages = inode->i_mapping->nrpages;
> +	else {
> +		nrpages = 0;
> +		WARN_ON(1);
> +	}
> +
> +	for (shift = 0; (size >> shift) > ULONG_MAX / 128; shift += 12)
> +		;
> +	percent = min(100UL, (((100 * nrpages) >> shift) << PAGE_CACHE_SHIFT) /
> +						(unsigned long)(size >> shift));
> +
> +	if (inode->i_state & (I_DIRTY_DATASYNC|I_DIRTY_PAGES))
> +		state[0] = 'D';
> +	else if (inode->i_state & I_DIRTY_SYNC)
> +		state[0] = 'd';
> +
> +	if (inode->i_state & I_LOCK)
> +		state[0] = 'L';
> +
> +	refcnt = 0;
> +	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
> +		refcnt += atomic_read(&dentry->d_count);
> +	}
> +
> +	seq_printf(m, "%10lu %10llu %8lu %7d ",
> +			inode->i_ino,
> +			DIV_ROUND_UP(size, 1024),
> +			nrpages << (PAGE_CACHE_SHIFT - 10),
> +			percent);
> +
> +	seq_printf(m, "%6d %5s %9lu ",
> +			refcnt,
> +			state,
> +			(jiffies - inode->dirtied_when) / HZ);
> +
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> +	seq_printf(m, "%8u %-16s",
> +			inode->i_access_count,
> +			inode->i_comm);
> +#endif
> +
> +	seq_printf(m, "%02x:%02x(%s)\t",
> +			MAJOR(inode->i_sb->s_dev),
> +			MINOR(inode->i_sb->s_dev),
> +			inode->i_sb->s_id);
> +
> +	if (list_empty(&inode->i_dentry)) {
> +		if (!atomic_read(&inode->i_count))
> +			seq_puts(m, "(noname)\n");
> +		else
> +			seq_printf(m, "(%02x:%02x)\n",
> +					imajor(inode), iminor(inode));
> +	} else {
> +		struct path path = {
> +			.mnt = NULL,
> +			.dentry = list_entry(inode->i_dentry.next,
> +					     struct dentry, d_alias)
> +		};
> +
> +		seq_path(m, &path, " \t\n\\");
> +		seq_putc(m, '\n');
> +	}
> +}
> +
> +static int ii_show(struct seq_file *m, void *v)
> +{
> +	unsigned long index = *(loff_t *) v;
> +	struct session *s = m->private;
> +        struct inode *inode;
> +
> +	if (index == 0) {
> +		seq_puts(m, "# filecache " FILECACHE_VERSION "\n");
> +		seq_puts(m, "#      ino       size   cached cached% "
> +				"refcnt state       age "
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> +				"accessed  process         "
> +#endif
> +				"dev\t\tfile\n");
> +	}
> +
> +        inode = iwin_inode(s,index);
> +	show_inode(m, inode);
> +
> +	return 0;
> +}
> +
> +static void *ii_start(struct seq_file *m, loff_t *pos)
> +{
> +	struct session *s = m->private;
> +
> +	s->iwin.size = 0;
> +	s->iwin.inodes = (struct inode **)
> +				__get_free_pages( GFP_KERNEL, IWIN_PAGE_ORDER);
> +	if (!s->iwin.inodes)
> +		return NULL;
> +
> +	spin_lock(&inode_lock);
> +
> +	return iwin_inode(s, *pos) ? pos : NULL;
> +}
> +
> +static void *ii_next(struct seq_file *m, void *v, loff_t *pos)
> +{
> +	struct session *s = m->private;
> +
> +	(*pos)++;
> +	return iwin_inode(s, *pos) ? pos : NULL;
> +}
> +
> +static void ii_stop(struct seq_file *m, void *v)
> +{
> +	struct session *s = m->private;
> +	struct inode *inode = s->ipos.inode;
> +
> +	if (!s->iwin.inodes)
> +		return;
> +
> +	if (inode) {
> +		__iget(inode);
> +		s->ipos.state = inode->i_state;
> +	}
> +	spin_unlock(&inode_lock);
> +
> +	free_pages((unsigned long) s->iwin.inodes, IWIN_PAGE_ORDER);
> +	if (s->ipos.pinned_inode)
> +		iput(s->ipos.pinned_inode);
> +	s->ipos.pinned_inode = inode;
> +}
> +
> +/*
> + * Listing of cached page ranges of a file.
> + *
> + * Usage:
> + * 		echo 'file name' > /proc/filecache
> + * 		cat /proc/filecache
> + */
> +
> +unsigned long page_mask;
> +#define PG_MMAP		PG_lru		/* reuse any non-relevant flag */
> +#define PG_BUFFER	PG_swapcache	/* ditto */
> +#define PG_DIRTY	PG_error	/* ditto */
> +#define PG_WRITEBACK	PG_buddy	/* ditto */
> +
> +/*
> + * Page state names, prefixed by their abbreviations.
> + */
> +struct {
> +	unsigned long	mask;
> +	const char     *name;
> +	int		faked;
> +} page_flag [] = {
> +	{1 << PG_referenced,	"R:referenced",	0},
> +	{1 << PG_active,	"A:active",	0},
> +	{1 << PG_MMAP,		"M:mmap",	1},
> +
> +	{1 << PG_uptodate,	"U:uptodate",	0},
> +	{1 << PG_dirty,		"D:dirty",	0},
> +	{1 << PG_writeback,	"W:writeback",	0},
> +	{1 << PG_reclaim,	"X:readahead",	0},
> +
> +	{1 << PG_private,	"P:private",	0},
> +	{1 << PG_owner_priv_1,	"O:owner",	0},
> +
> +	{1 << PG_BUFFER,	"b:buffer",	1},
> +	{1 << PG_DIRTY,		"d:dirty",	1},
> +	{1 << PG_WRITEBACK,	"w:writeback",	1},
> +};
> +
> +static unsigned long page_flags(struct page* page)
> +{
> +	unsigned long flags;
> +	struct address_space *mapping = page_mapping(page);
> +
> +	flags = page->flags & page_mask;
> +
> +	if (page_mapped(page))
> +		flags |= (1 << PG_MMAP);
> +
> +	if (page_has_buffers(page))
> +		flags |= (1 << PG_BUFFER);
> +
> +	if (mapping) {
> +		if (radix_tree_tag_get(&mapping->page_tree,
> +					page_index(page),
> +					PAGECACHE_TAG_WRITEBACK))
> +			flags |= (1 << PG_WRITEBACK);
> +
> +		if (radix_tree_tag_get(&mapping->page_tree,
> +					page_index(page),
> +					PAGECACHE_TAG_DIRTY))
> +			flags |= (1 << PG_DIRTY);
> +	}
> +
> +	return flags;
> +}
> +
> +static int pages_similiar(struct page* page0, struct page* page)
> +{
> +	if (page_count(page0) != page_count(page))
> +		return 0;
> +
> +	if (page_flags(page0) != page_flags(page))
> +		return 0;
> +
> +	return 1;
> +}
> +
> +static void show_range(struct seq_file *m, struct page* page, unsigned long len)
> +{
> +	int i;
> +	unsigned long flags;
> +
> +	if (!m || !page)
> +		return;
> +
> +	seq_printf(m, "%lu\t%lu\t", page->index, len);
> +
> +	flags = page_flags(page);
> +	for (i = 0; i < ARRAY_SIZE(page_flag); i++)
> +		seq_putc(m, (flags & page_flag[i].mask) ?
> +					page_flag[i].name[0] : '_');
> +
> +	seq_printf(m, "\t%d\n", page_count(page));
> +}
> +
> +#define BATCH_LINES	100
> +static pgoff_t show_file_cache(struct seq_file *m,
> +				struct address_space *mapping, pgoff_t start)
> +{
> +	int i;
> +	int lines = 0;
> +	pgoff_t len = 0;
> +	struct pagevec pvec;
> +	struct page *page;
> +	struct page *page0 = NULL;
> +
> +	for (;;) {
> +		pagevec_init(&pvec, 0);
> +		pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> +				(void **)pvec.pages, start + len, PAGEVEC_SIZE);
> +
> +		if (pvec.nr == 0) {
> +			show_range(m, page0, len);
> +			start = ULONG_MAX;
> +			goto out;
> +		}
> +
> +		if (!page0)
> +			page0 = pvec.pages[0];
> +
> +		for (i = 0; i < pvec.nr; i++) {
> +			page = pvec.pages[i];
> +
> +			if (page->index == start + len &&
> +					pages_similiar(page0, page))
> +				len++;
> +			else {
> +				show_range(m, page0, len);
> +				page0 = page;
> +				start = page->index;
> +				len = 1;
> +				if (++lines > BATCH_LINES)
> +					goto out;
> +			}
> +		}
> +	}
> +
> +out:
> +	return start;
> +}
> +
> +static int pg_show(struct seq_file *m, void *v)
> +{
> +	struct session *s = m->private;
> +	struct file *file = s->query_file;
> +	pgoff_t offset;
> +
> +	if (!file)
> +		return ii_show(m, v);
> +
> +	offset = *(loff_t *) v;
> +
> +	if (!offset) { /* print header */
> +		int i;
> +
> +		seq_puts(m, "# file ");
> +		seq_path(m, &file->f_path, " \t\n\\");
> +
> +		seq_puts(m, "\n# flags");
> +		for (i = 0; i < ARRAY_SIZE(page_flag); i++)
> +			seq_printf(m, " %s", page_flag[i].name);
> +
> +		seq_puts(m, "\n# idx\tlen\tstate\t\trefcnt\n");
> +	}
> +
> +	s->start_offset = offset;
> +	s->next_offset = show_file_cache(m, file->f_mapping, offset);
> +
> +	return 0;
> +}
> +
> +static void *file_pos(struct file *file, loff_t *pos)
> +{
> +	loff_t size = i_size_read(file->f_mapping->host);
> +	pgoff_t end = DIV_ROUND_UP(size, PAGE_CACHE_SIZE);
> +	pgoff_t offset = *pos;
> +
> +	return offset < end ? pos : NULL;
> +}
> +
> +static void *pg_start(struct seq_file *m, loff_t *pos)
> +{
> +	struct session *s = m->private;
> +	struct file *file = s->query_file;
> +	pgoff_t offset = *pos;
> +
> +	if (!file)
> +		return ii_start(m, pos);
> +
> +	rcu_read_lock();
> +
> +	if (offset - s->start_offset == 1)
> +		*pos = s->next_offset;
> +	return file_pos(file, pos);
> +}
> +
> +static void *pg_next(struct seq_file *m, void *v, loff_t *pos)
> +{
> +	struct session *s = m->private;
> +	struct file *file = s->query_file;
> +
> +	if (!file)
> +		return ii_next(m, v, pos);
> +
> +	*pos = s->next_offset;
> +	return file_pos(file, pos);
> +}
> +
> +static void pg_stop(struct seq_file *m, void *v)
> +{
> +	struct session *s = m->private;
> +	struct file *file = s->query_file;
> +
> +	if (!file)
> +		return ii_stop(m, v);
> +
> +	rcu_read_unlock();
> +}
> +
> +struct seq_operations seq_filecache_op = {
> +	.start	= pg_start,
> +	.next	= pg_next,
> +	.stop	= pg_stop,
> +	.show	= pg_show,
> +};
> +
> +/*
> + * Implement the manual drop-all-pagecache function
> + */
> +
> +#define MAX_INODES	(PAGE_SIZE / sizeof(struct inode *))
> +static int drop_pagecache(void)
> +{
> +	struct hlist_head *head;
> +	struct hlist_node *node;
> +	struct inode *inode;
> +	struct inode **inodes;
> +	unsigned long i, j, k;
> +	int err = 0;
> +
> +	inodes = (struct inode **)__get_free_pages(GFP_KERNEL, IWIN_PAGE_ORDER);
> +	if (!inodes)
> +		return -ENOMEM;
> +
> +	for (i = 0; (head = get_inode_hash_budget(i)); i++) {
> +		if (hlist_empty(head))
> +			continue;
> +
> +		j = 0;
> +		cond_resched();
> +
> +		/*
> +		 * Grab some inodes.
> +		 */
> +		spin_lock(&inode_lock);
> +		hlist_for_each (node, head) {
> +			inode = hlist_entry(node, struct inode, i_hash);
> +			if (!atomic_read(&inode->i_count))
> +				continue;
> +			if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> +				continue;
> +			if (!inode->i_mapping || !inode->i_mapping->nrpages)
> +				continue;
> +			__iget(inode);
> +			inodes[j++] = inode;
> +			if (j >= MAX_INODES)
> +				break;
> +		}
> +		spin_unlock(&inode_lock);
> +
> +		/*
> +		 * Free clean pages.
> +		 */
> +		for (k = 0; k < j; k++) {
> +			inode = inodes[k];
> +			invalidate_mapping_pages(inode->i_mapping, 0, ~1);
> +			iput(inode);
> +		}
> +
> +		/*
> +		 * Simply ignore the remaining inodes.
> +		 */
> +		if (j >= MAX_INODES && !err) {
> +			printk(KERN_WARNING
> +				"Too many collides in inode hash table.\n"
> +				"Pls boot with a larger ihash_entries=XXX.\n");
> +			err = -EAGAIN;
> +		}
> +	}
> +
> +	free_pages((unsigned long) inodes, IWIN_PAGE_ORDER);
> +	return err;
> +}
> +
> +static void drop_slabcache(void)
> +{
> +	int nr_objects;
> +
> +	do {
> +		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
> +	} while (nr_objects > 10);
> +}
> +
> +/*
> + * Proc file operations.
> + */
> +
> +static int filecache_open(struct inode *inode, struct file *proc_file)
> +{
> +	struct seq_file *m;
> +	struct session *s;
> +	unsigned size;
> +	char *buf = 0;
> +	int ret;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENOENT;
> +
> +	s = session_create();
> +	if (IS_ERR(s)) {
> +		ret = PTR_ERR(s);
> +		goto out;
> +	}
> +	set_session(proc_file, s);
> +
> +	size = SBUF_SIZE;
> +	buf = kmalloc(size, GFP_KERNEL);
> +	if (!buf) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = seq_open(proc_file, &seq_filecache_op);
> +	if (!ret) {
> +		m = proc_file->private_data;
> +		m->private = s;
> +		m->buf = buf;
> +		m->size = size;
> +	}
> +
> +out:
> +	if (ret) {
> +		kfree(s);
> +		kfree(buf);
> +		module_put(THIS_MODULE);
> +	}
> +	return ret;
> +}
> +
> +static int filecache_release(struct inode *inode, struct file *proc_file)
> +{
> +	struct session *s = get_session(proc_file);
> +	int ret;
> +
> +	session_release(s);
> +	ret = seq_release(inode, proc_file);
> +	module_put(THIS_MODULE);
> +	return ret;
> +}
> +
> +ssize_t filecache_write(struct file *proc_file, const char __user * buffer,
> +			size_t count, loff_t *ppos)
> +{
> +	struct session *s;
> +	char *name;
> +	int err = 0;
> +
> +	if (count >= PATH_MAX + 5)
> +		return -ENAMETOOLONG;
> +
> +	name = kmalloc(count+1, GFP_KERNEL);
> +	if (!name)
> +		return -ENOMEM;
> +
> +	if (copy_from_user(name, buffer, count)) {
> +		err = -EFAULT;
> +		goto out;
> +	}
> +
> +	/* strip the optional newline */
> +	if (count && name[count-1] == '\n')
> +		name[count-1] = '\0';
> +	else
> +		name[count] = '\0';
> +
> +	s = get_session(proc_file);
> +	if (!strcmp(name, "set private")) {
> +		s->private_session = 1;
> +		goto out;
> +	}
> +
> +	if (!strncmp(name, "cat ", 4)) {
> +		err = session_update_file(s, name+4);
> +		goto out;
> +	}
> +
> +	if (!strncmp(name, "ls", 2)) {
> +		err = session_update_file(s, NULL);
> +		if (!err)
> +			err = ls_parse_options(name+2, s);
> +		if (!err && !s->private_session) {
> +			global_session.ls_dev = s->ls_dev;
> +			global_session.ls_options = s->ls_options;
> +		}
> +		goto out;
> +	}
> +
> +	if (!strncmp(name, "drop pagecache", 14)) {
> +		err = drop_pagecache();
> +		goto out;
> +	}
> +
> +	if (!strncmp(name, "drop slabcache", 14)) {
> +		drop_slabcache();
> +		goto out;
> +	}
> +
> +	/* otherwise treat the input as a file name to query */
> +	err = session_update_file(s, name);
> +
> +out:
> +	kfree(name);
> +
> +	return err ? err : count;
> +}
> +
> +static struct file_operations proc_filecache_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= filecache_open,
> +	.release	= filecache_release,
> +	.write		= filecache_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +};
> +
> +
> +static __init int filecache_init(void)
> +{
> +	int i;
> +	struct proc_dir_entry *entry;
> +
> +	entry = create_proc_entry("filecache", 0600, NULL);
> +	if (entry)
> +		entry->proc_fops = &proc_filecache_fops;
> +
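> +	/*
> +	 * Precompute the mask of real page flags to report; the "faked"
> +	 * flags are synthesized at query time in page_flags().
> +	 */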
> +	for (page_mask = i = 0; i < ARRAY_SIZE(page_flag); i++)
> +		if (!page_flag[i].faked)
> +			page_mask |= page_flag[i].mask;
> +
> +	return 0;
> +}
> +
> +static void __exit filecache_exit(void)
> +{
> +	remove_proc_entry("filecache", NULL);
> +	if (global_session.query_file)
> +		fput(global_session.query_file);
> +}
> +
> +MODULE_AUTHOR("Fengguang Wu <wfg@mail.ustc.edu.cn>");
> +MODULE_LICENSE("GPL");
> +
> +module_init(filecache_init);
> +module_exit(filecache_exit);
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -775,6 +775,11 @@ struct inode {
>  	void			*i_security;
>  #endif
>  	void			*i_private; /* fs or device private pointer */
> +
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> +	unsigned int		i_access_count;	/* opened how many times? */
> +	char			i_comm[16];	/* opened first by which app? */
> +#endif
>  };
>  
>  /*
> @@ -860,6 +865,13 @@ static inline unsigned imajor(const stru
>  	return MAJOR(inode->i_rdev);
>  }
>  
> +static inline void inode_accessed(struct inode *inode)
> +{
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> +	inode->i_access_count++;
> +#endif
> +}
> +
>  extern struct block_device *I_BDEV(struct inode *inode);
>  
>  struct fown_struct {
> @@ -2171,6 +2183,7 @@ extern void remove_inode_hash(struct ino
>  static inline void insert_inode_hash(struct inode *inode) {
>  	__insert_inode_hash(inode, inode->i_ino);
>  }
> +struct hlist_head * get_inode_hash_budget(unsigned long index);
>  
>  extern struct file * get_empty_filp(void);
>  extern void file_move(struct file *f, struct list_head *list);
> --- linux-2.6.orig/fs/open.c
> +++ linux-2.6/fs/open.c
> @@ -842,6 +842,7 @@ static struct file *__dentry_open(struct
>  			goto cleanup_all;
>  	}
>  
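> +	/* count this open for /proc/filecache's "accessed" column */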
> +	inode_accessed(inode);
>  	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
>  
>  	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
> --- linux-2.6.orig/fs/Kconfig
> +++ linux-2.6/fs/Kconfig
> @@ -265,4 +265,34 @@ endif
>  source "fs/nls/Kconfig"
>  source "fs/dlm/Kconfig"
>  
> +config PROC_FILECACHE
> +	tristate "/proc/filecache support"
> +	default m
> +	depends on PROC_FS
> +	help
> +	  This option creates a file, /proc/filecache, which lets you
> +	  query and drop the in-memory file cache.
> +
> +	  A quick start guide:
> +
> +	  # echo 'ls' > /proc/filecache
> +	  # head /proc/filecache
> +
> +	  # echo 'cat /bin/bash' > /proc/filecache
> +	  # head /proc/filecache
> +
> +	  # echo 'drop pagecache' > /proc/filecache
> +	  # echo 'drop slabcache' > /proc/filecache
> +
> +	  For more details, please check Documentation/filesystems/proc.txt.
> +
> +	  It can be a handy tool for sysadmins and desktop users.
> +
> +config PROC_FILECACHE_EXTRAS
> +	bool "track extra states"
> +	default y
> +	depends on PROC_FILECACHE
> +	help
> +	  Track extra state that costs a little more time and space.
> +
>  endmenu
> --- linux-2.6.orig/fs/proc/Makefile
> +++ linux-2.6/fs/proc/Makefile
> @@ -2,7 +2,8 @@
>  # Makefile for the Linux proc filesystem routines.
>  #
>  
> -obj-$(CONFIG_PROC_FS) += proc.o
> +obj-$(CONFIG_PROC_FS)		+= proc.o
> +obj-$(CONFIG_PROC_FILECACHE)	+= filecache.o
>  
>  proc-y			:= nommu.o task_nommu.o
>  proc-$(CONFIG_MMU)	:= mmu.o task_mmu.o
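
For anyone who wants to drive the interface from a test program rather
than the shell, here is a minimal user-space sketch following the
command protocol documented above (an illustration only, not part of
the patch; it must run as root, error handling is mostly omitted, and
the "set private" step is only needed to avoid disturbing other
queriers):

	#include <stdio.h>

	int main(void)
	{
		char line[512];
		FILE *f = fopen("/proc/filecache", "r+");

		if (!f) {
			perror("/proc/filecache");
			return 1;
		}
		fputs("set private", f);	/* keep parameters local */
		fflush(f);
		fputs("ls", f);			/* request the inode index */
		fflush(f);
		rewind(f);			/* switch from writing to reading */
		while (fgets(line, sizeof(line), f))
			fputs(line, stdout);
		fclose(f);
		return 0;
	}

Writing "cat /bin/bash", "drop pagecache" or "drop slabcache" instead
of "ls" exercises the other commands the same way.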



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
  2009-05-17 13:55                                   ` Frederic Weisbecker
@ 2009-05-17 14:12                                     ` Wu Fengguang
  -1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-05-17 14:12 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Li Zefan, Tom Zanussi, KOSAKI Motohiro,
	Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Sun, May 17, 2009 at 09:55:12PM +0800, Frederic Weisbecker wrote:
> On Sun, May 17, 2009 at 09:36:59PM +0800, Wu Fengguang wrote:
> > On Tue, May 12, 2009 at 09:01:12PM +0800, Frederic Weisbecker wrote:
> > > On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> > > > On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> > > >
> > > > There are two possible challenges for the conversion:
> > > >
> > > > - One trick it does is to select different lists to traverse on
> > > >   different filter options. Will this be possible in the object
> > > >   tracing framework?
> > >
> > > Yeah, I guess.
> >
> > Great.
> >
> > >
> > > > - The file name lookup (last field) is the performance killer. Is it
> > > >   possible to skip the file name lookup when the filter fails on the
> > > >   leading fields?
> > >
> > > Object collection relies on trace events, where filters basically
> > > ignore a whole entry when it does not match. Not sure if we can
> > > easily ignore only one field.
> > >
> > > But I guess we can do something about the performances...
> >
> > OK, but it's not as important as the previous requirement, so it could
> > be the last thing to work on :)
> >
> > > Could you send us the (sob'ed) patch you made which implements this.
> > > I could try to adapt it to object collection.
> >
> > Attached for your reference. Be aware that I still have plans to
> > change it in non-trivial ways, and there is ongoing work by Nick (on
> > inode_lock) and Jens (on s_dirty) that can create merge conflicts.
> > So basically it is not the right time to do the adaptation.
> 
> 
> Ah ok, so I will wait a bit :-)
> 
> 
> > However we can still do something to polish up the page object
> > collection under /debug/tracing/objects/mm/pages/. For example,
> > the timestamps and function name could be removed from the following
> > list :)
> >
> > # tracer: nop
> > #
> > #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> > #              | |       |          |         |
> >            <...>-3743  [001]  3035.649769: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0
> >            <...>-3743  [001]  3044.176403: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0
> >            <...>-3743  [001]  3044.176407: dump_pages: pfn=2 flags=400 count=1 mapcount=0 index=0
> >            <...>-3743  [001]  3044.176408: dump_pages: pfn=3 flags=400 count=1 mapcount=0 index=0
> >            <...>-3743  [001]  3044.176409: dump_pages: pfn=4 flags=400 count=1 mapcount=0 index=0
> >            <...>-3743  [001]  3044.176409: dump_pages: pfn=5 flags=400 count=1 mapcount=0 index=0
> >            <...>-3743  [001]  3044.176410: dump_pages: pfn=6 flags=400 count=1 mapcount=0 index=0
> >            <...>-3743  [001]  3044.176410: dump_pages: pfn=7 flags=400 count=1 mapcount=0 index=0
> >            <...>-3743  [001]  3044.176411: dump_pages: pfn=8 flags=400 count=1 mapcount=0 index=0
> >            <...>-3743  [001]  3044.176411: dump_pages: pfn=9 flags=400 count=1 mapcount=0 index=0
> >            <...>-3743  [001]  3044.176412: dump_pages: pfn=10 flags=400 count=1 mapcount=0 index=0
> 
> 
> echo nocontext-info > /debug/tracing/trace_options :-)

Nice tip - I should really learn more about ftrace :-)

> But you'll have only the function and the page specifics. It's not really
> the function but more specifically the name of the event. It's useful for
> distinguishing multiple events in a trace.
> 
> Hmm, maybe it's not that useful in an object dump...

Yeah - and we could enable that option automatically in the relevant code :)
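
Until then a test harness can flip it by hand - a tiny sketch, assuming
debugfs is mounted at /debug as in the paths above:

	#include <stdio.h>

	static int set_trace_option(const char *opt)
	{
		FILE *f = fopen("/debug/tracing/trace_options", "w");

		if (!f)
			return -1;
		fprintf(f, "%s\n", opt);
		return fclose(f);
	}

	int main(void)
	{
		return set_trace_option("nocontext-info") ? 1 : 0;
	}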

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
  2009-05-17 13:36                               ` Wu Fengguang
@ 2009-05-18 11:44                                   ` KOSAKI Motohiro
  1 sibling, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-05-18 11:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Frederic Weisbecker, Ingo Molnar, Li Zefan,
	Tom Zanussi, Pekka Enberg, Andi Kleen, Steven Rostedt,
	Larry Woodman, Peter Zijlstra, Eduard - Gabriel Munteanu,
	Andrew Morton, LKML, Matt Mackall, Alexey Dobriyan, linux-mm

Hi


> > Could you send us the (sob'ed) patch you made which implements this.
> > I could try to adapt it to object collection.
> 
> Attached for your reference. Be aware that I still have plans to
> change it in non-trivial ways, and there is ongoing work by Nick (on
> inode_lock) and Jens (on s_dirty) that can create merge conflicts.
> So basically it is not the right time to do the adaptation.

If you can make an object-collection-based filecache viewer, could you
please cc me? I guess I can review the mm part.

thanks.



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
  2009-05-18 11:44                                   ` KOSAKI Motohiro
@ 2009-05-18 11:47                                     ` Wu Fengguang
  0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-05-18 11:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Frederic Weisbecker, Ingo Molnar, Li Zefan, Tom Zanussi,
	Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
	Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
	Matt Mackall, Alexey Dobriyan, linux-mm

On Mon, May 18, 2009 at 07:44:21PM +0800, KOSAKI Motohiro wrote:
> Hi
> 
> 
> > > Could you send us the (sob'ed) patch you made which implements this?
> > > I could try to adapt it to object collection.
> > 
> > Attached for your reference. Be aware that I still plan to change it
> > in a non-trivial way, and there is ongoing work by Nick (on inode_lock)
> > and Jens (on s_dirty) that could create merge conflicts. So it is
> > basically not the right time to do the adaptation.
> 
> If you make an object-collection-based filecache viewer, could you
> please cc me? I think I can review the mm part.

OK, thank you! I should be able to work on it next month.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 137+ messages in thread

end of thread

Thread overview: 137+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-28  1:09 [PATCH 0/5] proc: export more page flags in /proc/kpageflags (take 4) Wu Fengguang
2009-04-28  1:09 ` Wu Fengguang
2009-04-28  1:09 ` [PATCH 1/5] pagemap: document clarifications Wu Fengguang
2009-04-28  1:09   ` Wu Fengguang
2009-04-28  7:11   ` Tommi Rantala
2009-04-28  7:11     ` Tommi Rantala
2009-04-28  1:09 ` [PATCH 2/5] pagemap: documentation 9 more exported page flags Wu Fengguang
2009-04-28  1:09   ` Wu Fengguang
2009-04-28  1:09 ` [PATCH 3/5] mm: introduce PageHuge() for testing huge/gigantic pages Wu Fengguang
2009-04-28  1:09   ` Wu Fengguang
2009-04-28  1:09 ` [PATCH 4/5] proc: kpagecount/kpageflags code cleanup Wu Fengguang
2009-04-28  1:09   ` Wu Fengguang
2009-04-28  1:09 ` [PATCH 5/5] proc: export more page flags in /proc/kpageflags Wu Fengguang
2009-04-28  1:09   ` Wu Fengguang
2009-04-28  6:55   ` Ingo Molnar
2009-04-28  6:55     ` Ingo Molnar
2009-04-28  7:40     ` Andi Kleen
2009-04-28  7:40       ` Andi Kleen
2009-04-28  9:04       ` Pekka Enberg
2009-04-28  9:04         ` Pekka Enberg
2009-04-28  9:10         ` Andi Kleen
2009-04-28  9:10           ` Andi Kleen
2009-04-28  9:15           ` Pekka Enberg
2009-04-28  9:15             ` Pekka Enberg
2009-04-28  9:15         ` Ingo Molnar
2009-04-28  9:15           ` Ingo Molnar
2009-04-28  9:19           ` Pekka Enberg
2009-04-28  9:19             ` Pekka Enberg
2009-04-28  9:25             ` Pekka Enberg
2009-04-28  9:25               ` Pekka Enberg
2009-04-28  9:36               ` Wu Fengguang
2009-04-28  9:36                 ` Wu Fengguang
2009-04-28  9:36               ` Ingo Molnar
2009-04-28  9:36                 ` Ingo Molnar
2009-04-28  9:57                 ` Pekka Enberg
2009-04-28  9:57                   ` Pekka Enberg
2009-04-28 10:10                   ` KOSAKI Motohiro
2009-04-28 10:10                     ` KOSAKI Motohiro
2009-04-28 10:21                     ` Pekka Enberg
2009-04-28 10:21                       ` Pekka Enberg
2009-04-28 10:56                       ` Ingo Molnar
2009-04-28 10:56                         ` Ingo Molnar
2009-04-28 11:09                         ` KOSAKI Motohiro
2009-04-28 11:09                           ` KOSAKI Motohiro
2009-04-28 12:42                           ` Ingo Molnar
2009-04-28 12:42                             ` Ingo Molnar
2009-04-28 11:03                   ` Ingo Molnar
2009-04-28 11:03                     ` Ingo Molnar
2009-04-28 17:42                 ` Matt Mackall
2009-04-28 17:42                   ` Matt Mackall
2009-04-28  9:29             ` Ingo Molnar
2009-04-28  9:29               ` Ingo Molnar
2009-04-28  9:34               ` KOSAKI Motohiro
2009-04-28  9:34                 ` KOSAKI Motohiro
2009-04-28  9:38                 ` Ingo Molnar
2009-04-28  9:38                   ` Ingo Molnar
2009-04-28  9:55                   ` Wu Fengguang
2009-04-28  9:55                     ` Wu Fengguang
2009-04-28 10:11                     ` KOSAKI Motohiro
2009-04-28 10:11                       ` KOSAKI Motohiro
2009-04-28 11:05                     ` Ingo Molnar
2009-04-28 11:05                       ` Ingo Molnar
2009-04-28 11:36                       ` Wu Fengguang
2009-04-28 11:36                         ` Wu Fengguang
2009-04-28 12:17                         ` [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags) Ingo Molnar
2009-04-28 12:17                           ` Ingo Molnar
2009-04-28 13:31                           ` Wu Fengguang
2009-04-28 13:31                             ` Wu Fengguang
2009-05-12 13:01                             ` Frederic Weisbecker
2009-05-12 13:01                               ` Frederic Weisbecker
2009-05-17 13:36                               ` Wu Fengguang
2009-05-17 13:55                                 ` Frederic Weisbecker
2009-05-17 13:55                                   ` Frederic Weisbecker
2009-05-17 14:12                                   ` Wu Fengguang
2009-05-17 14:12                                     ` Wu Fengguang
2009-05-18 11:44                                 ` KOSAKI Motohiro
2009-05-18 11:44                                   ` KOSAKI Motohiro
2009-05-18 11:47                                   ` Wu Fengguang
2009-05-18 11:47                                     ` Wu Fengguang
2009-04-28 10:18                   ` [PATCH 5/5] proc: export more page flags in /proc/kpageflags Andi Kleen
2009-04-28 10:18                     ` Andi Kleen
2009-04-28  8:33     ` Wu Fengguang
2009-04-28  8:33       ` Wu Fengguang
2009-04-28  9:24       ` Ingo Molnar
2009-04-28  9:24         ` Ingo Molnar
2009-04-28 18:11       ` Tony Luck
2009-04-28 18:11         ` Tony Luck
2009-04-28 18:34         ` Matt Mackall
2009-04-28 18:34           ` Matt Mackall
2009-04-28 20:47           ` Tony Luck
2009-04-28 20:47             ` Tony Luck
2009-04-28 20:54             ` Andi Kleen
2009-04-28 20:54               ` Andi Kleen
2009-04-28 20:59             ` Matt Mackall
2009-04-28 20:59               ` Matt Mackall
2009-04-28 21:17         ` Andrew Morton
2009-04-28 21:17           ` Andrew Morton
2009-04-28 21:49           ` Matt Mackall
2009-04-28 21:49             ` Matt Mackall
2009-04-29  0:02             ` Robin Holt
2009-04-29  0:02               ` Robin Holt
2009-04-28 17:49   ` Matt Mackall
2009-04-28 17:49     ` Matt Mackall
2009-04-29  8:05     ` Wu Fengguang
2009-04-29  8:05       ` Wu Fengguang
2009-04-29 19:13       ` Matt Mackall
2009-04-29 19:13         ` Matt Mackall
2009-04-30  1:00         ` Wu Fengguang
2009-04-30  1:00           ` Wu Fengguang
2009-04-28 21:32   ` Andrew Morton
2009-04-28 21:32     ` Andrew Morton
2009-04-28 22:46     ` Matt Mackall
2009-04-28 22:46       ` Matt Mackall
2009-04-28 23:02       ` Andrew Morton
2009-04-28 23:02         ` Andrew Morton
2009-04-28 23:31         ` Matt Mackall
2009-04-28 23:31           ` Matt Mackall
2009-04-28 23:42           ` Andrew Morton
2009-04-28 23:42             ` Andrew Morton
2009-04-28 23:55             ` Matt Mackall
2009-04-28 23:55               ` Matt Mackall
2009-04-29  3:33               ` Wu Fengguang
2009-04-29  3:33                 ` Wu Fengguang
2009-04-29  2:38     ` Wu Fengguang
2009-04-29  2:38       ` Wu Fengguang
2009-04-29  2:55       ` Andrew Morton
2009-04-29  2:55         ` Andrew Morton
2009-04-29  3:48         ` Wu Fengguang
2009-04-29  3:48           ` Wu Fengguang
2009-04-29  5:09           ` Wu Fengguang
2009-04-29  5:09             ` Wu Fengguang
2009-04-29  4:41       ` Nathan Lynch
2009-04-29  4:41         ` Nathan Lynch
2009-04-29  4:41         ` Nathan Lynch
2009-04-29  4:50         ` Andrew Morton
2009-04-29  4:50           ` Andrew Morton
2009-04-29  4:50           ` Andrew Morton
