* [PATCH 0/5] proc: export more page flags in /proc/kpageflags (take 4)
From: Wu Fengguang @ 2009-04-28 1:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, KOSAKI Motohiro, Wu Fengguang, Andi Kleen, linux-mm
Hi all,
Export 9 more flags to end users (and more for kernel developers):
11. KPF_MMAP (pseudo flag) memory mapped page
12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
13. KPF_SWAPCACHE page is in swap cache
14. KPF_SWAPBACKED page is swap/RAM backed
15. KPF_COMPOUND_HEAD (*)
16. KPF_COMPOUND_TAIL (*)
17. KPF_UNEVICTABLE page is in the unevictable LRU list
18. KPF_HWPOISON hardware detected corruption
19. KPF_NOPAGE (pseudo flag) no page frame at the address
(*) For compound pages, exporting _both_ head/tail info enables
users to tell where a compound page starts/ends, and its order.
Please see the documentation patch and the changelog of the final patch
for the details.
[PATCH 1/5] pagemap: document clarifications
[PATCH 2/5] pagemap: documentation 9 more exported page flags
[PATCH 3/5] mm: introduce PageHuge() for testing huge/gigantic pages
[PATCH 4/5] proc: kpagecount/kpageflags code cleanup
[PATCH 5/5] proc: export more page flags in /proc/kpageflags
Thanks,
Fengguang
--
* [PATCH 1/5] pagemap: document clarifications
From: Wu Fengguang @ 2009-04-28 1:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm
[-- Attachment #1: kpageflags-doc-fix.patch --]
[-- Type: text/plain, Size: 1171 bytes --]
Some bit ranges were inclusive and some not.
Fix them to be consistently inclusive.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
Documentation/vm/pagemap.txt | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- mm.orig/Documentation/vm/pagemap.txt
+++ mm/Documentation/vm/pagemap.txt
@@ -12,9 +12,9 @@ There are three components to pagemap:
value for each virtual page, containing the following data (from
fs/proc/task_mmu.c, above pagemap_read):
- * Bits 0-55 page frame number (PFN) if present
+ * Bits 0-54 page frame number (PFN) if present
* Bits 0-4 swap type if swapped
- * Bits 5-55 swap offset if swapped
+ * Bits 5-54 swap offset if swapped
* Bits 55-60 page shift (page size = 1<<page shift)
* Bit 61 reserved for future use
* Bit 62 page swapped
@@ -36,7 +36,7 @@ There are three components to pagemap:
* /proc/kpageflags. This file contains a 64-bit set of flags for each
page, indexed by PFN.
- The flags are (from fs/proc/proc_misc, above kpageflags_read):
+ The flags are (from fs/proc/page.c, above kpageflags_read):
0. LOCKED
1. ERROR
--
* [PATCH 2/5] pagemap: documentation 9 more exported page flags
From: Wu Fengguang @ 2009-04-28 1:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm
[-- Attachment #1: kpageflags-doc.patch --]
[-- Type: text/plain, Size: 2990 bytes --]
Also add short descriptions for all of the 20 exported page flags.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
Documentation/vm/pagemap.txt | 62 +++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)
--- mm.orig/Documentation/vm/pagemap.txt
+++ mm/Documentation/vm/pagemap.txt
@@ -49,6 +49,68 @@ There are three components to pagemap:
8. WRITEBACK
9. RECLAIM
10. BUDDY
+ 11. MMAP
+ 12. ANON
+ 13. SWAPCACHE
+ 14. SWAPBACKED
+ 15. COMPOUND_HEAD
+ 16. COMPOUND_TAIL
+ 17. UNEVICTABLE
+ 18. HWPOISON
+ 19. NOPAGE
+
+Short descriptions to the page flags:
+
+ 0. LOCKED
+ page is being locked for exclusive access, eg. by undergoing read/write IO
+
+ 7. SLAB
+ page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
+ When compound page is used, SLUB/SLQB will only set this flag on the head
+ page; SLOB will not flag it at all.
+
+10. BUDDY
+ a free memory block managed by the buddy system allocator
+ The buddy system organizes free memory in blocks of various orders.
+ An order N block has 2^N physically contiguous pages, with the BUDDY flag
+ set for and _only_ for the first page.
+
+15. COMPOUND_HEAD
+16. COMPOUND_TAIL
+ A compound page with order N consists of 2^N physically contiguous pages.
+ A compound page with order 2 takes the form of "HTTT", where H denotes its
+ head page and T denotes its tail page(s). The major consumers of compound
+ pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc.
+ memory allocators and various device drivers. However in this interface,
+ only huge/giga pages are made visible to end users.
+
+18. HWPOISON
+ hardware detected memory corruption on this page: don't touch the data!
+
+19. NOPAGE
+ no page frame exists at the requested address
+
+ [IO related page flags]
+ 1. ERROR IO error occurred
+ 3. UPTODATE page has up-to-date data
+ ie. for file backed page: (in-memory data revision >= on-disk one)
+ 4. DIRTY page has been written to, hence contains new data
+ ie. for file backed page: (in-memory data revision > on-disk one)
+ 8. WRITEBACK page is being synced to disk
+
+ [LRU related page flags]
+ 5. LRU page is in one of the LRU lists
+ 6. ACTIVE page is in the active LRU list
+17. UNEVICTABLE page is in the unevictable (non-)LRU list
+ It is somehow pinned and not a candidate for LRU page reclaims,
+ eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments
+ 2. REFERENCED page has been referenced since last LRU list enqueue/requeue
+ 9. RECLAIM page will be reclaimed soon after its pageout IO completed
+11. MMAP a memory mapped page
+12. ANON a memory mapped page that is not part of a file
+13. SWAPCACHE page is mapped to swap space, ie. has an associated swap entry
+14. SWAPBACKED page is backed by swap/RAM
+
Using pagemap to do something useful:
--
* [PATCH 3/5] mm: introduce PageHuge() for testing huge/gigantic pages
From: Wu Fengguang @ 2009-04-28 1:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm
[-- Attachment #1: giga-page.patch --]
[-- Type: text/plain, Size: 2112 bytes --]
Introduce PageHuge(), which identifies huge/gigantic pages
by their dedicated compound destructor functions.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/mm.h | 24 ++++++++++++++++++++++++
mm/hugetlb.c | 2 +-
mm/page_alloc.c | 11 ++++++++++-
3 files changed, 35 insertions(+), 2 deletions(-)
--- mm.orig/mm/page_alloc.c
+++ mm/mm/page_alloc.c
@@ -299,13 +299,22 @@ void prep_compound_page(struct page *pag
}
#ifdef CONFIG_HUGETLBFS
+/*
+ * This (duplicated) destructor function distinguishes gigantic pages from
+ * normal compound pages.
+ */
+void free_gigantic_page(struct page *page)
+{
+ __free_pages_ok(page, compound_order(page));
+}
+
void prep_compound_gigantic_page(struct page *page, unsigned long order)
{
int i;
int nr_pages = 1 << order;
struct page *p = page + 1;
- set_compound_page_dtor(page, free_compound_page);
+ set_compound_page_dtor(page, free_gigantic_page);
set_compound_order(page, order);
__SetPageHead(page);
for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
--- mm.orig/mm/hugetlb.c
+++ mm/mm/hugetlb.c
@@ -550,7 +550,7 @@ struct hstate *size_to_hstate(unsigned l
return NULL;
}
-static void free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
{
/*
* Can't pass hstate in here because it is called from the
--- mm.orig/include/linux/mm.h
+++ mm/include/linux/mm.h
@@ -355,6 +355,30 @@ static inline void set_compound_order(st
page[1].lru.prev = (void *)order;
}
+#ifdef CONFIG_HUGETLBFS
+void free_huge_page(struct page *page);
+void free_gigantic_page(struct page *page);
+
+static inline int PageHuge(struct page *page)
+{
+ compound_page_dtor *dtor;
+
+ if (!PageCompound(page))
+ return 0;
+
+ page = compound_head(page);
+ dtor = get_compound_page_dtor(page);
+
+ return dtor == free_huge_page ||
+ dtor == free_gigantic_page;
+}
+#else
+static inline int PageHuge(struct page *page)
+{
+ return 0;
+}
+#endif
+
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
--
* [PATCH 4/5] proc: kpagecount/kpageflags code cleanup
From: Wu Fengguang @ 2009-04-28 1:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Wu Fengguang, KOSAKI Motohiro, Andi Kleen, linux-mm
[-- Attachment #1: kpageflags-fix-out.patch --]
[-- Type: text/plain, Size: 1254 bytes --]
Move increments of pfn/out to bottom of the loop.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/proc/page.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
--- mm.orig/fs/proc/page.c
+++ mm/fs/proc/page.c
@@ -32,20 +32,22 @@ static ssize_t kpagecount_read(struct fi
return -EINVAL;
while (count > 0) {
- ppage = NULL;
if (pfn_valid(pfn))
ppage = pfn_to_page(pfn);
- pfn++;
+ else
+ ppage = NULL;
if (!ppage)
pcount = 0;
else
pcount = page_mapcount(ppage);
- if (put_user(pcount, out++)) {
+ if (put_user(pcount, out)) {
ret = -EFAULT;
break;
}
+ pfn++;
+ out++;
count -= KPMSIZE;
}
@@ -98,10 +100,10 @@ static ssize_t kpageflags_read(struct fi
return -EINVAL;
while (count > 0) {
- ppage = NULL;
if (pfn_valid(pfn))
ppage = pfn_to_page(pfn);
- pfn++;
+ else
+ ppage = NULL;
if (!ppage)
kflags = 0;
else
@@ -119,11 +121,13 @@ static ssize_t kpageflags_read(struct fi
kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
- if (put_user(uflags, out++)) {
+ if (put_user(uflags, out)) {
ret = -EFAULT;
break;
}
+ pfn++;
+ out++;
count -= KPMSIZE;
}
--
* [PATCH 5/5] proc: export more page flags in /proc/kpageflags
From: Wu Fengguang @ 2009-04-28 1:09 UTC (permalink / raw)
To: Andrew Morton
Cc: LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall, Alexey Dobriyan,
Wu Fengguang, linux-mm
[-- Attachment #1: kpageflags-extending.patch --]
[-- Type: text/plain, Size: 13723 bytes --]
Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
- all available page flags are exported, and
- exported as is
2) for admins and end users
- only the more `well known' flags are exported:
11. KPF_MMAP (pseudo flag) memory mapped page
12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
13. KPF_SWAPCACHE page is in swap cache
14. KPF_SWAPBACKED page is swap/RAM backed
15. KPF_COMPOUND_HEAD (*)
16. KPF_COMPOUND_TAIL (*)
17. KPF_UNEVICTABLE page is in the unevictable LRU list
18. KPF_HWPOISON hardware detected corruption
19. KPF_NOPAGE (pseudo flag) no page frame at the address
(*) For compound pages, exporting _both_ head/tail info enables
users to tell where a compound page starts/ends, and its order.
- limit flags to their typical usage scenario, as indicated by KOSAKI:
- LRU pages: only export relevant flags
- PG_lru
- PG_unevictable
- PG_active
- PG_referenced
- page_mapped()
- PageAnon()
- PG_swapcache
- PG_swapbacked
- PG_reclaim
- no-IO pages: mask out irrelevant flags
- PG_dirty
- PG_uptodate
- PG_writeback
- SLAB pages: mask out overloaded flags:
- PG_error
- PG_active
- PG_private
- PG_reclaim: mask out the overloaded PG_readahead
- compound flags: only export huge/gigantic pages
Here are the admin/linus views of all page flags on a newly booted nfs-root system:
# ./page-types # for admin
flags page-count MB symbolic-flags long-symbolic-flags
0x000000000000 491174 1918 ____________________________
0x000000000020 1 0 _____l______________________ lru
0x000000000028 2543 9 ___U_l______________________ uptodate,lru
0x00000000002c 5288 20 __RU_l______________________ referenced,uptodate,lru
0x000000004060 1 0 _____lA_______b_____________ lru,active,swapbacked
0x000000004064 19 0 __R__lA_______b_____________ referenced,lru,active,swapbacked
0x000000000068 225 0 ___U_lA_____________________ uptodate,lru,active
0x00000000006c 969 3 __RU_lA_____________________ referenced,uptodate,lru,active
0x000000000080 6832 26 _______S____________________ slab
0x000000000400 576 2 __________B_________________ buddy
0x000000000828 1159 4 ___U_l_____M________________ uptodate,lru,mmap
0x00000000082c 310 1 __RU_l_____M________________ referenced,uptodate,lru,mmap
0x000000004860 2 0 _____lA____M__b_____________ lru,active,mmap,swapbacked
0x000000000868 375 1 ___U_lA____M________________ uptodate,lru,active,mmap
0x00000000086c 635 2 __RU_lA____M________________ referenced,uptodate,lru,active,mmap
0x000000005860 3831 14 _____lA____Ma_b_____________ lru,active,mmap,anonymous,swapbacked
0x000000005864 28 0 __R__lA____Ma_b_____________ referenced,lru,active,mmap,anonymous,swapbacked
total 513968 2007
# ./page-types # for linus, when CONFIG_DEBUG_KERNEL is turned on
flags page-count MB symbolic-flags long-symbolic-flags
0x000000000000 471058 1840 ____________________________
0x000100000000 19288 75 ____________________r_______ reserved
0x000000010000 1064 4 ________________T___________ compound_tail
0x000000008000 1 0 _______________H____________ compound_head
0x000000008014 1 0 __R_D__________H____________ referenced,dirty,compound_head
0x000000010014 4 0 __R_D___________T___________ referenced,dirty,compound_tail
0x000000000020 1 0 _____l______________________ lru
0x000000000028 2522 9 ___U_l______________________ uptodate,lru
0x00000000002c 5207 20 __RU_l______________________ referenced,uptodate,lru
0x000000000068 203 0 ___U_lA_____________________ uptodate,lru,active
0x00000000006c 869 3 __RU_lA_____________________ referenced,uptodate,lru,active
0x000000004078 1 0 ___UDlA_______b_____________ uptodate,dirty,lru,active,swapbacked
0x00000000407c 19 0 __RUDlA_______b_____________ referenced,uptodate,dirty,lru,active,swapbacked
0x000000000080 5989 23 _______S____________________ slab
0x000000008080 778 3 _______S_______H____________ slab,compound_head
0x000000000228 44 0 ___U_l___I__________________ uptodate,lru,reclaim
0x00000000022c 39 0 __RU_l___I__________________ referenced,uptodate,lru,reclaim
0x000000000268 12 0 ___U_lA__I__________________ uptodate,lru,active,reclaim
0x00000000026c 44 0 __RU_lA__I__________________ referenced,uptodate,lru,active,reclaim
0x000000000400 550 2 __________B_________________ buddy
0x000000000804 1 0 __R________M________________ referenced,mmap
0x000000000828 1068 4 ___U_l_____M________________ uptodate,lru,mmap
0x00000000082c 326 1 __RU_l_____M________________ referenced,uptodate,lru,mmap
0x000000000868 335 1 ___U_lA____M________________ uptodate,lru,active,mmap
0x00000000086c 599 2 __RU_lA____M________________ referenced,uptodate,lru,active,mmap
0x000000004878 2 0 ___UDlA____M__b_____________ uptodate,dirty,lru,active,mmap,swapbacked
0x000000000a28 44 0 ___U_l___I_M________________ uptodate,lru,reclaim,mmap
0x000000000a2c 12 0 __RU_l___I_M________________ referenced,uptodate,lru,reclaim,mmap
0x000000000a68 8 0 ___U_lA__I_M________________ uptodate,lru,active,reclaim,mmap
0x000000000a6c 31 0 __RU_lA__I_M________________ referenced,uptodate,lru,active,reclaim,mmap
0x000000001000 442 1 ____________a_______________ anonymous
0x000000005808 7 0 ___U_______Ma_b_____________ uptodate,mmap,anonymous,swapbacked
0x000000005868 3371 13 ___U_lA____Ma_b_____________ uptodate,lru,active,mmap,anonymous,swapbacked
0x00000000586c 28 0 __RU_lA____Ma_b_____________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 513968 2007
Thanks to KOSAKI and Andi for their valuable recommendations!
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/proc/page.c | 197 +++++++++++++++++++++++++++++++++++++++--------
1 file changed, 167 insertions(+), 30 deletions(-)
--- mm.orig/fs/proc/page.c
+++ mm/fs/proc/page.c
@@ -6,6 +6,7 @@
#include <linux/mmzone.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
+#include <linux/backing-dev.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -70,19 +71,172 @@ static const struct file_operations proc
/* These macros are used to decouple internal flags from exported ones */
-#define KPF_LOCKED 0
-#define KPF_ERROR 1
-#define KPF_REFERENCED 2
-#define KPF_UPTODATE 3
-#define KPF_DIRTY 4
-#define KPF_LRU 5
-#define KPF_ACTIVE 6
-#define KPF_SLAB 7
-#define KPF_WRITEBACK 8
-#define KPF_RECLAIM 9
-#define KPF_BUDDY 10
+#define KPF_LOCKED 0
+#define KPF_ERROR 1
+#define KPF_REFERENCED 2
+#define KPF_UPTODATE 3
+#define KPF_DIRTY 4
+#define KPF_LRU 5
+#define KPF_ACTIVE 6
+#define KPF_SLAB 7
+#define KPF_WRITEBACK 8
+#define KPF_RECLAIM 9
+#define KPF_BUDDY 10
+
+/* new additions in 2.6.31 */
+#define KPF_MMAP 11
+#define KPF_ANON 12
+#define KPF_SWAPCACHE 13
+#define KPF_SWAPBACKED 14
+#define KPF_COMPOUND_HEAD 15
+#define KPF_COMPOUND_TAIL 16
+#define KPF_UNEVICTABLE 17
+#define KPF_HWPOISON 18
+#define KPF_NOPAGE 19
+
+/* kernel hacking assistances */
+#define KPF_RESERVED 32
+#define KPF_MLOCKED 33
+#define KPF_MAPPEDTODISK 34
+#define KPF_PRIVATE 35
+#define KPF_PRIVATE2 36
+#define KPF_OWNER_PRIVATE 37
+#define KPF_ARCH 38
+#define KPF_UNCACHED 39
+
+/*
+ * Kernel flags are exported faithfully to Linus and his fellow hackers.
+ * Otherwise some details are masked to avoid confusing the end user:
+ * - some kernel flags are completely invisible
+ * - some kernel flags are conditionally invisible on their odd usages
+ */
+#ifdef CONFIG_DEBUG_KERNEL
+static inline int genuine_linus(void) { return 1; }
+#else
+static inline int genuine_linus(void) { return 0; }
+#endif
+
+#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
+ do { \
+ if (visible || genuine_linus()) \
+ uflags |= ((kflags >> kbit) & 1) << ubit; \
+ } while (0)
+
+/* a helper function _not_ intended for more general uses */
+static inline int page_cap_writeback_dirty(struct page *page)
+{
+ struct address_space *mapping;
+
+ if (!PageSlab(page))
+ mapping = page_mapping(page);
+ else
+ mapping = NULL;
+
+ return mapping && mapping_cap_writeback_dirty(mapping);
+}
+
+static u64 get_uflags(struct page *page)
+{
+ u64 k;
+ u64 u;
+ int io;
+ int lru;
+ int slab;
+
+ /*
+ * pseudo flag: KPF_NOPAGE
+ * it differentiates a memory hole from a page with no flags
+ */
+ if (!page)
+ return 1 << KPF_NOPAGE;
+
+ k = page->flags;
+ u = 0;
+
+ io = page_cap_writeback_dirty(page);
+ lru = k & (1 << PG_lru);
+ slab = k & (1 << PG_slab);
+
+ /*
+ * pseudo flags for the well known (anonymous) memory mapped pages
+ */
+ if (lru || genuine_linus()) {
+ if (!slab && page_mapped(page))
+ u |= 1 << KPF_MMAP;
+ if (PageAnon(page))
+ u |= 1 << KPF_ANON;
+ }
-#define kpf_copy_bit(flags, dstpos, srcpos) (((flags >> srcpos) & 1) << dstpos)
+ /*
+ * compound pages: export both head/tail info
+ * they together define a compound page's start/end pos and order
+ */
+ if (PageHuge(page) || genuine_linus()) {
+ if (PageHead(page))
+ u |= 1 << KPF_COMPOUND_HEAD;
+ if (PageTail(page))
+ u |= 1 << KPF_COMPOUND_TAIL;
+ }
+
+ kpf_copy_bit(u, k, 1, KPF_LOCKED, PG_locked);
+
+ /*
+ * Caveats on high order pages:
+ * PG_buddy will only be set on the head page; SLUB/SLQB do the same
+ * for PG_slab; SLOB won't set PG_slab at all on compound pages.
+ */
+ kpf_copy_bit(u, k, 1, KPF_SLAB, PG_slab);
+ kpf_copy_bit(u, k, 1, KPF_BUDDY, PG_buddy);
+
+ kpf_copy_bit(u, k, io, KPF_ERROR, PG_error);
+ kpf_copy_bit(u, k, io, KPF_DIRTY, PG_dirty);
+ kpf_copy_bit(u, k, io, KPF_UPTODATE, PG_uptodate);
+ kpf_copy_bit(u, k, io, KPF_WRITEBACK, PG_writeback);
+
+ kpf_copy_bit(u, k, 1, KPF_LRU, PG_lru);
+ kpf_copy_bit(u, k, lru, KPF_REFERENCED, PG_referenced);
+ kpf_copy_bit(u, k, lru, KPF_ACTIVE, PG_active);
+ kpf_copy_bit(u, k, lru, KPF_RECLAIM, PG_reclaim);
+
+ kpf_copy_bit(u, k, lru, KPF_SWAPCACHE, PG_swapcache);
+ kpf_copy_bit(u, k, lru, KPF_SWAPBACKED, PG_swapbacked);
+
+#ifdef CONFIG_MEMORY_FAILURE
+ kpf_copy_bit(u, k, 1, KPF_HWPOISON, PG_hwpoison);
+#endif
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+ kpf_copy_bit(u, k, lru, KPF_UNEVICTABLE, PG_unevictable);
+ kpf_copy_bit(u, k, 0, KPF_MLOCKED, PG_mlocked);
+#endif
+
+ kpf_copy_bit(u, k, 0, KPF_RESERVED, PG_reserved);
+ kpf_copy_bit(u, k, 0, KPF_MAPPEDTODISK, PG_mappedtodisk);
+ kpf_copy_bit(u, k, 0, KPF_PRIVATE, PG_private);
+ kpf_copy_bit(u, k, 0, KPF_PRIVATE2, PG_private_2);
+ kpf_copy_bit(u, k, 0, KPF_OWNER_PRIVATE, PG_owner_priv_1);
+ kpf_copy_bit(u, k, 0, KPF_ARCH, PG_arch_1);
+
+#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
+ kpf_copy_bit(u, k, 0, KPF_UNCACHED, PG_uncached);
+#endif
+
+ if (!genuine_linus()) {
+ /*
+ * SLUB overloads some page flags which may confuse end user.
+ */
+ if (slab)
+ u &= ~((1 << KPF_ACTIVE) | (1 << KPF_ERROR));
+ /*
+ * PG_reclaim could be overloaded as PG_readahead,
+ * and we only want to export the first one.
+ */
+ if (!(u & (1 << KPF_WRITEBACK)))
+ u &= ~(1 << KPF_RECLAIM);
+ }
+
+ return u;
+}
static ssize_t kpageflags_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
@@ -92,7 +246,6 @@ static ssize_t kpageflags_read(struct fi
unsigned long src = *ppos;
unsigned long pfn;
ssize_t ret = 0;
- u64 kflags, uflags;
pfn = src / KPMSIZE;
count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
@@ -104,24 +257,8 @@ static ssize_t kpageflags_read(struct fi
ppage = pfn_to_page(pfn);
else
ppage = NULL;
- if (!ppage)
- kflags = 0;
- else
- kflags = ppage->flags;
-
- uflags = kpf_copy_bit(kflags, KPF_LOCKED, PG_locked) |
- kpf_copy_bit(kflags, KPF_ERROR, PG_error) |
- kpf_copy_bit(kflags, KPF_REFERENCED, PG_referenced) |
- kpf_copy_bit(kflags, KPF_UPTODATE, PG_uptodate) |
- kpf_copy_bit(kflags, KPF_DIRTY, PG_dirty) |
- kpf_copy_bit(kflags, KPF_LRU, PG_lru) |
- kpf_copy_bit(kflags, KPF_ACTIVE, PG_active) |
- kpf_copy_bit(kflags, KPF_SLAB, PG_slab) |
- kpf_copy_bit(kflags, KPF_WRITEBACK, PG_writeback) |
- kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
- kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
- if (put_user(uflags, out)) {
+ if (put_user(get_uflags(ppage), out)) {
ret = -EFAULT;
break;
}
--
^ permalink raw reply [flat|nested] 137+ messages in thread
* [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28 1:09 ` Wu Fengguang
0 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28 1:09 UTC (permalink / raw)
To: Andrew Morton
Cc: LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall, Alexey Dobriyan,
Wu Fengguang, linux-mm
[-- Attachment #1: kpageflags-extending.patch --]
[-- Type: text/plain, Size: 13948 bytes --]
Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
- all available page flags are exported, and
- exported as is
2) for admins and end users
- only the more `well known' flags are exported:
11. KPF_MMAP (pseudo flag) memory mapped page
12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
13. KPF_SWAPCACHE page is in swap cache
14. KPF_SWAPBACKED page is swap/RAM backed
15. KPF_COMPOUND_HEAD (*)
16. KPF_COMPOUND_TAIL (*)
17. KPF_UNEVICTABLE page is in the unevictable LRU list
18. KPF_HWPOISON hardware detected corruption
19. KPF_NOPAGE (pseudo flag) no page frame at the address
(*) For compound pages, exporting _both_ head/tail info enables
users to tell where a compound page starts/ends, and its order.
- limit flags to their typical usage scenario, as indicated by KOSAKI:
- LRU pages: only export relevant flags
- PG_lru
- PG_unevictable
- PG_active
- PG_referenced
- page_mapped()
- PageAnon()
- PG_swapcache
- PG_swapbacked
- PG_reclaim
- no-IO pages: mask out irrelevant flags
- PG_dirty
- PG_uptodate
- PG_writeback
- SLAB pages: mask out overloaded flags:
- PG_error
- PG_active
- PG_private
- PG_reclaim: mask out the overloaded PG_readahead
- compound flags: only export huge/gigantic pages
Here are the admin/linus views of all page flags on a newly booted nfs-root system:
# ./page-types # for admin
flags page-count MB symbolic-flags long-symbolic-flags
0x000000000000 491174 1918 ____________________________
0x000000000020 1 0 _____l______________________ lru
0x000000000028 2543 9 ___U_l______________________ uptodate,lru
0x00000000002c 5288 20 __RU_l______________________ referenced,uptodate,lru
0x000000004060 1 0 _____lA_______b_____________ lru,active,swapbacked
0x000000004064 19 0 __R__lA_______b_____________ referenced,lru,active,swapbacked
0x000000000068 225 0 ___U_lA_____________________ uptodate,lru,active
0x00000000006c 969 3 __RU_lA_____________________ referenced,uptodate,lru,active
0x000000000080 6832 26 _______S____________________ slab
0x000000000400 576 2 __________B_________________ buddy
0x000000000828 1159 4 ___U_l_____M________________ uptodate,lru,mmap
0x00000000082c 310 1 __RU_l_____M________________ referenced,uptodate,lru,mmap
0x000000004860 2 0 _____lA____M__b_____________ lru,active,mmap,swapbacked
0x000000000868 375 1 ___U_lA____M________________ uptodate,lru,active,mmap
0x00000000086c 635 2 __RU_lA____M________________ referenced,uptodate,lru,active,mmap
0x000000005860 3831 14 _____lA____Ma_b_____________ lru,active,mmap,anonymous,swapbacked
0x000000005864 28 0 __R__lA____Ma_b_____________ referenced,lru,active,mmap,anonymous,swapbacked
total 513968 2007
# ./page-types # for linus, when CONFIG_DEBUG_KERNEL is turned on
flags page-count MB symbolic-flags long-symbolic-flags
0x000000000000 471058 1840 ____________________________
0x000100000000 19288 75 ____________________r_______ reserved
0x000000010000 1064 4 ________________T___________ compound_tail
0x000000008000 1 0 _______________H____________ compound_head
0x000000008014 1 0 __R_D__________H____________ referenced,dirty,compound_head
0x000000010014 4 0 __R_D___________T___________ referenced,dirty,compound_tail
0x000000000020 1 0 _____l______________________ lru
0x000000000028 2522 9 ___U_l______________________ uptodate,lru
0x00000000002c 5207 20 __RU_l______________________ referenced,uptodate,lru
0x000000000068 203 0 ___U_lA_____________________ uptodate,lru,active
0x00000000006c 869 3 __RU_lA_____________________ referenced,uptodate,lru,active
0x000000004078 1 0 ___UDlA_______b_____________ uptodate,dirty,lru,active,swapbacked
0x00000000407c 19 0 __RUDlA_______b_____________ referenced,uptodate,dirty,lru,active,swapbacked
0x000000000080 5989 23 _______S____________________ slab
0x000000008080 778 3 _______S_______H____________ slab,compound_head
0x000000000228 44 0 ___U_l___I__________________ uptodate,lru,reclaim
0x00000000022c 39 0 __RU_l___I__________________ referenced,uptodate,lru,reclaim
0x000000000268 12 0 ___U_lA__I__________________ uptodate,lru,active,reclaim
0x00000000026c 44 0 __RU_lA__I__________________ referenced,uptodate,lru,active,reclaim
0x000000000400 550 2 __________B_________________ buddy
0x000000000804 1 0 __R________M________________ referenced,mmap
0x000000000828 1068 4 ___U_l_____M________________ uptodate,lru,mmap
0x00000000082c 326 1 __RU_l_____M________________ referenced,uptodate,lru,mmap
0x000000000868 335 1 ___U_lA____M________________ uptodate,lru,active,mmap
0x00000000086c 599 2 __RU_lA____M________________ referenced,uptodate,lru,active,mmap
0x000000004878 2 0 ___UDlA____M__b_____________ uptodate,dirty,lru,active,mmap,swapbacked
0x000000000a28 44 0 ___U_l___I_M________________ uptodate,lru,reclaim,mmap
0x000000000a2c 12 0 __RU_l___I_M________________ referenced,uptodate,lru,reclaim,mmap
0x000000000a68 8 0 ___U_lA__I_M________________ uptodate,lru,active,reclaim,mmap
0x000000000a6c 31 0 __RU_lA__I_M________________ referenced,uptodate,lru,active,reclaim,mmap
0x000000001000 442 1 ____________a_______________ anonymous
0x000000005808 7 0 ___U_______Ma_b_____________ uptodate,mmap,anonymous,swapbacked
0x000000005868 3371 13 ___U_lA____Ma_b_____________ uptodate,lru,active,mmap,anonymous,swapbacked
0x00000000586c 28 0 __RU_lA____Ma_b_____________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 513968 2007
Thanks to KOSAKI and Andi for their valuable recommendations!
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/proc/page.c | 197 +++++++++++++++++++++++++++++++++++++++--------
1 file changed, 167 insertions(+), 30 deletions(-)
--- mm.orig/fs/proc/page.c
+++ mm/fs/proc/page.c
@@ -6,6 +6,7 @@
#include <linux/mmzone.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
+#include <linux/backing-dev.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -70,19 +71,172 @@ static const struct file_operations proc
/* These macros are used to decouple internal flags from exported ones */
-#define KPF_LOCKED 0
-#define KPF_ERROR 1
-#define KPF_REFERENCED 2
-#define KPF_UPTODATE 3
-#define KPF_DIRTY 4
-#define KPF_LRU 5
-#define KPF_ACTIVE 6
-#define KPF_SLAB 7
-#define KPF_WRITEBACK 8
-#define KPF_RECLAIM 9
-#define KPF_BUDDY 10
+#define KPF_LOCKED 0
+#define KPF_ERROR 1
+#define KPF_REFERENCED 2
+#define KPF_UPTODATE 3
+#define KPF_DIRTY 4
+#define KPF_LRU 5
+#define KPF_ACTIVE 6
+#define KPF_SLAB 7
+#define KPF_WRITEBACK 8
+#define KPF_RECLAIM 9
+#define KPF_BUDDY 10
+
+/* new additions in 2.6.31 */
+#define KPF_MMAP 11
+#define KPF_ANON 12
+#define KPF_SWAPCACHE 13
+#define KPF_SWAPBACKED 14
+#define KPF_COMPOUND_HEAD 15
+#define KPF_COMPOUND_TAIL 16
+#define KPF_UNEVICTABLE 17
+#define KPF_HWPOISON 18
+#define KPF_NOPAGE 19
+
+/* kernel hacking assistances */
+#define KPF_RESERVED 32
+#define KPF_MLOCKED 33
+#define KPF_MAPPEDTODISK 34
+#define KPF_PRIVATE 35
+#define KPF_PRIVATE2 36
+#define KPF_OWNER_PRIVATE 37
+#define KPF_ARCH 38
+#define KPF_UNCACHED 39
+
+/*
+ * Kernel flags are exported faithfully to Linus and his fellow hackers.
+ * Otherwise some details are masked to avoid confusing the end user:
+ * - some kernel flags are completely invisible
+ * - some kernel flags are conditionally invisible on their odd usages
+ */
+#ifdef CONFIG_DEBUG_KERNEL
+static inline int genuine_linus(void) { return 1; }
+#else
+static inline int genuine_linus(void) { return 0; }
+#endif
+
+#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
+ do { \
+ if (visible || genuine_linus()) \
+ uflags |= ((kflags >> kbit) & 1) << ubit; \
+ } while (0)
+
+/* a helper function _not_ intended for more general uses */
+static inline int page_cap_writeback_dirty(struct page *page)
+{
+ struct address_space *mapping;
+
+ if (!PageSlab(page))
+ mapping = page_mapping(page);
+ else
+ mapping = NULL;
+
+ return mapping && mapping_cap_writeback_dirty(mapping);
+}
+
+static u64 get_uflags(struct page *page)
+{
+ u64 k;
+ u64 u;
+ int io;
+ int lru;
+ int slab;
+
+ /*
+ * pseudo flag: KPF_NOPAGE
+ * it differentiates a memory hole from a page with no flags
+ */
+ if (!page)
+ return 1 << KPF_NOPAGE;
+
+ k = page->flags;
+ u = 0;
+
+ io = page_cap_writeback_dirty(page);
+ lru = k & (1 << PG_lru);
+ slab = k & (1 << PG_slab);
+
+ /*
+ * pseudo flags for the well known (anonymous) memory mapped pages
+ */
+ if (lru || genuine_linus()) {
+ if (!slab && page_mapped(page))
+ u |= 1 << KPF_MMAP;
+ if (PageAnon(page))
+ u |= 1 << KPF_ANON;
+ }
-#define kpf_copy_bit(flags, dstpos, srcpos) (((flags >> srcpos) & 1) << dstpos)
+ /*
+ * compound pages: export both head/tail info
+ * they together define a compound page's start/end pos and order
+ */
+ if (PageHuge(page) || genuine_linus()) {
+ if (PageHead(page))
+ u |= 1 << KPF_COMPOUND_HEAD;
+ if (PageTail(page))
+ u |= 1 << KPF_COMPOUND_TAIL;
+ }
+
+ kpf_copy_bit(u, k, 1, KPF_LOCKED, PG_locked);
+
+ /*
+ * Caveats on high order pages:
+ * PG_buddy will only be set on the head page; SLUB/SLQB do the same
+ * for PG_slab; SLOB won't set PG_slab at all on compound pages.
+ */
+ kpf_copy_bit(u, k, 1, KPF_SLAB, PG_slab);
+ kpf_copy_bit(u, k, 1, KPF_BUDDY, PG_buddy);
+
+ kpf_copy_bit(u, k, io, KPF_ERROR, PG_error);
+ kpf_copy_bit(u, k, io, KPF_DIRTY, PG_dirty);
+ kpf_copy_bit(u, k, io, KPF_UPTODATE, PG_uptodate);
+ kpf_copy_bit(u, k, io, KPF_WRITEBACK, PG_writeback);
+
+ kpf_copy_bit(u, k, 1, KPF_LRU, PG_lru);
+ kpf_copy_bit(u, k, lru, KPF_REFERENCED, PG_referenced);
+ kpf_copy_bit(u, k, lru, KPF_ACTIVE, PG_active);
+ kpf_copy_bit(u, k, lru, KPF_RECLAIM, PG_reclaim);
+
+ kpf_copy_bit(u, k, lru, KPF_SWAPCACHE, PG_swapcache);
+ kpf_copy_bit(u, k, lru, KPF_SWAPBACKED, PG_swapbacked);
+
+#ifdef CONFIG_MEMORY_FAILURE
+ kpf_copy_bit(u, k, 1, KPF_HWPOISON, PG_hwpoison);
+#endif
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+ kpf_copy_bit(u, k, lru, KPF_UNEVICTABLE, PG_unevictable);
+ kpf_copy_bit(u, k, 0, KPF_MLOCKED, PG_mlocked);
+#endif
+
+ kpf_copy_bit(u, k, 0, KPF_RESERVED, PG_reserved);
+ kpf_copy_bit(u, k, 0, KPF_MAPPEDTODISK, PG_mappedtodisk);
+ kpf_copy_bit(u, k, 0, KPF_PRIVATE, PG_private);
+ kpf_copy_bit(u, k, 0, KPF_PRIVATE2, PG_private_2);
+ kpf_copy_bit(u, k, 0, KPF_OWNER_PRIVATE, PG_owner_priv_1);
+ kpf_copy_bit(u, k, 0, KPF_ARCH, PG_arch_1);
+
+#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
+ kpf_copy_bit(u, k, 0, KPF_UNCACHED, PG_uncached);
+#endif
+
+ if (!genuine_linus()) {
+ /*
+ * SLUB overloads some page flags which may confuse end user.
+ */
+ if (slab)
+ u &= ~((1 << KPF_ACTIVE) | (1 << KPF_ERROR));
+ /*
+ * PG_reclaim could be overloaded as PG_readahead,
+ * and we only want to export the first one.
+ */
+ if (!(u & (1 << KPF_WRITEBACK)))
+ u &= ~(1 << KPF_RECLAIM);
+ }
+
+ return u;
+}
static ssize_t kpageflags_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
@@ -92,7 +246,6 @@ static ssize_t kpageflags_read(struct fi
unsigned long src = *ppos;
unsigned long pfn;
ssize_t ret = 0;
- u64 kflags, uflags;
pfn = src / KPMSIZE;
count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
@@ -104,24 +257,8 @@ static ssize_t kpageflags_read(struct fi
ppage = pfn_to_page(pfn);
else
ppage = NULL;
- if (!ppage)
- kflags = 0;
- else
- kflags = ppage->flags;
-
- uflags = kpf_copy_bit(kflags, KPF_LOCKED, PG_locked) |
- kpf_copy_bit(kflags, KPF_ERROR, PG_error) |
- kpf_copy_bit(kflags, KPF_REFERENCED, PG_referenced) |
- kpf_copy_bit(kflags, KPF_UPTODATE, PG_uptodate) |
- kpf_copy_bit(kflags, KPF_DIRTY, PG_dirty) |
- kpf_copy_bit(kflags, KPF_LRU, PG_lru) |
- kpf_copy_bit(kflags, KPF_ACTIVE, PG_active) |
- kpf_copy_bit(kflags, KPF_SLAB, PG_slab) |
- kpf_copy_bit(kflags, KPF_WRITEBACK, PG_writeback) |
- kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
- kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
- if (put_user(uflags, out)) {
+ if (put_user(get_uflags(ppage), out)) {
ret = -EFAULT;
break;
}
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 1:09 ` Wu Fengguang
@ 2009-04-28 6:55 ` Ingo Molnar
-1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 6:55 UTC (permalink / raw)
To: Wu Fengguang, Steven Rostedt, Frédéric Weisbecker,
Larry Woodman, Peter Zijlstra, Pekka Enberg,
Eduard - Gabriel Munteanu
Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall,
Alexey Dobriyan, linux-mm
* Wu Fengguang <fengguang.wu@intel.com> wrote:
> Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
>
> 1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
> - all available page flags are exported, and
> - exported as is
> 2) for admins and end users
> - only the more `well known' flags are exported:
> 11. KPF_MMAP (pseudo flag) memory mapped page
> 12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
> 13. KPF_SWAPCACHE page is in swap cache
> 14. KPF_SWAPBACKED page is swap/RAM backed
> 15. KPF_COMPOUND_HEAD (*)
> 16. KPF_COMPOUND_TAIL (*)
> 17. KPF_UNEVICTABLE page is in the unevictable LRU list
> 18. KPF_HWPOISON hardware detected corruption
> 19. KPF_NOPAGE (pseudo flag) no page frame at the address
>
> (*) For compound pages, exporting _both_ head/tail info enables
> users to tell where a compound page starts/ends, and its order.
>
> - limit flags to their typical usage scenario, as indicated by KOSAKI:
> - LRU pages: only export relevant flags
> - PG_lru
> - PG_unevictable
> - PG_active
> - PG_referenced
> - page_mapped()
> - PageAnon()
> - PG_swapcache
> - PG_swapbacked
> - PG_reclaim
> - no-IO pages: mask out irrelevant flags
> - PG_dirty
> - PG_uptodate
> - PG_writeback
> - SLAB pages: mask out overloaded flags:
> - PG_error
> - PG_active
> - PG_private
> - PG_reclaim: mask out the overloaded PG_readahead
> - compound flags: only export huge/gigantic pages
>
> Here are the admin/linus views of all page flags on a newly booted nfs-root system:
>
> # ./page-types # for admin
> flags page-count MB symbolic-flags long-symbolic-flags
> 0x000000000000 491174 1918 ____________________________
> 0x000000000020 1 0 _____l______________________ lru
> 0x000000000028 2543 9 ___U_l______________________ uptodate,lru
> 0x00000000002c 5288 20 __RU_l______________________ referenced,uptodate,lru
> 0x000000004060 1 0 _____lA_______b_____________ lru,active,swapbacked
I think i have to NAK this kind of ad-hoc instrumentation of kernel
internals and statistics until we clear up why such instrumentation
measures are being accepted into the MM while other, more dynamic
and more flexible MM instrumentation is being resisted by Andrew.
The above type of condensed information can be built out of dynamic
trace data too - and much more. Being able to track page state
transitions is very valuable when debugging VM problems. One such
'view' of trace data would be a summary histogram like above.
( done after a "echo 3 > /proc/sys/vm/drop_caches" to make sure all
interesting pages have been re-established and their state is
present in the trace. )
The SLAB code already has such a facility, kmemtrace: it's very
useful and successful in visualizing complex SLAB details, both
dynamically and statically.
I think the same general approach should be used for the page
allocator too (and for the page cache and some other struct page
based caches): the life-time of an object should be followed. If we
capture the important details we capture the big picture too. Pekka
already sent an RFC patch to extend kmemtrace in such a fashion. Why
is that more useful method not being pursued?
By extending upon the (existing) /proc/kpageflags hack a usecase is
taken away from the tracing based solution and a needless overlap is
created - and that's not particularly helpful IMHO. We now have all
the facilities upstream that allow us to do intelligent
instrumentation - we should make use of them.
Ingo
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 1/5] pagemap: document clarifications
2009-04-28 1:09 ` Wu Fengguang
@ 2009-04-28 7:11 ` Tommi Rantala
-1 siblings, 0 replies; 137+ messages in thread
From: Tommi Rantala @ 2009-04-28 7:11 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen, linux-mm
2009/4/28 Wu Fengguang <fengguang.wu@intel.com>:
> Some bit ranges were inclusive and some not.
> Fix them to be consistently inclusive.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> Documentation/vm/pagemap.txt | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> --- mm.orig/Documentation/vm/pagemap.txt
> +++ mm/Documentation/vm/pagemap.txt
> @@ -12,9 +12,9 @@ There are three components to pagemap:
> value for each virtual page, containing the following data (from
> fs/proc/task_mmu.c, above pagemap_read):
>
> - * Bits 0-55 page frame number (PFN) if present
> + * Bits 0-54 page frame number (PFN) if present
> * Bits 0-4 swap type if swapped
> - * Bits 5-55 swap offset if swapped
> + * Bits 5-54 swap offset if swapped
> * Bits 55-60 page shift (page size = 1<<page shift)
> * Bit 61 reserved for future use
> * Bit 62 page swapped
The same fix should be applied to fs/proc/task_mmu.c as well,
it includes the same description of the bits.
Regards,
Tommi Rantala
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 6:55 ` Ingo Molnar
@ 2009-04-28 7:40 ` Andi Kleen
-1 siblings, 0 replies; 137+ messages in thread
From: Andi Kleen @ 2009-04-28 7:40 UTC (permalink / raw)
To: Ingo Molnar
Cc: Wu Fengguang, Steven Rostedt, Frédéric Weisbecker,
Larry Woodman, Peter Zijlstra, Pekka Enberg,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Andi Kleen, Matt Mackall, Alexey Dobriyan, linux-mm
> I think i have to NAK this kind of ad-hoc instrumentation of kernel
> internals and statistics until we clear up why such instrumentation
I think because it has zero fast path overhead and can be used
any time without enabling anything special.
> measures are being accepted into the MM while other, more dynamic
While the dynamic instrumentation you're proposing
has non-zero fast path overhead, especially if you consider the
CPU time needed for the backend computation in user space too.
And it requires explicit tracing first and some backend
that counts the events and maintains a shadow data structure
covering all of mem_map again.
So it's clear your alternative will be much more costly, plus
have additional drawbacks (needs enabling first, cannot
take a snapshot at an arbitrary time)
Also dynamic tracing tends to have trouble with full memory
observation. I experimented with systemtap tracing for my
memory usage paper I did a couple of years ago, but ended
up with integrated counters (similar to those) because it was
impossible to do proper accounting for the pages set up
in early boot with the standard tracers.
I suspect both have their uses (there are indeed some things
that can only be done with dynamic tracing), but they're clearly
complementary and the static facility seems useful enough
on its own.
I think Fengguang is demonstrating that clearly by the great
improvements he's doing for readahead, which are enabled by these
patches.
-Andi
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 6:55 ` Ingo Molnar
@ 2009-04-28 8:33 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28 8:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Frédéric Weisbecker, Larry Woodman,
Peter Zijlstra, Pekka Enberg, Eduard - Gabriel Munteanu,
Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall,
Alexey Dobriyan, linux-mm
On Tue, Apr 28, 2009 at 08:55:07AM +0200, Ingo Molnar wrote:
>
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> >
> > 1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
> > - all available page flags are exported, and
> > - exported as is
> > 2) for admins and end users
> > - only the more `well known' flags are exported:
> > 11. KPF_MMAP (pseudo flag) memory mapped page
> > 12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
> > 13. KPF_SWAPCACHE page is in swap cache
> > 14. KPF_SWAPBACKED page is swap/RAM backed
> > 15. KPF_COMPOUND_HEAD (*)
> > 16. KPF_COMPOUND_TAIL (*)
> > 17. KPF_UNEVICTABLE page is in the unevictable LRU list
> > 18. KPF_HWPOISON hardware detected corruption
> > 19. KPF_NOPAGE (pseudo flag) no page frame at the address
> >
> > (*) For compound pages, exporting _both_ head/tail info enables
> > users to tell where a compound page starts/ends, and its order.
> >
> > - limit flags to their typical usage scenario, as indicated by KOSAKI:
> > - LRU pages: only export relevant flags
> > - PG_lru
> > - PG_unevictable
> > - PG_active
> > - PG_referenced
> > - page_mapped()
> > - PageAnon()
> > - PG_swapcache
> > - PG_swapbacked
> > - PG_reclaim
> > - no-IO pages: mask out irrelevant flags
> > - PG_dirty
> > - PG_uptodate
> > - PG_writeback
> > - SLAB pages: mask out overloaded flags:
> > - PG_error
> > - PG_active
> > - PG_private
> > - PG_reclaim: mask out the overloaded PG_readahead
> > - compound flags: only export huge/gigantic pages
> >
> > Here are the admin/linus views of all page flags on a newly booted nfs-root system:
> >
> > # ./page-types # for admin
> > flags page-count MB symbolic-flags long-symbolic-flags
> > 0x000000000000 491174 1918 ____________________________
> > 0x000000000020 1 0 _____l______________________ lru
> > 0x000000000028 2543 9 ___U_l______________________ uptodate,lru
> > 0x00000000002c 5288 20 __RU_l______________________ referenced,uptodate,lru
> > 0x000000004060 1 0 _____lA_______b_____________ lru,active,swapbacked
>
> I think i have to NAK this kind of ad-hoc instrumentation of kernel
> internals and statistics until we clear up why such instrumentation
> measures are being accepted into the MM while other, more dynamic
> and more flexible MM instrumentation are being resisted by Andrew.
An unexpected NAK - why throw away an orange just because we are about to get an apple? ;-)
Anyway, here is the missing rationale.
1) FAST
It takes merely 0.2s to scan 4GB worth of pages:
./page-types 0.02s user 0.20s system 99% cpu 0.216 total
2) SIMPLE
/proc/kpageflags will be a *long standing* hack we have to live with -
it was originally introduced by Matt to do shared memory accounting and
to provide a facility for analyzing applications' memory consumption,
in the hope that it would also help kernel developers someday.
So why not extend and embrace it, in a straightforward way?
3) USE CASES
I have/will take advantage of the above page-types command in a number of ways:
- to help track down memory leaks (the recent trace/ring_buffer.c case)
- to estimate the system-wide readahead miss ratio
- Andi wants to examine the major page types in different workloads
(for the hwpoison work)
- Me too, for the fun of learning: read/write/lock/whatever a lot of pages
and examine their flags, to get an idea of some random kernel behaviors.
(the dynamic tracing tools can be more helpful, as a different view)
4) COMPLEMENTARITY
In some cases a dynamic tracing tool is not enough (or is too complex)
for rebuilding a view of the current status.
I myself have a dynamic readahead tracing tool (very useful!).
At the same time I also use readahead accounting numbers, the
/proc/filecache tool (frequently!), and the above page-types tool.
I simply need them all - they are handy for different cases.
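To make the flag layout concrete, here is a minimal Python sketch of the decoding a tool like page-types performs. Bit positions 0-10 are the flags mainline already exports, 11-19 follow the numbering in this series; actually reading /proc/kpageflags (one little-endian u64 per PFN, root only) is omitted here.

```python
# Bit positions for /proc/kpageflags words: 0-10 are the flags mainline
# already exports; 11-19 follow the numbering in this patch series.
KPF_NAMES = {
    0: "locked", 1: "error", 2: "referenced", 3: "uptodate",
    4: "dirty", 5: "lru", 6: "active", 7: "slab",
    8: "writeback", 9: "reclaim", 10: "buddy",
    11: "mmap", 12: "anon", 13: "swapcache", 14: "swapbacked",
    15: "compound_head", 16: "compound_tail",
    17: "unevictable", 18: "hwpoison", 19: "nopage",
}

def decode(flags):
    """Turn one kpageflags u64 into the long-symbolic-flags column."""
    return ",".join(name for bit, name in sorted(KPF_NAMES.items())
                    if flags & (1 << bit))

def compound_pages(words):
    """Map head/tail bits of consecutive PFNs to (head index, page count).

    Exporting _both_ head and tail bits lets a reader recover where each
    compound page starts, where it ends, and hence its order.
    """
    extents = []
    i = 0
    while i < len(words):
        if words[i] & (1 << 15):            # compound head starts an extent
            n = 1
            while i + n < len(words) and words[i + n] & (1 << 16):
                n += 1                      # count the compound tails
            extents.append((i, n))
            i += n
        else:
            i += 1
    return extents
```

For example, `decode(0x2c)` yields `referenced,uptodate,lru`, matching the sample histogram above, and a head followed by three tails decodes as one extent of 4 pages, i.e. an order-2 compound page.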
Thanks,
Fengguang
> The above type of condensed information can be built out of dynamic
> trace data too - and much more. Being able to track page state
> transitions is very valuable when debugging VM problems. One such
> 'view' of trace data would be a summary histogram like above.
>
> ( done after a "echo 3 > /proc/sys/vm/drop_caches" to make sure all
> interesting pages have been re-established and their state is
> present in the trace. )
>
> The SLAB code already has such a facility, kmemtrace: it's very
> useful and successful in visualizing complex SLAB details, both
> dynamically and statically.
>
> I think the same general approach should be used for the page
> allocator too (and for the page cache and some other struct page
> based caches): the life-time of an object should be followed. If we
> capture the important details we capture the big picture too. Pekka
> already sent an RFC patch to extend kmemtrace in such a fashion. Why
> is that more useful method not being pursued?
>
> By extending upon the (existing) /proc/kpageflags hack a usecase is
> taken away from the tracing based solution and a needless overlap is
> created - and that's not particularly helpful IMHO. We now have all
> the facilities upstream that allow us to do intelligent
> instrumentation - we should make use of them.
>
> Ingo
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 7:40 ` Andi Kleen
@ 2009-04-28 9:04 ` Pekka Enberg
-1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28 9:04 UTC (permalink / raw)
To: Andi Kleen
Cc: Ingo Molnar, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Matt Mackall, Alexey Dobriyan, linux-mm
Hi Andi,
On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
> > I think i have to NAK this kind of ad-hoc instrumentation of kernel
> > internals and statistics until we clear up why such instrumentation
>
> I think because it has zero fast path overhead and can be used
> any time without enabling anything special.
Yes, zero overhead is important for certain things (like
CONFIG_SLUB_STATS, for example). However, putting slab allocator
specific checks in fs/proc looks pretty fragile to me. It would be nice
to have this under the "kmemtrace umbrella" so that there's just one
place that needs to be fixed up when allocators change.
Also, while you probably don't want to use tracepoints for this kind of
instrumentation, you might want to look into reusing the ftrace
reporting bits.
Pekka
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:04 ` Pekka Enberg
@ 2009-04-28 9:10 ` Andi Kleen
-1 siblings, 0 replies; 137+ messages in thread
From: Andi Kleen @ 2009-04-28 9:10 UTC (permalink / raw)
To: Pekka Enberg
Cc: Andi Kleen, Ingo Molnar, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Matt Mackall, Alexey Dobriyan, linux-mm
> Yes, zero overhead is important for certain things (like
> CONFIG_SLUB_STATS, for example). However, putting slab allocator
> specific checks in fs/proc looks pretty fragile to me. It would be nice
Ok, perhaps that could be put into an inline in slab.h. Would
that address your concerns?
> Also, while you probably don't want to use tracepoints for this kind of
> instrumentation, you might want to look into reusing the ftrace
> reporting bits.
There's already perfectly fine code in tree for this, I don't see why it would
need another infrastructure that doesn't really fit anyways.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:04 ` Pekka Enberg
@ 2009-04-28 9:15 ` Ingo Molnar
-1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 9:15 UTC (permalink / raw)
To: Pekka Enberg
Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Matt Mackall, Alexey Dobriyan, linux-mm
* Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> Hi Andi,
>
> On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
> > > I think i have to NAK this kind of ad-hoc instrumentation of kernel
> > > internals and statistics until we clear up why such instrumentation
> >
> > I think because it has zero fast path overhead and can be used
> > any time without enabling anything special.
( That's a dubious claim in any case - tracepoints are very cheap.
And they could be made even cheaper and such efforts would benefit
all the tracepoint users so it's a prime focus of interest.
Andi is a SystemTap proponent, right? I saw him oppose pretty much
everything built-in kernel tracing related. I consider that a
pretty extreme position. )
> Yes, zero overhead is important for certain things (like
> CONFIG_SLUB_STATS, for example). However, putting slab allocator
> specific checks in fs/proc looks pretty fragile to me. It would be
> nice to have this under the "kmemtrace umbrella" so that there's
> just one place that needs to be fixed up when allocators change.
>
> Also, while you probably don't want to use tracepoints for this
> kind of instrumentation, you might want to look into reusing the
> ftrace reporting bits.
Exactly - we have a tracing and statistics framework for a reason.
Ingo
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:10 ` Andi Kleen
@ 2009-04-28 9:15 ` Pekka Enberg
-1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28 9:15 UTC (permalink / raw)
To: Andi Kleen
Cc: Ingo Molnar, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Matt Mackall, Alexey Dobriyan, linux-mm
Hi Andi,
On Tue, Apr 28, 2009 at 12:10 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> Yes, zero overhead is important for certain things (like
>> CONFIG_SLUB_STATS, for example). However, putting slab allocator
>> specific checks in fs/proc looks pretty fragile to me. It would be nice
>
> Ok, perhaps that could be put into a inline into slab.h. Would
> that address your concerns?
Yeah, I'm fine with that. Putting them in the individual
slub_def.h/slob_def.h headers might be even better.
On Tue, Apr 28, 2009 at 12:10 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> Also, while you probably don't want to use tracepoints for this kind of
>> instrumentation, you might want to look into reusing the ftrace
>> reporting bits.
>
> There's already perfectly fine code in tree for this, I don't see why it would
> need another infrastructure that doesn't really fit anyways.
It's just that I suspect that we want page flag printing and
zero-overhead statistics for kmemtrace at some point. But anyway, I'm
not objecting to extending /proc/kpageflags if that's what people want
to do.
Pekka
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:15 ` Ingo Molnar
@ 2009-04-28 9:19 ` Pekka Enberg
-1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28 9:19 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Matt Mackall, Alexey Dobriyan, linux-mm
Hi Ingo,
On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
>> > > I think i have to NAK this kind of ad-hoc instrumentation of kernel
>> > > internals and statistics until we clear up why such instrumentation
* Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>> > I think because it has zero fast path overhead and can be used
>> > any time without enabling anything special.
On Tue, Apr 28, 2009 at 12:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
> ( That's a dubious claim in any case - tracepoints are very cheap.
> And they could be made even cheaper and such efforts would benefit
> all the tracepoint users so it's a prime focus of interest.
> Andi is a SystemTap proponent, right? I saw him oppose pretty much
> everything built-in kernel tracing related. I consider that a
> pretty extreme position. )
I have no idea how expensive tracepoints are but I suspect they don't
make too much sense for this particular scenario. After all, kmemtrace
is mainly interested in _allocation patterns_ whereas this patch seems
to be more interested in "memory layout" type of things.
Pekka
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 8:33 ` Wu Fengguang
@ 2009-04-28 9:24 ` Ingo Molnar
-1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 9:24 UTC (permalink / raw)
To: Wu Fengguang
Cc: Steven Rostedt, Frédéric Weisbecker, Larry Woodman,
Peter Zijlstra, Pekka Enberg, Eduard - Gabriel Munteanu,
Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall,
Alexey Dobriyan, linux-mm
* Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Tue, Apr 28, 2009 at 08:55:07AM +0200, Ingo Molnar wrote:
> >
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> > >
> > > 1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
> > > - all available page flags are exported, and
> > > - exported as is
> > > 2) for admins and end users
> > > - only the more `well known' flags are exported:
> > > 11. KPF_MMAP (pseudo flag) memory mapped page
> > > 12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
> > > 13. KPF_SWAPCACHE page is in swap cache
> > > 14. KPF_SWAPBACKED page is swap/RAM backed
> > > 15. KPF_COMPOUND_HEAD (*)
> > > 16. KPF_COMPOUND_TAIL (*)
> > > 17. KPF_UNEVICTABLE page is in the unevictable LRU list
> > > 18. KPF_HWPOISON hardware detected corruption
> > > 19. KPF_NOPAGE (pseudo flag) no page frame at the address
> > >
> > > (*) For compound pages, exporting _both_ head/tail info enables
> > > users to tell where a compound page starts/ends, and its order.
> > >
> > > - limit flags to their typical usage scenario, as indicated by KOSAKI:
> > > - LRU pages: only export relevant flags
> > > - PG_lru
> > > - PG_unevictable
> > > - PG_active
> > > - PG_referenced
> > > - page_mapped()
> > > - PageAnon()
> > > - PG_swapcache
> > > - PG_swapbacked
> > > - PG_reclaim
> > > - no-IO pages: mask out irrelevant flags
> > > - PG_dirty
> > > - PG_uptodate
> > > - PG_writeback
> > > - SLAB pages: mask out overloaded flags:
> > > - PG_error
> > > - PG_active
> > > - PG_private
> > > - PG_reclaim: mask out the overloaded PG_readahead
> > > - compound flags: only export huge/gigantic pages
> > >
> > > Here are the admin/linus views of all page flags on a newly booted nfs-root system:
> > >
> > > # ./page-types # for admin
> > > flags page-count MB symbolic-flags long-symbolic-flags
> > > 0x000000000000 491174 1918 ____________________________
> > > 0x000000000020 1 0 _____l______________________ lru
> > > 0x000000000028 2543 9 ___U_l______________________ uptodate,lru
> > > 0x00000000002c 5288 20 __RU_l______________________ referenced,uptodate,lru
> > > 0x000000004060 1 0 _____lA_______b_____________ lru,active,swapbacked
> >
> > I think i have to NAK this kind of ad-hoc instrumentation of kernel
> > internals and statistics until we clear up why such instrumentation
> > measures are being accepted into the MM while other, more dynamic
> > and more flexible MM instrumentation are being resisted by Andrew.
>
> An unexpected NAK - why throw away an orange just because we are about to get an apple? ;-)
>
> Anyway, here is the missing rationale.
>
> 1) FAST
>
> It takes merely 0.2s to scan 4GB worth of pages:
>
> ./page-types 0.02s user 0.20s system 99% cpu 0.216 total
>
> 2) SIMPLE
>
> /proc/kpageflags will be a *long standing* hack we have to live
> with - it was originally introduced by Matt to do shared memory
> accounting and to provide a facility for analyzing applications'
> memory consumption, in the hope that it would also help kernel
> developers someday.
>
> So why not extend and embrace it, in a straightforward way?
>
> 3) USE CASES
>
> I have/will take advantage of the above page-types command in a number of ways:
> - to help track down memory leaks (the recent trace/ring_buffer.c case)
> - to estimate the system-wide readahead miss ratio
> - Andi wants to examine the major page types in different workloads
> (for the hwpoison work)
> - Me too, for the fun of learning: read/write/lock/whatever a lot of pages
> and examine their flags, to get an idea of some random kernel behaviors.
> (the dynamic tracing tools can be more helpful, as a different view)
>
> 4) COMPLEMENTARITY
>
> In some cases a dynamic tracing tool is not enough (or is too complex)
> for rebuilding a view of the current status.
>
> I myself have a dynamic readahead tracing tool (very useful!). At
> the same time I also use readahead accounting numbers, the
> /proc/filecache tool (frequently!), and the above page-types tool.
> I simply need them all - they are handy for different cases.
Well, the main counter argument here is that statistics is _derived_
from events. In their simplest form the 'counts' are the integral of
events over time.
So if we capture all interesting events, and do that with low
overhead (and in fact can even collect and integrate them in-kernel,
today), we _don't have_ to maintain various overlapping counters all
around the kernel. This is really a general instrumentation design
observation.
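The "counts are the integral of events" point can be sketched in a few lines: folding a stream of page events yields the same kind of per-state counts that a /proc snapshot exposes. The event names below are hypothetical, not actual tracepoints.

```python
def integrate(events):
    """Fold (timestamp, event, page) records into current per-state counts."""
    state = {}                      # page -> last known state
    for _ts, ev, page in events:
        if ev == "free":
            state.pop(page, None)   # page leaves the system
        else:
            state[page] = ev        # e.g. "alloc", "lru_add"
    counts = {}
    for s in state.values():
        counts[s] = counts.get(s, 0) + 1
    return counts
```

A page that was allocated and then freed contributes nothing to the final counts; a page whose last event was "lru_add" shows up in that bucket, just as a flags snapshot taken at the same moment would report it.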
Every time we add yet another /proc hack we splinter Linux
instrumentation, in a hard-to-reverse way.
So your single-purpose /proc hack could be made multi-purpose and
could help a much broader range of people, with just a little bit of
effort i believe. Pekka already wrote the page tracking patch for
example, that would be a good starting point.
Does it mean more work to do? You bet ;-)
Ingo
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28 9:24 ` Ingo Molnar
0 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 9:24 UTC (permalink / raw)
To: Wu Fengguang
Cc: Steven Rostedt, Frédéric Weisbecker, Larry Woodman,
Peter Zijlstra, Pekka Enberg, Eduard - Gabriel Munteanu,
Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen, Matt Mackall,
Alexey Dobriyan, linux-mm
* Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Tue, Apr 28, 2009 at 08:55:07AM +0200, Ingo Molnar wrote:
> >
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> > >
> > > 1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
> > > - all available page flags are exported, and
> > > - exported as is
> > > 2) for admins and end users
> > > - only the more `well known' flags are exported:
> > > 11. KPF_MMAP (pseudo flag) memory mapped page
> > > 12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
> > > 13. KPF_SWAPCACHE page is in swap cache
> > > 14. KPF_SWAPBACKED page is swap/RAM backed
> > > 15. KPF_COMPOUND_HEAD (*)
> > > 16. KPF_COMPOUND_TAIL (*)
> > > 17. KPF_UNEVICTABLE page is in the unevictable LRU list
> > > 18. KPF_HWPOISON hardware detected corruption
> > > 19. KPF_NOPAGE (pseudo flag) no page frame at the address
> > >
> > > (*) For compound pages, exporting _both_ head/tail info enables
> > > users to tell where a compound page starts/ends, and its order.
> > >
> > > - limit flags to their typical usage scenario, as indicated by KOSAKI:
> > > - LRU pages: only export relevant flags
> > > - PG_lru
> > > - PG_unevictable
> > > - PG_active
> > > - PG_referenced
> > > - page_mapped()
> > > - PageAnon()
> > > - PG_swapcache
> > > - PG_swapbacked
> > > - PG_reclaim
> > > - no-IO pages: mask out irrelevant flags
> > > - PG_dirty
> > > - PG_uptodate
> > > - PG_writeback
> > > - SLAB pages: mask out overloaded flags:
> > > - PG_error
> > > - PG_active
> > > - PG_private
> > > - PG_reclaim: mask out the overloaded PG_readahead
> > > - compound flags: only export huge/gigantic pages
> > >
> > > Here are the admin/linus views of all page flags on a newly booted nfs-root system:
> > >
> > > # ./page-types # for admin
> > > flags page-count MB symbolic-flags long-symbolic-flags
> > > 0x000000000000 491174 1918 ____________________________
> > > 0x000000000020 1 0 _____l______________________ lru
> > > 0x000000000028 2543 9 ___U_l______________________ uptodate,lru
> > > 0x00000000002c 5288 20 __RU_l______________________ referenced,uptodate,lru
> > > 0x000000004060 1 0 _____lA_______b_____________ lru,active,swapbacked
> >
> > I think i have to NAK this kind of ad-hoc instrumentation of kernel
> > internals and statistics until we clear up why such instrumentation
> > measures are being accepted into the MM while other, more dynamic
> > and more flexible MM instrumentation are being resisted by Andrew.
>
> An unexpected NAK - throw away the orange because we are about to get an apple? ;-)
>
> Anyway, here is the missing rationale.
>
> 1) FAST
>
> It takes merely 0.2s to scan the page flags for 4GB of memory:
>
> ./page-types 0.02s user 0.20s system 99% cpu 0.216 total
>
> 2) SIMPLE
>
> /proc/kpageflags will be a *long standing* hack we have to live
> with - it was originally introduced by Matt to do shared-memory
> accounting and to provide a facility for analyzing applications'
> memory consumption, with the hope that it would also help kernel
> developers someday.
>
> So why not extend and embrace it, in a straightforward way?
>
> 3) USE CASES
>
> I have taken, and will take, advantage of the above page-types command in a number of ways:
> - to help track down memory leaks (the recent trace/ring_buffer.c case)
> - to estimate the system-wide readahead miss ratio
> - Andi wants to examine the major page types in different workloads
> (for the hwpoison work)
> - me too, for the fun of learning: read/write/lock/whatever a lot of pages
> and examine their flags, to get an idea of some random kernel behaviors
> (the dynamic tracing tools can be more helpful, as a different view)
>
> 4) COMPLEMENTARITY
>
> In some cases a dynamic tracing tool is not enough (or is too
> complex a way) to rebuild the current-status view.
>
> I myself have a dynamic readahead tracing tool(very useful!). At
> the same time I also use readahead accounting numbers, and the
> /proc/filecache tool(frequently!), and the above page-types tool.
> I simply need them all - they are handy for different cases.
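The scan in point 1 above just walks /proc/kpageflags, which exposes one 64-bit flags word per page frame. Below is a minimal Python sketch of what the page-types tool does; the KPF bit numbers are the ones listed in this patch series, reading the file itself requires root, and the decoding step is factored out so it can be checked without root:

```python
import struct

# KPF bit numbers as listed in this patch series (they follow
# Documentation/vm/pagemap in the kernel tree).
KPF = {
    0: "locked", 1: "error", 2: "referenced", 3: "uptodate",
    4: "dirty", 5: "lru", 6: "active", 7: "slab",
    8: "writeback", 9: "reclaim", 10: "buddy",
    11: "mmap", 12: "anon", 13: "swapcache", 14: "swapbacked",
    15: "compound_head", 16: "compound_tail", 17: "unevictable",
    18: "hwpoison", 19: "nopage",
}

def decode(word):
    """Turn one kpageflags word into the list of set flag names."""
    return [name for bit, name in sorted(KPF.items()) if (word >> bit) & 1]

def scan(pfn, count):
    """Read and decode flags for `count` page frames starting at `pfn`."""
    with open("/proc/kpageflags", "rb") as f:   # needs root
        f.seek(pfn * 8)                         # one u64 per page frame
        words = struct.unpack("<%dQ" % count, f.read(count * 8))
    return [(pfn + i, decode(w)) for i, w in enumerate(words)]

# decode(0x2c) -> ['referenced', 'uptodate', 'lru'], matching the
# __RU_l row in the page-types output quoted above.
```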
Well, the main counter-argument here is that statistics are _derived_
from events. In their simplest form the 'counts' are the integral of
events over time.
So if we capture all interesting events, and do that with low
overhead (and in fact can even collect and integrate them in-kernel,
today), we _don't have_ to maintain various overlapping counters all
around the kernel. This is really a general instrumentation design
observation.
Every time we add yet another /proc hack we splinter Linux
instrumentation in a hard-to-reverse way.
So your single-purpose /proc hack could be made multi-purpose and
could help a much broader range of people, with just a little bit of
effort i believe. Pekka already wrote the page tracking patch for
example, that would be a good starting point.
Does it mean more work to do? You bet ;-)
Ingo
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:19 ` Pekka Enberg
@ 2009-04-28 9:25 ` Pekka Enberg
-1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28 9:25 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Matt Mackall, Alexey Dobriyan, linux-mm
On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
>>> > > I think i have to NAK this kind of ad-hoc instrumentation of kernel
>>> > > internals and statistics until we clear up why such instrumentation
* Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>>> > I think because it has zero fast path overhead and can be used
>>> > any time without enabling anything special.
>
> On Tue, Apr 28, 2009 at 12:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> ( That's a dubious claim in any case - tracepoints are very cheap.
>> And they could be made even cheaper and such efforts would benefit
>> all the tracepoint users so it's a prime focus of interest.
>> Andi is a SystemTap proponent, right? I saw him oppose pretty much
>> everything built-in kernel tracing related. I consider that a
>> pretty extreme position. )
On Tue, Apr 28, 2009 at 12:19 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> I have no idea how expensive tracepoints are but I suspect they don't
> make too much sense for this particular scenario. After all, kmemtrace
> is mainly interested in _allocation patterns_ whereas this patch seems
> to be more interested in "memory layout" type of things.
That said, I do foresee a need to be able to turn on more detailed
tracing after you've identified problematic areas from a kpageflags-type
overview report. And for that, you almost certainly want a
kmemtrace/tracepoints-style solution with the pid/function/whatever
regexp matching ftrace already provides.
Pekka
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:19 ` Pekka Enberg
@ 2009-04-28 9:29 ` Ingo Molnar
-1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 9:29 UTC (permalink / raw)
To: Pekka Enberg
Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Matt Mackall, Alexey Dobriyan, linux-mm
* Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> I have no idea how expensive tracepoints are but I suspect they
> don't make too much sense for this particular scenario. After all,
> kmemtrace is mainly interested in _allocation patterns_ whereas
> this patch seems to be more interested in "memory layout" type of
> things.
My point is that the allocation patterns can be derived from dynamic
events. We can build a map of everything if we know all the events
that led up to it. Doing:
echo 3 > /proc/sys/vm/drop_caches
will clear 99% of the memory allocations, so we can build a new map
from scratch just about anytime. (and if boot allocations are
interesting they can be traced too)
_And_ via this angle we'll also have access to the dynamic events,
in a different 'view' of the same tracepoints - which is obviously
very useful for different purposes.
Ingo
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:29 ` Ingo Molnar
@ 2009-04-28 9:34 ` KOSAKI Motohiro
-1 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28 9:34 UTC (permalink / raw)
To: Ingo Molnar
Cc: kosaki.motohiro, Pekka Enberg, Andi Kleen, Wu Fengguang,
Steven Rostedt, Frédéric Weisbecker, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
>
> * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>
> > I have no idea how expensive tracepoints are but I suspect they
> > don't make too much sense for this particular scenario. After all,
> > kmemtrace is mainly interested in _allocation patterns_ whereas
> > this patch seems to be more interested in "memory layout" type of
> > things.
>
> My point is that the allocation patterns can be derived from dynamic
> events. We can build a map of everything if we know all the events
> that led up to it. Doing:
>
> echo 3 > /proc/sys/vm/drop_caches
>
> will clear 99% of the memory allocations, so we can build a new map
> from scratch just about anytime. (and if boot allocations are
> interesting they can be traced too)
>
> _And_ via this angle we'll also have access to the dynamic events,
> in a different 'view' of the same tracepoints - which is obviously
> very useful for different purposes.
I am one of the people who most strongly wants MM tracepoints.
But no: many customers will never permit the use of drop_caches.
I believe this patch and tracepoints are _both_ necessary and useful.
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:25 ` Pekka Enberg
@ 2009-04-28 9:36 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28 9:36 UTC (permalink / raw)
To: Pekka Enberg
Cc: Ingo Molnar, Andi Kleen, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Matt Mackall, Alexey Dobriyan, linux-mm
On Tue, Apr 28, 2009 at 05:25:06PM +0800, Pekka Enberg wrote:
> On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
> >>> > > I think i have to NAK this kind of ad-hoc instrumentation of kernel
> >>> > > internals and statistics until we clear up why such instrumentation
>
> * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> >>> > I think because it has zero fast path overhead and can be used
> >>> > any time without enabling anything special.
> >
> > On Tue, Apr 28, 2009 at 12:15 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >> ( That's a dubious claim in any case - tracepoints are very cheap.
> >> And they could be made even cheaper and such efforts would benefit
> >> all the tracepoint users so it's a prime focus of interest.
> >> Andi is a SystemTap proponent, right? I saw him oppose pretty much
> >> everything built-in kernel tracing related. I consider that a
> >> pretty extreme position. )
>
> On Tue, Apr 28, 2009 at 12:19 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > I have no idea how expensive tracepoints are but I suspect they don't
> > make too much sense for this particular scenario. After all, kmemtrace
> > is mainly interested in _allocation patterns_ whereas this patch seems
> > to be more interested in "memory layout" type of things.
>
> That said, I do foresee a need to be able to turn on more detailed
> tracing after you've identified problematic areas from a kpageflags-type
> overview report. And for that, you almost certainly want a
> kmemtrace/tracepoints-style solution with the pid/function/whatever
> regexp matching ftrace already provides.
Exactly - kmemtrace is the tool I looked for when hunting down the
page flags of the leaked ring buffer memory :-)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:25 ` Pekka Enberg
@ 2009-04-28 9:36 ` Ingo Molnar
-1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 9:36 UTC (permalink / raw)
To: Pekka Enberg
Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Matt Mackall, Alexey Dobriyan, linux-mm
* Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > I have no idea how expensive tracepoints are but I suspect they
> > don't make too much sense for this particular scenario. After
> > all, kmemtrace is mainly interested in _allocation patterns_
> > whereas this patch seems to be more interested in "memory
> > layout" type of things.
>
> That said, I do foresee a need to be able to turn on more detailed
> tracing after you've identified problematic areas from kpageflags
> type of overview report. And for that, you almost certainly want
> kmemtrace/tracepoints style solution with pid/function/whatever
> regexp matching ftrace already provides.
yes. My point is that by having the latter, we pretty much have the
former as well!
I 'integrate' traces all the time to get summary counts. This series
of dynamic events:
allocation
page count up
page count up
page count down
page count up
page count up
page count up
page count up
integrates into: "page count is 6".
Note that "integration" can be done wholly in the kernel too,
without incurring the overhead of streaming all dynamic events to
user-space just to summarize them into counts. That is
what the ftrace statistics framework and various ftrace plugins are
about.
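As a toy illustration of that integration, here is the event list above folded into its summary count. The event names are illustrative stand-ins, not actual tracepoint names:

```python
# Fold a stream of refcount events into a single summary count,
# the way an in-kernel statistics plugin would integrate them.
# Event names are illustrative, not actual tracepoint names.
events = ["allocation", "up", "up", "down", "up", "up", "up", "up"]

weight = {"allocation": 1, "up": 1, "down": -1}
page_count = sum(weight.get(e, 0) for e in events)
print(page_count)  # prints 6, the figure a maintained counter would show
```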
Also, it might make sense to extend the framework with a series of
'get current object state' events when tracing is turned on. A
special case of _that_ would in essence be what the /proc hack does
now - just expressed in a much more generic, and a much more usable
form.
Ingo
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:34 ` KOSAKI Motohiro
@ 2009-04-28 9:38 ` Ingo Molnar
-1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 9:38 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Pekka Enberg, Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
Alexey Dobriyan, linux-mm
* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> >
> > * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> >
> > > I have no idea how expensive tracepoints are but I suspect they
> > > don't make too much sense for this particular scenario. After all,
> > > kmemtrace is mainly interested in _allocation patterns_ whereas
> > > this patch seems to be more interested in "memory layout" type of
> > > things.
> >
> > My point is that the allocation patterns can be derived from dynamic
> > events. We can build a map of everything if we know all the events
> > that led up to it. Doing:
> >
> > echo 3 > /proc/sys/vm/drop_caches
> >
> > will clear 99% of the memory allocations, so we can build a new map
> > from scratch just about anytime. (and if boot allocations are
> > interesting they can be traced too)
> >
> > _And_ via this angle we'll also have access to the dynamic events,
> > in a different 'view' of the same tracepoints - which is obviously
> > very useful for different purposes.
>
> I am one of the people who most strongly wants MM tracepoints.
> But no: many customers will never permit the use of drop_caches.
See my other mail i just sent: it would be a natural extension of
tracing to also dump all current object state when tracing is turned
on. That way no drop_caches is needed at all.
But it has to be expressed in one framework that cares about the
totality of the kernel - not just these splintered bits of
instrumentation and pieces of statistics.
Ingo
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:38 ` Ingo Molnar
@ 2009-04-28 9:55 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28 9:55 UTC (permalink / raw)
To: Ingo Molnar
Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
Alexey Dobriyan, linux-mm
On Tue, Apr 28, 2009 at 05:38:33PM +0800, Ingo Molnar wrote:
>
> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
>
> > >
> > > * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > >
> > > > I have no idea how expensive tracepoints are but I suspect they
> > > > don't make too much sense for this particular scenario. After all,
> > > > kmemtrace is mainly interested in _allocation patterns_ whereas
> > > > this patch seems to be more interested in "memory layout" type of
> > > > things.
> > >
> > > My point is that the allocation patterns can be derived from dynamic
> > > events. We can build a map of everything if we know all the events
> > > that led up to it. Doing:
> > >
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > will clear 99% of the memory allocations, so we can build a new map
> > > from scratch just about anytime. (and if boot allocations are
> > > interesting they can be traced too)
> > >
> > > _And_ via this angle we'll also have access to the dynamic events,
> > > in a different 'view' of the same tracepoints - which is obviously
> > > very useful for different purposes.
> >
> > I am one of the people who most strongly wants MM tracepoints.
> > But no: many customers will never permit the use of drop_caches.
>
> See my other mail i just sent: it would be a natural extension of
> tracing to also dump all current object state when tracing is turned
> on. That way no drop_caches is needed at all.
I can understand the merits here - I also did readahead
tracing/accounting in _one_ piece of code. Very handy.
The readahead traces are now raw printks - converting to the ftrace
framework would be a big win.
But it's still not a fit-all solution: imagine a case where full data
_since_ boot is required, but the user cannot afford a reboot.
> But it has to be expressed in one framework that cares about the
> totality of the kernel - not just these splintered bits of
> instrumentation and pieces of statistics.
Though minded to push the kpageflags interface, I totally agree with
the above fine principle and discipline :-)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:36 ` Ingo Molnar
@ 2009-04-28 9:57 ` Pekka Enberg
-1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28 9:57 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Matt Mackall, Alexey Dobriyan, linux-mm
Hi Ingo,
On Tue, Apr 28, 2009 at 12:36 PM, Ingo Molnar <mingo@elte.hu> wrote:
> I 'integrate' traces all the time to get summary counts. This series
> of dynamic events:
>
> allocation
> page count up
> page count up
> page count down
> page count up
> page count up
> page count up
> page count up
>
> integrates into: "page count is 6".
>
> Note that "integration" can be done wholly in the kernel too,
> without going to the overhead of streaming all dynamic events to
> user-space, just to summarize data into counts, in-kernel. That is
> what the ftrace statistics framework and various ftrace plugins are
> about.
>
> Also, it might make sense to extend the framework with a series of
> 'get current object state' events when tracing is turned on. A
> special case of _that_ would in essence be what the /proc hack does
> now - just expressed in a much more generic, and a much more usable
> form.
I guess the main question here is whether this approach will scale to
something like kmalloc() or the page allocator in production
environments. For any serious workload, the frequency of events is
going to be pretty high.
Pekka
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:57 ` Pekka Enberg
@ 2009-04-28 10:10 ` KOSAKI Motohiro
-1 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28 10:10 UTC (permalink / raw)
To: Pekka Enberg
Cc: kosaki.motohiro, Ingo Molnar, Andi Kleen, Wu Fengguang,
Steven Rostedt, Frédéric Weisbecker, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
> I guess the main question here is whether this approach will scale to
> something like kmalloc() or the page allocator in production
> environments. For any serious workload, the frequency of events is
> going to be pretty high.
The Immediate Values patch series makes tracepoints zero-overhead
while they are not in use.
So if we implement a way to stop collecting statistics, that
restores the zero-overhead world.
We don't lose any performance to tracing.
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:55 ` Wu Fengguang
@ 2009-04-28 10:11 ` KOSAKI Motohiro
-1 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28 10:11 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro, Ingo Molnar, Pekka Enberg, Andi Kleen,
Steven Rostedt, Frédéric Weisbecker, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
> > > I am one of the guys who most strongly wants MM tracepoints. But no,
> > > many customers never permit the use of drop_caches.
> >
> > See my other mail i just sent: it would be a natural extension of
> > tracing to also dump all current object state when tracing is turned
> > on. That way no drop_caches is needed at all.
>
> I can understand the merits here - I also did readahead
> tracing/accounting in _one_ piece of code. Very handy.
>
> The readahead traces are now raw printks - converting to the ftrace
> framework would be a big win.
>
> But. It's still not a fit-all solution. Imagine when full data _since_
> booting is required, but the user cannot afford a reboot.
>
> > But it has to be expressed in one framework that cares about the
> > totality of the kernel - not just these splintered bits of
> > instrumentation and pieces of statistics.
>
> Though I'm minded to push the kpageflags interface, I totally agree
> with the above fine principle and discipline :-)
Yeah, I totally agree with your claim.
I'm interested in both the ftrace-based readahead tracer and this patch :)
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:38 ` Ingo Molnar
@ 2009-04-28 10:18 ` Andi Kleen
-1 siblings, 0 replies; 137+ messages in thread
From: Andi Kleen @ 2009-04-28 10:18 UTC (permalink / raw)
To: Ingo Molnar
Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Wu Fengguang,
Steven Rostedt, Frédéric Weisbecker, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
> But it has to be expressed in one framework that cares about the
> totality of the kernel - not just these splintered bits of
Can you perhaps expand a bit on what code that framework would
provide to kpageflags? As far as I can see it only needs
a ->read callback from somewhere, and I admit it's hard for me to see
how that could share much code with anything else.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 10:10 ` KOSAKI Motohiro
@ 2009-04-28 10:21 ` Pekka Enberg
-1 siblings, 0 replies; 137+ messages in thread
From: Pekka Enberg @ 2009-04-28 10:21 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Ingo Molnar, Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
Alexey Dobriyan, linux-mm
Hi!
2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>> I guess the main question here is whether this approach will scale to
>> something like kmalloc() or the page allocator in production
>> environments. For any serious workload, the frequency of events is
>> going to be pretty high.
>
> The Immediate Values patch series makes tracepoints zero-overhead
> while they are not in use.
>
> So if we implement a way to stop collecting statistics, that
> restores the zero-overhead world.
> We don't lose any performance to tracing.
Sure but I meant the _enabled_ case here. kmalloc() (and the page
allocator to some extent) is very performance sensitive in many
workloads so you probably don't want to use tracepoints if you're
collecting some overall statistics (i.e. tracing all events) like we
do here.
Pekka
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 10:21 ` Pekka Enberg
@ 2009-04-28 10:56 ` Ingo Molnar
-1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 10:56 UTC (permalink / raw)
To: Pekka Enberg
Cc: KOSAKI Motohiro, Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
Alexey Dobriyan, linux-mm
* Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> Hi!
>
> 2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> >> I guess the main question here is whether this approach will scale to
> >> something like kmalloc() or the page allocator in production
> >> environments. For any serious workload, the frequency of events is
> >> going to be pretty high.
> >
> > The Immediate Values patch series makes tracepoints zero-overhead
> > while they are not in use.
> >
> > So if we implement a way to stop collecting statistics, that
> > restores the zero-overhead world.
> > We don't lose any performance to tracing.
>
> Sure but I meant the _enabled_ case here. kmalloc() (and the page
> allocator to some extent) is very performance sensitive in many
> workloads so you probably don't want to use tracepoints if you're
> collecting some overall statistics (i.e. tracing all events) like
> we do here.
That's where 'collect current state' kind of tracepoints would help
- they could be used even without enabling any of the other
tracepoints. And they'd still be in a coherent whole with the
dynamic-events tracepoints.
So i'm not arguing against these techniques at all - and we can move
on a wide scale from zero-overhead to lots-of-tracing-enabled models
- what i'm arguing against is the splintering.
Ingo
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:57 ` Pekka Enberg
@ 2009-04-28 11:03 ` Ingo Molnar
-1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 11:03 UTC (permalink / raw)
To: Pekka Enberg
Cc: Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Matt Mackall, Alexey Dobriyan, linux-mm
* Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> Hi Ingo,
>
> On Tue, Apr 28, 2009 at 12:36 PM, Ingo Molnar <mingo@elte.hu> wrote:
> > I 'integrate' traces all the time to get summary counts. This series
> > of dynamic events:
> >
> > allocation
> > page count up
> > page count up
> > page count down
> > page count up
> > page count up
> > page count up
> > page count up
> >
> > integrates into: "page count is 6".
> >
> > Note that "integration" can be done wholly in the kernel too,
> > without going to the overhead of streaming all dynamic events to
> > user-space, just to summarize data into counts, in-kernel. That
> > is what the ftrace statistics framework and various ftrace
> > plugins are about.
> >
> > Also, it might make sense to extend the framework with a series
> > of 'get current object state' events when tracing is turned on.
> > A special case of _that_ would in essence be what the /proc hack
> > does now - just expressed in a much more generic, and a much
> > more usable form.
>
> I guess the main question here is whether this approach will scale
> to something like kmalloc() or the page allocator in production
> environments. For any serious workload, the frequency of events is
> going to be pretty high.
it depends on the level of integration. If the integration is done
right at the tracepoint callback, performance overhead will be very
small. If everything is traced and then streamed to user-space then
there is going to be noticeable overhead starting somewhere around a
few hundred thousand events per second per cpu.
Note that the 'get object state' approach i outlined above in the
final paragraph has no runtime overhead at all. As long as 'object
state' only covers fields that we maintain already for normal kernel
functionality, it costs nothing to allow the passive sampling of
that state. The /proc patch is a subset of such a facility in
essence.
Ingo
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:55 ` Wu Fengguang
@ 2009-04-28 11:05 ` Ingo Molnar
-1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 11:05 UTC (permalink / raw)
To: Wu Fengguang
Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
Alexey Dobriyan, linux-mm
* Wu Fengguang <fengguang.wu@intel.com> wrote:
> > See my other mail i just sent: it would be a natural extension
> > of tracing to also dump all current object state when tracing is
> > turned on. That way no drop_caches is needed at all.
>
> I can understand the merits here - I also did readahead
> tracing/accounting in _one_ piece of code. Very handy.
>
> The readahead traces are now raw printks - converting to the
> ftrace framework would be a big win.
>
> But. It's still not a fit-all solution. Imagine when full data
> _since_ booting is required, but the user cannot afford a reboot.
The above 'get object state' interface (which allows passive
sampling) - integrated into the tracing framework - would serve that
goal, agreed?
Ingo
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 10:56 ` Ingo Molnar
@ 2009-04-28 11:09 ` KOSAKI Motohiro
-1 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28 11:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pekka Enberg, Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
Alexey Dobriyan, linux-mm
2009/4/28 Ingo Molnar <mingo@elte.hu>:
>
> * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>
>> Hi!
>>
>> 2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>> >> I guess the main question here is whether this approach will scale to
>> >> something like kmalloc() or the page allocator in production
>> >> environments. For any serious workload, the frequency of events is
>> >> going to be pretty high.
>> >
>> > The Immediate Values patch series makes tracepoints zero-overhead
>> > while they are not in use.
>> >
>> > So if we implement a way to stop collecting statistics, that
>> > restores the zero-overhead world.
>> > We don't lose any performance to tracing.
>>
>> Sure but I meant the _enabled_ case here. kmalloc() (and the page
>> allocator to some extent) is very performance sensitive in many
>> workloads so you probably don't want to use tracepoints if you're
>> collecting some overall statistics (i.e. tracing all events) like
>> we do here.
>
> That's where 'collect current state' kind of tracepoints would help
> - they could be used even without enabling any of the other
> tracepoints. And they'd still be in a coherent whole with the
> dynamic-events tracepoints.
>
> So i'm not arguing against these techniques at all - and we can move
> on a wide scale from zero-overhead to lots-of-tracing-enabled models
> - what i'm arguing against is the splintering.
Umm, I guess Pekka and you are talking about different things.
If a tracepoint is ON, it makes one function call, but a few hot
spots cannot tolerate even one function call of overhead.
Scheduler stats and slab stats are good examples, I think.
I really don't want to convert slab_stat and sched_stat to
ftrace-based statistics: currently they need no extra function call
and only touch per-cpu variables, so the overhead is extremely small.
Unfortunately, tracepoints still don't reach that extreme level of
performance.
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
@ 2009-04-28 11:09 ` KOSAKI Motohiro
0 siblings, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28 11:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pekka Enberg, Andi Kleen, Wu Fengguang, Steven Rostedt,
Fr馘駻ic Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
Alexey Dobriyan, linux-mm
2009/4/28 Ingo Molnar <mingo@elte.hu>:
>
> * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>
>> Hi!
>>
>> 2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>> >> I guess the main question here is whether this approach will scale to
>> >> something like kmalloc() or the page allocator in production
>> >> environments. For any serious workload, the frequency of events is
>> >> going to be pretty high.
>> >
>> > Immediate Values patch series makes zero-overhead to tracepoint
>> > while it's not used.
>> >
>> > So, We have to implement to stop collect stastics way. it restore
>> > zero overhead world.
>> > We don't lose any performance by trace.
>>
>> Sure but I meant the _enabled_ case here. kmalloc() (and the page
>> allocator to some extent) is very performance sensitive in many
>> workloads so you probably don't want to use tracepoints if you're
>> collecting some overall statistics (i.e. tracing all events) like
>> we do here.
>
> That's where 'collect current state' kind of tracepoints would help
> - they could be used even without enabling any of the other
> tracepoints. And they'd still be in a coherent whole with the
> dynamic-events tracepoints.
>
> So i'm not arguing against these techniques at all - and we can move
> on a wide scale from zero-overhead to lots-of-tracing-enabled models
> - what i'm arguing against is the splintering.
umm.
I guess Pekka and you talk about different thing.
if tracepoint is ON, tracepoint makes one function call. but few hot spot don't
have patience to one function call overhead.
scheduler stat and slab stat are one of good example, I think.
I really don't want convert slab_stat and sched_stat to ftrace base stastics.
currently it don't need extra function call and it only touch per-cpu variable.
So, a overhead is extream small.
Unfortunately, tracepoint still don't reach this extream performance.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 11:05 ` Ingo Molnar
@ 2009-04-28 11:36 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28 11:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
Alexey Dobriyan, linux-mm
On Tue, Apr 28, 2009 at 07:05:53PM +0800, Ingo Molnar wrote:
>
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > > See my other mail i just sent: it would be a natural extension
> > > of tracing to also dump all current object state when tracing is
> > > turned on. That way no drop_caches is needed at all.
> >
> > I can understand the merits here - I also did readahead
> > tracing/accounting in _one_ piece of code. Very handy.
> >
> > The readahead traces are now raw printks - converting to the
> > ftrace framework would be a big win.
> >
> > But. It's still not a fit-all solution. Imagine when full data
> > _since_ booting is required, but the user cannot afford a reboot.
>
> The above 'get object state' interface (which allows passive
> sampling) - integrated into the tracing framework - would serve that
> goal, agreed?
Agreed. That could in theory be a good complement to dynamic tracing.
Then what will be the canonical form for all the 'get object state'
interfaces - "object.attr=value", or whatever? I'm afraid we will have
to sacrifice either efficiency or human readability to get a normalized
form. Or should we define two standard forms - one "key value" form and
one "value1 value2 value3..." form?
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
* [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
2009-04-28 11:36 ` Wu Fengguang
@ 2009-04-28 12:17 ` Ingo Molnar
-1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 12:17 UTC (permalink / raw)
To: Wu Fengguang, Li Zefan, Tom Zanussi
Cc: KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
Alexey Dobriyan, linux-mm
* Wu Fengguang <fengguang.wu@intel.com> wrote:
> > The above 'get object state' interface (which allows passive
> > sampling) - integrated into the tracing framework - would serve
> > that goal, agreed?
>
> Agreed. That could in theory be a good complement to dynamic
> tracing.
>
> Then what will be the canonical form for all the 'get object
> state' interfaces - "object.attr=value", or whatever? [...]
Lemme outline what i'm thinking of.
I'd call the feature "object collection tracing", which would live
in /debug/tracing, accessed via such files:
/debug/tracing/objects/mm/pages/
/debug/tracing/objects/mm/pages/format
/debug/tracing/objects/mm/pages/filter
/debug/tracing/objects/mm/pages/trace_pipe
/debug/tracing/objects/mm/pages/stats
/debug/tracing/objects/mm/pages/events/
here's the (proposed) semantics of those files:
1) /debug/tracing/objects/mm/pages/
There's a subsystem / object basic directory structure to make it
easy and intuitive to find our way around there.
2) /debug/tracing/objects/mm/pages/format
the format file:
/debug/tracing/objects/mm/pages/format
Would reuse the existing dynamic-tracepoint structured-logging
descriptor format and code (this is upstream already):
[root@phoenix sched_signal_send]# pwd
/debug/tracing/events/sched/sched_signal_send
[root@phoenix sched_signal_send]# cat format
name: sched_signal_send
ID: 24
format:
field:unsigned short common_type; offset:0; size:2;
field:unsigned char common_flags; offset:2; size:1;
field:unsigned char common_preempt_count; offset:3; size:1;
field:int common_pid; offset:4; size:4;
field:int common_tgid; offset:8; size:4;
field:int sig; offset:12; size:4;
field:char comm[TASK_COMM_LEN]; offset:16; size:16;
field:pid_t pid; offset:32; size:4;
print fmt: "sig: %d task %s:%d", REC->sig, REC->comm, REC->pid
These format descriptors enumerate fields, types and sizes, in a
structured way that user-space tools can parse easily. (The binary
records that come from the trace_pipe file follow this format
description.)
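As a rough user-space sketch (not part of any patch here; the field
layout is taken from the sample descriptor above), such a format file
can be parsed with a few lines of code:

```python
import re

def parse_format(text):
    """Parse an ftrace 'format' descriptor into
    (name, c_type, offset, size) tuples."""
    fields = []
    for m in re.finditer(r"field:([^;]+);\s*offset:(\d+);\s*size:(\d+);",
                         text):
        decl = m.group(1).strip()
        c_type, name = decl.rsplit(" ", 1)   # split off the field name
        fields.append((name, c_type, int(m.group(2)), int(m.group(3))))
    return fields

sample = """\
field:unsigned short common_type; offset:0; size:2;
field:int sig; offset:12; size:4;
field:char comm[TASK_COMM_LEN]; offset:16; size:16;
"""
print(parse_format(sample))
```

A tool would then use the (offset, size) pairs to slice the binary
records that come out of trace_pipe.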
3) /debug/tracing/objects/mm/pages/filter
This is the tracing filter that can be set based on the 'format'
descriptor. So with the above (signal-send tracepoint) you can
define such filter expressions:
echo "(sig == 10 && comm == bash) || sig == 13" > filter
To restrict the 'scope' of the object collection along pretty much
any key or combination of keys. (Or you can leave it as it is and
dump all objects and do keying in user-space.)
[ Using in-kernel filtering is obviously faster than streaming it
out to user-space - but there might be details and types of
visualization you want to do in user-space - so we don't want to
restrict things here. ]
For the mm object collection tracepoint i could imagine such filter
expressions:
echo "type == shared && file == /sbin/init" > filter
To dump all shared pages that are mapped to /sbin/init.
4) /debug/tracing/objects/mm/pages/trace_pipe
The 'trace_pipe' file can be used to dump all objects in the
collection, which match the filter ('all objects' by default). The
record format is described in 'format'.
trace_pipe would be a reuse of the existing trace_pipe code: it is a
modern, poll()-able, read()-able, splice()-able pipe abstraction.
5) /debug/tracing/objects/mm/pages/stats
The 'stats' file would be a reuse of the existing histogram code of
the tracing code. We already make use of it for the branch tracers
and for the workqueue tracer - it could be extended to be applicable
to object collections as well.
The advantage there would be that there's no dumping at all - all
the integration is done straight in the kernel. ( The 'filter'
condition is listened to - increasing flexibility. The filter file
could perhaps also act as a default histogram key. )
6) /debug/tracing/objects/mm/pages/events/
The 'events' directory offers links back to existing dynamic
tracepoints that are under /debug/tracing/events/. This would serve
as an additional coherent force that keeps dynamic tracepoints
collected by subsystem and by object type as well. (Tools could make
use of this information as well - without being aware of actual
object semantics.)
There would be a number of other object collections we could
enumerate:
tasks:
/debug/tracing/objects/sched/tasks/
active inodes known to the kernel:
/debug/tracing/objects/fs/inodes/
interrupts:
/debug/tracing/objects/hw/irqs/
etc.
These would use the same 'object collection' framework. Once done, we
can use it for many other things too.
Note how organically integrated it all is with the tracing
framework. You could start from an 'object view' to get an overview
and then go towards a more dynamic view of specific object
attributes (or specific objects), as you drill down on a specific
problem you want to analyze.
How does this all sound to you?
Can you see any conceptual holes in the scheme, any use-case that
/proc/kpageflags supports but the object collection approach does
not?
Would you be interested in seeing something like this, if we tried
to implement it in the tracing tree? The majority of the code
already exists, we just need interest from the MM side and we have
to hook it all up. (it is by no means trivial to do - but looks like
a very exciting feature.)
Thanks,
Ingo
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 11:09 ` KOSAKI Motohiro
@ 2009-04-28 12:42 ` Ingo Molnar
-1 siblings, 0 replies; 137+ messages in thread
From: Ingo Molnar @ 2009-04-28 12:42 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Pekka Enberg, Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, Matt Mackall,
Alexey Dobriyan, linux-mm
* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 2009/4/28 Ingo Molnar <mingo@elte.hu>:
> >
> > * Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> >
> >> Hi!
> >>
> >> 2009/4/28 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> >> >> I guess the main question here is whether this approach will scale to
> >> >> something like kmalloc() or the page allocator in production
> >> >> environments. For any serious workload, the frequency of events is
> >> >> going to be pretty high.
> >> >
> >> > The Immediate Values patch series makes tracepoints
> >> > zero-overhead while they are not in use.
> >> >
> >> > So we have to implement a way to stop collecting statistics; that
> >> > restores the zero-overhead world.
> >> > We don't lose any performance to tracing.
> >>
> >> Sure but I meant the _enabled_ case here. kmalloc() (and the page
> >> allocator to some extent) is very performance sensitive in many
> >> workloads so you probably don't want to use tracepoints if you're
> >> collecting some overall statistics (i.e. tracing all events) like
> >> we do here.
> >
> > That's where 'collect current state' kind of tracepoints would help
> > - they could be used even without enabling any of the other
> > tracepoints. And they'd still be in a coherent whole with the
> > dynamic-events tracepoints.
> >
> > So i'm not arguing against these techniques at all - and we can move
> > on a wide scale from zero-overhead to lots-of-tracing-enabled models
> > - what i'm arguing against is the splintering.
>
> Umm, I guess Pekka and you are talking about different things.
>
> If a tracepoint is ON, it makes one function call, but a few
> hot spots cannot tolerate even one function call of overhead.
>
> Scheduler stats and slab stats are good examples, I think.
>
> I really don't want to convert slab_stat and sched_stat to
> ftrace-based statistics. Currently they need no extra function call
> and only touch per-CPU variables, so the overhead is extremely small.
>
> Unfortunately, tracepoints still don't reach that extreme level of
> performance.
I understand that - please see my "[rfc] object collection tracing"
reply in this thread for a more detailed description of what i
meant by 'object state tracing'.
Ingo
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
2009-04-28 12:17 ` Ingo Molnar
@ 2009-04-28 13:31 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-28 13:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Li Zefan, Tom Zanussi, KOSAKI Motohiro, Pekka Enberg, Andi Kleen,
Steven Rostedt, Frédéric Weisbecker, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
>
>
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > > The above 'get object state' interface (which allows passive
> > > sampling) - integrated into the tracing framework - would serve
> > > that goal, agreed?
> >
> > Agreed. That could in theory be a good complement to dynamic
> > tracing.
> >
> > Then what will be the canonical form for all the 'get object
> > state' interfaces - "object.attr=value", or whatever? [...]
>
> Lemme outline what i'm thinking of.
>
> I'd call the feature "object collection tracing", which would live
> in /debug/tracing, accessed via such files:
>
> /debug/tracing/objects/mm/pages/
> /debug/tracing/objects/mm/pages/format
> /debug/tracing/objects/mm/pages/filter
> /debug/tracing/objects/mm/pages/trace_pipe
> /debug/tracing/objects/mm/pages/stats
> /debug/tracing/objects/mm/pages/events/
>
> here's the (proposed) semantics of those files:
>
> 1) /debug/tracing/objects/mm/pages/
>
> There's a subsystem / object basic directory structure to make it
> easy and intuitive to find our way around there.
>
> 2) /debug/tracing/objects/mm/pages/format
>
> the format file:
>
> /debug/tracing/objects/mm/pages/format
>
> Would reuse the existing dynamic-tracepoint structured-logging
> descriptor format and code (this is upstream already):
>
> [root@phoenix sched_signal_send]# pwd
> /debug/tracing/events/sched/sched_signal_send
>
> [root@phoenix sched_signal_send]# cat format
> name: sched_signal_send
> ID: 24
> format:
> field:unsigned short common_type; offset:0; size:2;
> field:unsigned char common_flags; offset:2; size:1;
> field:unsigned char common_preempt_count; offset:3; size:1;
> field:int common_pid; offset:4; size:4;
> field:int common_tgid; offset:8; size:4;
>
> field:int sig; offset:12; size:4;
> field:char comm[TASK_COMM_LEN]; offset:16; size:16;
> field:pid_t pid; offset:32; size:4;
>
> print fmt: "sig: %d task %s:%d", REC->sig, REC->comm, REC->pid
>
> These format descriptors enumerate fields, types and sizes, in a
> structured way that user-space tools can parse easily. (The binary
> records that come from the trace_pipe file follow this format
> description.)
>
> 3) /debug/tracing/objects/mm/pages/filter
>
> This is the tracing filter that can be set based on the 'format'
> descriptor. So with the above (signal-send tracepoint) you can
> define such filter expressions:
>
> echo "(sig == 10 && comm == bash) || sig == 13" > filter
>
> To restrict the 'scope' of the object collection along pretty much
> any key or combination of keys. (Or you can leave it as it is and
> dump all objects and do keying in user-space.)
>
> [ Using in-kernel filtering is obviously faster than streaming it
> out to user-space - but there might be details and types of
> visualization you want to do in user-space - so we don't want to
> restrict things here. ]
>
> For the mm object collection tracepoint i could imagine such filter
> expressions:
>
> echo "type == shared && file == /sbin/init" > filter
>
> To dump all shared pages that are mapped to /sbin/init.
>
> 4) /debug/tracing/objects/mm/pages/trace_pipe
>
> The 'trace_pipe' file can be used to dump all objects in the
> collection, which match the filter ('all objects' by default). The
> record format is described in 'format'.
>
> trace_pipe would be a reuse of the existing trace_pipe code: it is a
> modern, poll()-able, read()-able, splice()-able pipe abstraction.
>
> 5) /debug/tracing/objects/mm/pages/stats
>
> The 'stats' file would be a reuse of the existing histogram code of
> the tracing code. We already make use of it for the branch tracers
> and for the workqueue tracer - it could be extended to be applicable
> to object collections as well.
>
> The advantage there would be that there's no dumping at all - all
> the integration is done straight in the kernel. ( The 'filter'
> condition is listened to - increasing flexibility. The filter file
> could perhaps also act as a default histogram key. )
>
> 6) /debug/tracing/objects/mm/pages/events/
>
> The 'events' directory offers links back to existing dynamic
> tracepoints that are under /debug/tracing/events/. This would serve
> as an additional coherent force that keeps dynamic tracepoints
> collected by subsystem and by object type as well. (Tools could make
> use of this information as well - without being aware of actual
> object semantics.)
>
>
> There would be a number of other object collections we could
> enumerate:
>
> tasks:
>
> /debug/tracing/objects/sched/tasks/
>
> active inodes known to the kernel:
>
> /debug/tracing/objects/fs/inodes/
>
> interrupts:
>
> /debug/tracing/objects/hw/irqs/
>
> etc.
>
> These would use the same 'object collection' framework. Once done we
> can use it for many other things too.
>
> Note how organically integrated it all is with the tracing
> framework. You could start from an 'object view' to get an overview
> and then go towards a more dynamic view of specific object
> attributes (or specific objects), as you drill down on a specific
> problem you want to analyze.
>
> How does this all sound to you?
Great! I see a lot of opportunity to adapt the not-yet-submitted
/proc/filecache interface to the proposed framework.
Its basic form is:
# ino size cached cached% refcnt state age accessed process dev file
[snip]
320 1 4 100 1 D- 50443 1085 udevd 00:11(tmpfs) /.udev/uevent_seqnum
460725 123 124 100 35 -- 50444 6795 touch 08:02(sda2) /lib/libpthread-2.9.so
460727 31 32 100 14 -- 50444 2007 touch 08:02(sda2) /lib/librt-2.9.so
458865 97 80 82 1 -- 50444 49 mount 08:02(sda2) /lib/libdevmapper.so.1.02.1
460090 15 16 100 1 -- 50444 48 mount 08:02(sda2) /lib/libuuid.so.1.2
458866 46 48 100 1 -- 50444 47 mount 08:02(sda2) /lib/libblkid.so.1.0
460732 43 44 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_nis-2.9.so
460739 87 88 100 73 -- 50444 3597 rcS 08:02(sda2) /lib/libnsl-2.9.so
460726 31 32 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_compat-2.9.so
458804 250 252 100 11 -- 50445 8175 rcS 08:02(sda2) /lib/libncurses.so.5.6
229540 780 752 96 3 -- 50445 7594 init 08:02(sda2) /bin/bash
460735 15 16 100 89 -- 50445 17581 init 08:02(sda2) /lib/libdl-2.9.so
460721 1344 1340 99 117 -- 50445 48732 init 08:02(sda2) /lib/libc-2.9.so
458801 107 104 97 24 -- 50445 3586 init 08:02(sda2) /lib/libselinux.so.1
671870 37 24 65 1 -- 50446 1 swapper 08:02(sda2) /sbin/init
175 1 24412 100 1 -- 50446 0 swapper 00:01(rootfs) /dev/root
The patch basically traverses one or more of the inode lists to
produce the output:
inode_in_use
inode_unused
sb->s_dirty
sb->s_io
sb->s_more_io
sb->s_inodes
The filtering feature is a necessity for this interface - otherwise a
full listing takes considerable time. It supports the following
filters:
{ LS_OPT_DIRTY, "dirty" },
{ LS_OPT_CLEAN, "clean" },
{ LS_OPT_INUSE, "inuse" },
{ LS_OPT_EMPTY, "empty" },
{ LS_OPT_ALL, "all" },
{ LS_OPT_DEV, "dev=%s" },
There are two possible challenges for the conversion:
- One trick it plays is to select different lists to traverse
depending on the filter options. Will this be possible in the object
tracing framework?
- The file name lookup (the last field) is the performance killer. Is
it possible to skip the file name lookup when the filter has already
failed on the leading fields?
Will the object tracing interface allow such flexibility?
(Sorry I'm not yet familiar with the tracing framework.)
> Can you see any conceptual holes in the scheme, any use-case that
> /proc/kpageflags supports but the object collection approach does
> not?
kpageflags is simply a big (perhaps sparse) binary array.
I'd still prefer to retain its current form - the kernel patches and
user-space tools are all ready-made, and I see no benefit in
converting it to the tracing framework.
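For reference, here is a sketch of how user space consumes that array:
one 64-bit flags word per PFN, decoded with the bit numbers proposed in
this series (bits 0-10 are the pre-existing flags; treat the exact
numbering, and the little-endian '<Q' layout, as assumptions of this
sketch rather than a stable ABI):

```python
import struct

# Bit numbers as listed in this patch series (0-10 pre-existing, 11-19 new).
KPF_NAMES = {
    0: "LOCKED", 1: "ERROR", 2: "REFERENCED", 3: "UPTODATE", 4: "DIRTY",
    5: "LRU", 6: "ACTIVE", 7: "SLAB", 8: "WRITEBACK", 9: "RECLAIM",
    10: "BUDDY", 11: "MMAP", 12: "ANON", 13: "SWAPCACHE", 14: "SWAPBACKED",
    15: "COMPOUND_HEAD", 16: "COMPOUND_TAIL", 17: "UNEVICTABLE",
    18: "HWPOISON", 19: "NOPAGE",
}

def decode_kpageflags(raw):
    """raw: bytes read from /proc/kpageflags, one u64 per PFN
    (native byte order; '<Q' assumes a little-endian host)."""
    return [[name for bit, name in sorted(KPF_NAMES.items())
             if value >> bit & 1]
            for (value,) in struct.iter_unpack("<Q", raw)]

# e.g. flags for PFNs 0..1023 (needs root):
#   with open("/proc/kpageflags", "rb") as f:
#       flags = decode_kpageflags(f.read(1024 * 8))
```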
> Would you be interested in seeing something like this, if we tried
> to implement it in the tracing tree? The majority of the code
> already exists, we just need interest from the MM side and we have
> to hook it all up. (it is by no means trivial to do - but looks like
> a very exciting feature.)
Definitely! /proc/filecache has another 'page view':
# head /proc/filecache
# file /bin/bash
# flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback X:readahead P:private O:owner b:buffer d:dirty w:writeback
# idx len state refcnt
0 1 RAMU________ 4
3 8 RAMU________ 4
12 1 RAMU________ 4
14 5 RAMU________ 4
20 7 RAMU________ 4
27 2 RAMU________ 5
29 1 RAMU________ 4
That view is also a good candidate. However, I still need to
investigate whether it offers a considerable advantage over the
mincore() syscall.
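For comparison, this is roughly what the mincore() route looks like
from user space - a hedged ctypes sketch, Linux-specific, with minimal
error handling; the MAP_PRIVATE+PROT_WRITE mapping exists only so
ctypes can borrow the mapping's address, nothing is written through it:

```python
import ctypes, ctypes.util, mmap, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def resident_pages(path):
    """Return one bool per page of 'path': is that page in the page cache?"""
    size = os.path.getsize(path)
    if size == 0:
        return []
    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * npages)()        # mincore fills one byte per page
    with open(path, "rb") as f:
        m = mmap.mmap(f.fileno(), size, flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        buf = (ctypes.c_ubyte * size).from_buffer(m)
        try:
            if libc.mincore(ctypes.c_void_p(ctypes.addressof(buf)),
                            ctypes.c_size_t(size), vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
        finally:
            del buf                          # release the export, then unmap
            m.close()
    return [bool(b & 1) for b in vec]        # bit 0: page is resident
```

Unlike the 'page view' above, this only reports residency - none of the
referenced/active/dirty state - which is exactly the margin in question.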
Thanks and Regards,
Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 9:36 ` Ingo Molnar
@ 2009-04-28 17:42 ` Matt Mackall
-1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 17:42 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pekka Enberg, Andi Kleen, Wu Fengguang, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Alexey Dobriyan, linux-mm
On Tue, 2009-04-28 at 11:36 +0200, Ingo Molnar wrote:
> I 'integrate' traces all the time to get summary counts. This series
> of dynamic events:
>
> allocation
> page count up
> page count up
> page count down
> page count up
> page count up
> page count up
> page count up
>
> integrates into: "page count is 6".
Perhaps you've failed calculus. The integral is 6 + C.
This is a critical distinction. Tracing is great for looking at changes,
but it completely falls down for static system-wide measurements because
it would require integrating from time=0 to get a meaningful summation.
That's completely useless for taking a measurement on a system that
already has an uptime of months.
Never mind that summing up page flag changes for every page on the
system since boot time through the trace interface is incredibly
wasteful given that we're keeping a per-page integral in the page tables
anyway.
Tracing is not the answer for everything.
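The constant-of-integration point can be made concrete with a toy example (illustrative only): replaying a stream of refcount events recovers the current count only if the count at the moment tracing started is also known.

```python
def replay(initial_count, deltas):
    # Integrating a stream of +1/-1 refcount events yields
    # current = initial + sum(deltas); the unknown initial state
    # is exactly the 'C' in '6 + C'.
    return initial_count + sum(deltas)

# The event series quoted above: allocation sets the count to 1,
# then up, up, down, up, up, up, up.
deltas = [+1, +1, -1, +1, +1, +1, +1]
assert sum(deltas) == 5          # all that the trace alone can tell you
assert replay(1, deltas) == 6    # traced from allocation: count is 6
assert replay(4, deltas) == 9    # same trace, attached later: count is 9
```

Two systems producing identical traces but attached at different times disagree on the final count, which is why a snapshot interface like kpageflags cannot be replaced by event integration alone.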
--
http://selenic.com : development and support for Mercurial and Linux
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 1:09 ` Wu Fengguang
@ 2009-04-28 17:49 ` Matt Mackall
-1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 17:49 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
Alexey Dobriyan, linux-mm
On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> plain text document attachment (kpageflags-extending.patch)
> Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
My only concern with this patch is it knows a bit too much about SLUB
internals (and perhaps not enough about SLOB, which also overloads
flags).
--
http://selenic.com : development and support for Mercurial and Linux
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 8:33 ` Wu Fengguang
@ 2009-04-28 18:11 ` Tony Luck
-1 siblings, 0 replies; 137+ messages in thread
From: Tony Luck @ 2009-04-28 18:11 UTC (permalink / raw)
To: Wu Fengguang
Cc: Ingo Molnar, Steven Rostedt, Frédéric Weisbecker,
Larry Woodman, Peter Zijlstra, Pekka Enberg,
Eduard - Gabriel Munteanu, Andrew Morton, LKML, KOSAKI Motohiro,
Andi Kleen, Matt Mackall, Alexey Dobriyan, linux-mm
On Tue, Apr 28, 2009 at 1:33 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> 1) FAST
>
> It takes merely 0.2s to scan 4GB pages:
>
> ./page-types 0.02s user 0.20s system 99% cpu 0.216 total
OK on a tiny system ... but sounds painful on a big
server. 0.2s for 4G scales up to 3 minutes 25 seconds
on a 4TB system (4TB systems were being sold two
years ago ... so by now the high end will have moved
up to 8TB or perhaps 16TB).
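The scaling arithmetic here is a straight linear extrapolation of the 0.2 s / 4 GB figure (assuming the same per-page cost):

```python
def scan_time(mem_bytes, base_time_s=0.2, base_bytes=4 * 2**30):
    # Linear extrapolation of the quoted 0.2 s for 4 GB of pages.
    return base_time_s * mem_bytes / base_bytes

t = scan_time(4 * 2**40)                       # a 4 TB machine
minutes, seconds = divmod(round(t), 60)
print(f"{minutes} minutes {seconds} seconds")  # prints: 3 minutes 25 seconds
```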
Would the resulting output be anything but noise on
a big system (a *lot* of pages can change state in
3 minutes)?
-Tony
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 18:11 ` Tony Luck
@ 2009-04-28 18:34 ` Matt Mackall
-1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 18:34 UTC (permalink / raw)
To: Tony Luck
Cc: Wu Fengguang, Ingo Molnar, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Pekka Enberg, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
KOSAKI Motohiro, Andi Kleen, Alexey Dobriyan, linux-mm
On Tue, 2009-04-28 at 11:11 -0700, Tony Luck wrote:
> On Tue, Apr 28, 2009 at 1:33 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 1) FAST
> >
> > It takes merely 0.2s to scan 4GB pages:
> >
> > ./page-types 0.02s user 0.20s system 99% cpu 0.216 total
>
> OK on a tiny system ... but sounds painful on a big
> server. 0.2s for 4G scales up to 3 minutes 25 seconds
> on a 4TB system (4TB systems were being sold two
> years ago ... so by now the high end will have moved
> up to 8TB or perhaps 16TB).
>
> Would the resulting output be anything but noise on
> a big system (a *lot* of pages can change state in
> 3 minutes)?
Bah. The rate of change is proportional to #cpus, not #pages. Assuming
you've got 1024 processors, you could run the scan in parallel in .2
seconds still.
It won't be an atomic snapshot, obviously. But stopping the whole
machine on a system that size is probably not what you want anyway.
--
http://selenic.com : development and support for Mercurial and Linux
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 18:34 ` Matt Mackall
@ 2009-04-28 20:47 ` Tony Luck
-1 siblings, 0 replies; 137+ messages in thread
From: Tony Luck @ 2009-04-28 20:47 UTC (permalink / raw)
To: Matt Mackall
Cc: Wu Fengguang, Ingo Molnar, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Pekka Enberg, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
KOSAKI Motohiro, Andi Kleen, Alexey Dobriyan, linux-mm
On Tue, Apr 28, 2009 at 11:34 AM, Matt Mackall <mpm@selenic.com> wrote:
> Bah. The rate of change is proportional to #cpus, not #pages. Assuming
> you've got 1024 processors, you could run the scan in parallel in .2
> seconds still.
That would help ... it would also make the patch to support this
functionality a lot more complex.
-Tony
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 20:47 ` Tony Luck
@ 2009-04-28 20:54 ` Andi Kleen
-1 siblings, 0 replies; 137+ messages in thread
From: Andi Kleen @ 2009-04-28 20:54 UTC (permalink / raw)
To: Tony Luck
Cc: Matt Mackall, Wu Fengguang, Ingo Molnar, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Pekka Enberg, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
KOSAKI Motohiro, Andi Kleen, Alexey Dobriyan, linux-mm
On Tue, Apr 28, 2009 at 01:47:07PM -0700, Tony Luck wrote:
> On Tue, Apr 28, 2009 at 11:34 AM, Matt Mackall <mpm@selenic.com> wrote:
> > Bah. The rate of change is proportional to #cpus, not #pages. Assuming
> > you've got 1024 processors, you could run the scan in parallel in .2
> > seconds still.
>
> That would help ... it would also make the patch to support this
> functionality a lot more complex.
I suspect 4TB memory users are used to some things running
a little slower. I'm not sure we really need to make every obscure
debugging functionality scale well to these systems too.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 20:47 ` Tony Luck
@ 2009-04-28 20:59 ` Matt Mackall
-1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 20:59 UTC (permalink / raw)
To: Tony Luck
Cc: Wu Fengguang, Ingo Molnar, Steven Rostedt,
Frédéric Weisbecker, Larry Woodman, Peter Zijlstra,
Pekka Enberg, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
KOSAKI Motohiro, Andi Kleen, Alexey Dobriyan, linux-mm
On Tue, 2009-04-28 at 13:47 -0700, Tony Luck wrote:
> On Tue, Apr 28, 2009 at 11:34 AM, Matt Mackall <mpm@selenic.com> wrote:
> > Bah. The rate of change is proportional to #cpus, not #pages. Assuming
> > you've got 1024 processors, you could run the scan in parallel in .2
> > seconds still.
>
> That would help ... it would also make the patch to support this
> functionality a lot more complex.
The kernel bits should handle this already today. You just need 1k
userspace threads to open /proc/kpageflags, seek() appropriately, and
read().
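A sketch of how such a parallel scan could partition the work (hypothetical driver code; /proc/kpageflags holds one 8-byte flags word per PFN, so a slice of PFNs maps directly to a pread at byte offset pfn * 8):

```python
def partition_pfns(total_pfns, nthreads):
    """Split [0, total_pfns) into contiguous per-thread slices.

    Each (start, length) slice corresponds to one reader doing
    pread(kpageflags_fd, length * 8, start * 8), since the file
    contains one u64 of flags per page frame.
    """
    base, extra = divmod(total_pfns, nthreads)
    slices, start = [], 0
    for i in range(nthreads):
        length = base + (1 if i < extra else 0)  # spread the remainder
        slices.append((start, length))
        start += length
    return slices

# e.g. 4 GB of 4 KB pages is 2**20 PFNs, split across 1024 readers:
parts = partition_pfns(2**20, 1024)
assert sum(n for _, n in parts) == 2**20   # slices cover every PFN
assert all(n == 1024 for _, n in parts)    # evenly divisible in this case
```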
--
http://selenic.com : development and support for Mercurial and Linux
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 18:11 ` Tony Luck
@ 2009-04-28 21:17 ` Andrew Morton
-1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-28 21:17 UTC (permalink / raw)
To: Tony Luck
Cc: fengguang.wu, mingo, rostedt, fweisbec, lwoodman, a.p.zijlstra,
penberg, eduard.munteanu, linux-kernel, kosaki.motohiro, andi,
mpm, adobriyan, linux-mm
On Tue, 28 Apr 2009 11:11:52 -0700
Tony Luck <tony.luck@gmail.com> wrote:
> On Tue, Apr 28, 2009 at 1:33 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 1) FAST
> >
> > It takes merely 0.2s to scan 4GB pages:
> >
> >         ./page-types  0.02s user 0.20s system 99% cpu 0.216 total
>
> OK on a tiny system ... but sounds painful on a big
> server. 0.2s for 4G scales up to 3 minutes 25 seconds
> on a 4TB system (4TB systems were being sold two
> years ago ... so by now the high end will have moved
> up to 8TB or perhaps 16TB).
>
> Would the resulting output be anything but noise on
> a big system (a *lot* of pages can change state in
> 3 minutes)?
>
Reading the state of all of memory in this fashion would be a somewhat
peculiar thing to do. Bear in mind that kpagemap and friends are also
designed to allow userspace to inspect the state of a particular
process's memory.
Documentation/vm/pagemap.txt describes it nicely:
: The general procedure for using pagemap to find out about a process' memory
: usage goes like this:
:
: 1. Read /proc/pid/maps to determine which parts of the memory space are
: mapped to what.
: 2. Select the maps you are interested in -- all of them, or a particular
: library, or the stack or the heap, etc.
: 3. Open /proc/pid/pagemap and seek to the pages you would like to examine.
: 4. Read a u64 for each page from pagemap.
: 5. Open /proc/kpagecount and/or /proc/kpageflags. For each PFN you just
: read, seek to that entry in the file, and read the data you want.
although I expect that this is not the use case when the feature is
being used to debug/tune readahead.
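Steps 3-5 of the quoted procedure reduce to seek-and-read arithmetic. A minimal sketch, using the entry layout documented in pagemap.txt (PFN in bits 0-54, page-present in bit 63; the 4 KB page size is an assumption for the example):

```python
import struct

PAGE_SIZE = 4096  # assumed; query os.sysconf('SC_PAGE_SIZE') on a real system

def pagemap_offset(vaddr):
    # Step 3: /proc/pid/pagemap holds one 8-byte entry per virtual page.
    return (vaddr // PAGE_SIZE) * 8

def decode_pagemap_entry(raw):
    # Step 4: decode the u64 -- PFN in bits 0-54, 'present' in bit 63.
    (entry,) = struct.unpack('<Q', raw)
    present = bool(entry >> 63)
    pfn = entry & ((1 << 55) - 1) if present else None
    return present, pfn

def kpageflags_offset(pfn):
    # Step 5: /proc/kpageflags is likewise one u64 of flags per PFN.
    return pfn * 8

# A synthetic present page at PFN 0x1234:
raw = struct.pack('<Q', (1 << 63) | 0x1234)
assert decode_pagemap_entry(raw) == (True, 0x1234)
```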
But yes, if you have huge amounts of memory and you decide to write an
application which inspects the state of every physical page in the
machine, you can expect it to take a long time!
Of course, the VM does also accumulate bulk aggregated page statistics
and presents them in /proc/meminfo, /proc/vmstat and probably other
places. These numbers are maintained at runtime and the cost of doing
this is significant.
I don't _think_ there are presently any such counters which are
accumulated simply for instrumentation purposes - the kernel needs to
maintain them anyway for various reasons and it's a simple (and useful)
matter to make them available to userspace.
Generally, I think that pagemap is another of those things where we've
failed on the follow-through. There's a nice and powerful interface
for inspecting the state of a process's VM, but nobody knows about it
and there are no tools for accessing it and nobody is using it.
(Or maybe I'm wrong about that - I expect I'd have bugged Matt about
this and I expect that he'd have done something. Brain failed).
Either way, I think we'd serve the world better if we were to have some
nice little userspace tools which users could use to access this
information. Documentation/vm already has a Makefile!
Fengguang, you mention an executable called "page-types". Perhaps you
could "productise" that sometime?
A model here is Documentation/accounting/getdelays.c - that proved
quite useful and successful in the development of taskstats and I know
that several people are actually using getdelays.c as-is in serious
production environments. If we hadn't provided and maintained that
code in the kernel tree, it's unlikely that taskstats would have proved
as useful to users.
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 1:09 ` Wu Fengguang
@ 2009-04-28 21:32 ` Andrew Morton
-1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-28 21:32 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan,
fengguang.wu, linux-mm
On Tue, 28 Apr 2009 09:09:12 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> +/*
> + * Kernel flags are exported faithfully to Linus and his fellow hackers.
> + * Otherwise some details are masked to avoid confusing the end user:
> + * - some kernel flags are completely invisible
> + * - some kernel flags are conditionally invisible on their odd usages
> + */
> +#ifdef CONFIG_DEBUG_KERNEL
> +static inline int genuine_linus(void) { return 1; }
Although he's a fine chap, the use of the "_linus" tag isn't terribly
clear (to me). I think what you're saying here is that this enables
kernel-developer-only features, yes?
If so, perhaps we could come up with an identifier which expresses that
more clearly.
But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
for _some_ reason, so what's the point?
It is preferable that we always implement the same interface for all
Kconfig settings. If this exposes information which is confusing or
not useful to end-users then so be it - we should be able to cover that
in supporting documentation.
Also, as mentioned in the other email, it would be good if we were to
publish a little userspace app which people can use to access this raw
data. We could give that application an `--i-am-a-kernel-developer'
option!
> +#else
> +static inline int genuine_linus(void) { return 0; }
> +#endif
This isn't an appropriate use of CONFIG_DEBUG_KERNEL.
DEBUG_KERNEL is a Kconfig-only construct which is used to enable _other_
debugging features. The way you've used it here, if the person who is
configuring the kernel wants to enable any other completely-unrelated
debug feature, they have to enable DEBUG_KERNEL first. But when they
do that, they unexpectedly alter the behaviour of pagemap!
There are two other places where CONFIG_DEBUG_KERNEL affects code
generation in .c files: arch/parisc/mm/init.c and
arch/powerpc/kernel/sysfs.c. These are both wrong, and need slapping ;)
> +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
> + do { \
> + if (visible || genuine_linus()) \
> + uflags |= ((kflags >> kbit) & 1) << ubit; \
> + } while (0);
Did this have to be implemented as a macro?
It's bad, because it might or might not reference its argument, so if
someone passes it an expression-with-side-effects, the end result is
unpredictable. A C function is almost always preferable if possible.
> +/* a helper function _not_ intended for more general uses */
> +static inline int page_cap_writeback_dirty(struct page *page)
> +{
> + struct address_space *mapping;
> +
> + if (!PageSlab(page))
> + mapping = page_mapping(page);
> + else
> + mapping = NULL;
> +
> + return mapping && mapping_cap_writeback_dirty(mapping);
> +}
If the page isn't locked then page->mapping can be concurrently removed
and freed. This actually happened to me in real-life testing several
years ago.
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 21:17 ` Andrew Morton
@ 2009-04-28 21:49 ` Matt Mackall
-1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 21:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Tony Luck, fengguang.wu, mingo, rostedt, fweisbec, lwoodman,
a.p.zijlstra, penberg, eduard.munteanu, linux-kernel,
kosaki.motohiro, andi, adobriyan, linux-mm
On Tue, 2009-04-28 at 14:17 -0700, Andrew Morton wrote:
> On Tue, 28 Apr 2009 11:11:52 -0700
> Tony Luck <tony.luck@gmail.com> wrote:
>
> > On Tue, Apr 28, 2009 at 1:33 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 1) FAST
> > >
> > > It takes merely 0.2s to scan 4GB pages:
> > >
> > >     ./page-types  0.02s user 0.20s system 99% cpu 0.216 total
> >
> > OK on a tiny system ... but sounds painful on a big
> > server. 0.2s for 4G scales up to 3 minutes 25 seconds
> > on a 4TB system (4TB systems were being sold two
> > years ago ... so by now the high end will have moved
> > up to 8TB or perhaps 16TB).
> >
> > Would the resulting output be anything but noise on
> > a big system (a *lot* of pages can change state in
> > 3 minutes)?
> >
>
> Reading the state of all of memory in this fashion would be a somewhat
> peculiar thing to do.
Not entirely. If you've got, say, a large NUMA box, it could be
incredibly illustrative to see that "oh, this node is entirely dominated
by SLAB allocations". Or on a smaller machine "oh, this is fragmented to
hell and there's no way I'm going to get a huge page". Things you're not
going to get from individual stats.
> Generally, I think that pagemap is another of those things where we've
> failed on the follow-through. There's a nice and powerful interface
> for inspecting the state of a process's VM, but nobody knows about it
> and there are no tools for accessing it and nobody is using it.
People keep finding bugs in the thing exercising it in new ways, so I
presume people are writing their own tools. My hope was that my original
tools would inspire someone to take it and run with it - I really have
no stomach for writing GUI tools.
However, I've recently gone and written a pretty generically useful
command-line tool that hopefully will get more traction:
http://www.selenic.com/smem/
I'm expecting it to get written up on LWN shortly, so I haven't spent
much time doing my own advertising.
--
http://selenic.com : development and support for Mercurial and Linux
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 21:32 ` Andrew Morton
@ 2009-04-28 22:46 ` Matt Mackall
-1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 22:46 UTC (permalink / raw)
To: Andrew Morton
Cc: Wu Fengguang, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm
On Tue, 2009-04-28 at 14:32 -0700, Andrew Morton wrote:
> > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
> > + do { \
> > + if (visible || genuine_linus()) \
> > + uflags |= ((kflags >> kbit) & 1) << ubit; \
> > + } while (0);
>
> Did this have to be implemented as a macro?
I'm mostly to blame for that. I seem to recall the optimizer doing a
better job on this as a macro.
> It's bad, because it might or might not reference its argument, so if
> someone passes it an expression-with-side-effects, the end result is
> unpredictable. A C function is almost always preferable if possible.
I don't think there's any use case for it outside of its one user?
> > +/* a helper function _not_ intended for more general uses */
> > +static inline int page_cap_writeback_dirty(struct page *page)
> > +{
> > + struct address_space *mapping;
> > +
> > + if (!PageSlab(page))
> > + mapping = page_mapping(page);
> > + else
> > + mapping = NULL;
> > +
> > + return mapping && mapping_cap_writeback_dirty(mapping);
> > +}
>
> If the page isn't locked then page->mapping can be concurrently removed
> and freed. This actually happened to me in real-life testing several
> years ago.
We certainly don't want to be taking locks per page to build the flags
data here. As we don't have any pretense of being atomic, it's ok if we
can find a way to do the test that's inaccurate when a race occurs, so
long as it doesn't dereference null.
But if there's not an obvious way to do that, we should probably just
drop this flag bit for this iteration.
--
http://selenic.com : development and support for Mercurial and Linux
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 22:46 ` Matt Mackall
@ 2009-04-28 23:02 ` Andrew Morton
-1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-28 23:02 UTC (permalink / raw)
To: Matt Mackall
Cc: fengguang.wu, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm
On Tue, 28 Apr 2009 17:46:34 -0500
Matt Mackall <mpm@selenic.com> wrote:
> > > +/* a helper function _not_ intended for more general uses */
> > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > +{
> > > + struct address_space *mapping;
> > > +
> > > + if (!PageSlab(page))
> > > + mapping = page_mapping(page);
> > > + else
> > > + mapping = NULL;
> > > +
> > > + return mapping && mapping_cap_writeback_dirty(mapping);
> > > +}
> >
> > If the page isn't locked then page->mapping can be concurrently removed
> > and freed. This actually happened to me in real-life testing several
> > years ago.
>
> We certainly don't want to be taking locks per page to build the flags
> data here. As we don't have any pretense of being atomic, it's ok if we
> can find a way to do the test that's inaccurate when a race occurs, so
> long as it doesn't dereference null.
>
> But if there's not an obvious way to do that, we should probably just
> drop this flag bit for this iteration.
trylock_page() could be used here, perhaps.
Then again, why _not_ just do lock_page()? After all, few pages are
ever locked. There will be latency if the caller stumbles across a
page which is under read I/O, but so be it?
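A trylock-based variant along these lines might look like the following (kernel-style sketch written purely for illustration, not taken from any posted patch): a failed trylock is simply reported as "no", which is acceptable given the interface makes no atomicity promises anyway.

```
/* sketch only: trylock-guarded variant of the helper under discussion */
static int page_cap_writeback_dirty(struct page *page)
{
	struct address_space *mapping;
	int ret = 0;

	if (PageSlab(page) || !trylock_page(page))
		return 0;	/* racy miss is fine; never dereferences freed mapping */

	mapping = page_mapping(page);
	if (mapping && mapping_cap_writeback_dirty(mapping))
		ret = 1;
	unlock_page(page);
	return ret;
}
```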
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 23:02 ` Andrew Morton
@ 2009-04-28 23:31 ` Matt Mackall
-1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 23:31 UTC (permalink / raw)
To: Andrew Morton
Cc: fengguang.wu, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm
On Tue, 2009-04-28 at 16:02 -0700, Andrew Morton wrote:
> On Tue, 28 Apr 2009 17:46:34 -0500
> Matt Mackall <mpm@selenic.com> wrote:
>
> > > > +/* a helper function _not_ intended for more general uses */
> > > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > > +{
> > > > + struct address_space *mapping;
> > > > +
> > > > + if (!PageSlab(page))
> > > > + mapping = page_mapping(page);
> > > > + else
> > > > + mapping = NULL;
> > > > +
> > > > + return mapping && mapping_cap_writeback_dirty(mapping);
> > > > +}
> > >
> > > If the page isn't locked then page->mapping can be concurrently removed
> > > and freed. This actually happened to me in real-life testing several
> > > years ago.
> >
> > We certainly don't want to be taking locks per page to build the flags
> > data here. As we don't have any pretense of being atomic, it's ok if we
> > can find a way to do the test that's inaccurate when a race occurs, so
> > long as it doesn't dereference null.
> >
> > But if there's not an obvious way to do that, we should probably just
> > drop this flag bit for this iteration.
>
> trylock_page() could be used here, perhaps.
>
> Then again, why _not_ just do lock_page()? After all, few pages are
> ever locked. There will be latency if the caller stumbles across a
> page which is under read I/O, but so be it?
As I mentioned just a bit ago, it's really not an unreasonable use case
to want to do this on every page in the system back to back. So per page
overhead matters. And the odds of stalling on a locked page when
visiting 1M pages while under load are probably not negligible.
Our lock primitives are pretty low overhead in the fast path, but every
cycle counts. The new tests and branches this code already adds are a
bit worrisome, but on balance probably worth it.
--
http://selenic.com : development and support for Mercurial and Linux
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 23:31 ` Matt Mackall
@ 2009-04-28 23:42 ` Andrew Morton
-1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-28 23:42 UTC (permalink / raw)
To: Matt Mackall
Cc: fengguang.wu, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm
On Tue, 28 Apr 2009 18:31:09 -0500
Matt Mackall <mpm@selenic.com> wrote:
> On Tue, 2009-04-28 at 16:02 -0700, Andrew Morton wrote:
> > On Tue, 28 Apr 2009 17:46:34 -0500
> > Matt Mackall <mpm@selenic.com> wrote:
> >
> > > > > +/* a helper function _not_ intended for more general uses */
> > > > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > > > +{
> > > > > + struct address_space *mapping;
> > > > > +
> > > > > + if (!PageSlab(page))
> > > > > + mapping = page_mapping(page);
> > > > > + else
> > > > > + mapping = NULL;
> > > > > +
> > > > > + return mapping && mapping_cap_writeback_dirty(mapping);
> > > > > +}
> > > >
> > > > If the page isn't locked then page->mapping can be concurrently removed
> > > > and freed. This actually happened to me in real-life testing several
> > > > years ago.
> > >
> > > We certainly don't want to be taking locks per page to build the flags
> > > data here. As we don't have any pretense of being atomic, it's ok if we
> > > can find a way to do the test that's inaccurate when a race occurs, so
> > > long as it doesn't dereference null.
> > >
> > > But if there's not an obvious way to do that, we should probably just
> > > drop this flag bit for this iteration.
> >
> > trylock_page() could be used here, perhaps.
> >
> > Then again, why _not_ just do lock_page()? After all, few pages are
> > ever locked. There will be latency if the caller stumbles across a
> > page which is under read I/O, but so be it?
>
> As I mentioned just a bit ago, it's really not an unreasonable use case
> to want to do this on every page in the system back to back. So per page
> overhead matters. And the odds of stalling on a locked page when
> visiting 1M pages while under load are probably not negligible.
The chances of stalling on a locked page are pretty good, and the
duration of the stall might be long indeed. Perhaps a trylock is a
decent compromise - it depends on the value of this metric, and I've
forgotten what we're talking about ;)
umm, seems that this flag is needed to enable PG_error, PG_dirty,
PG_uptodate and PG_writeback reporting. So simply removing this code
would put a huge hole in the patchset, no?
> Our lock primitives are pretty low overhead in the fast path, but every
> cycle counts. The new tests and branches this code already adds are a
> bit worrisome, but on balance probably worth it.
That should be easy to quantify (hint).
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 23:42 ` Andrew Morton
@ 2009-04-28 23:55 ` Matt Mackall
-1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-28 23:55 UTC (permalink / raw)
To: Andrew Morton
Cc: fengguang.wu, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm
On Tue, 2009-04-28 at 16:42 -0700, Andrew Morton wrote:
> On Tue, 28 Apr 2009 18:31:09 -0500
> Matt Mackall <mpm@selenic.com> wrote:
>
> > On Tue, 2009-04-28 at 16:02 -0700, Andrew Morton wrote:
> > > On Tue, 28 Apr 2009 17:46:34 -0500
> > > Matt Mackall <mpm@selenic.com> wrote:
> > >
> > > > > > +/* a helper function _not_ intended for more general uses */
> > > > > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > > > > +{
> > > > > > + struct address_space *mapping;
> > > > > > +
> > > > > > + if (!PageSlab(page))
> > > > > > + mapping = page_mapping(page);
> > > > > > + else
> > > > > > + mapping = NULL;
> > > > > > +
> > > > > > + return mapping && mapping_cap_writeback_dirty(mapping);
> > > > > > +}
> > > > >
> > > > > If the page isn't locked then page->mapping can be concurrently removed
> > > > > and freed. This actually happened to me in real-life testing several
> > > > > years ago.
> > > >
> > > > We certainly don't want to be taking locks per page to build the flags
> > > > data here. As we don't have any pretense of being atomic, it's ok if we
> > > > can find a way to do the test that's inaccurate when a race occurs, so
> > > > long as it doesn't dereference null.
> > > >
> > > > But if there's not an obvious way to do that, we should probably just
> > > > drop this flag bit for this iteration.
> > >
> > > trylock_page() could be used here, perhaps.
> > >
> > > Then again, why _not_ just do lock_page()? After all, few pages are
> > > ever locked. There will be latency if the caller stumbles across a
> > > page which is under read I/O, but so be it?
> >
> > As I mentioned just a bit ago, it's really not an unreasonable use case
> > to want to do this on every page in the system back to back. So per page
> > overhead matters. And the odds of stalling on a locked page when
> > visiting 1M pages while under load are probably not negligible.
>
> The chances of stalling on a locked page are pretty good, and the
> duration of the stall might be long indeed. Perhaps a trylock is a
> decent compromise - it depends on the value of this metric, and I've
> forgotten what we're talking about ;)
>
> umm, seems that this flag is needed to enable PG_error, PG_dirty,
> PG_uptodate and PG_writeback reporting. So simply removing this code
> would put a huge hole in the patchset, no?
We can report those bits anyway. But this patchset does something
clever: it filters irrelevant (and possibly overloaded) bits in various
contexts.
> > Our lock primitives are pretty low overhead in the fast path, but every
> > cycle counts. The new tests and branches this code already adds are a
> > bit worrisome, but on balance probably worth it.
>
> That should be easy to quantify (hint).
I'll let Fengguang address both these points.
--
http://selenic.com : development and support for Mercurial and Linux
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 21:49 ` Matt Mackall
@ 2009-04-29 0:02 ` Robin Holt
-1 siblings, 0 replies; 137+ messages in thread
From: Robin Holt @ 2009-04-29 0:02 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, Tony Luck, fengguang.wu, mingo, rostedt, fweisbec,
lwoodman, a.p.zijlstra, penberg, eduard.munteanu, linux-kernel,
kosaki.motohiro, andi, adobriyan, linux-mm
On Tue, Apr 28, 2009 at 04:49:55PM -0500, Matt Mackall wrote:
> > Reading the state of all of memory in this fashion would be a somewhat
> > peculiar thing to do.
>
> Not entirely. If you've got, say, a large NUMA box, it could be
> incredibly illustrative to see that "oh, this node is entirely dominated
> by SLAB allocations". Or on a smaller machine "oh, this is fragmented to
> hell and there's no way I'm going to get a huge page". Things you're not
> going to get from individual stats.
I have, in the past, simply used grep on
/sys/devices/system/node/node*/meminfo and gotten the individual stats
I was concerned about. I'm not sure how much more detail would have been
needed or useful, and I can't recall a time when I needed to
write another tool.
Thanks,
Robin
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 21:32 ` Andrew Morton
@ 2009-04-29 2:38 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29 2:38 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm,
Stephen Rothwell, Chandra Seetharaman, Nathan Lynch,
Olof Johansson, Helge Deller
On Wed, Apr 29, 2009 at 05:32:44AM +0800, Andrew Morton wrote:
> On Tue, 28 Apr 2009 09:09:12 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > +/*
> > + * Kernel flags are exported faithfully to Linus and his fellow hackers.
> > + * Otherwise some details are masked to avoid confusing the end user:
> > + * - some kernel flags are completely invisible
> > + * - some kernel flags are conditionally invisible on their odd usages
> > + */
> > +#ifdef CONFIG_DEBUG_KERNEL
> > +static inline int genuine_linus(void) { return 1; }
>
> Although he's a fine chap, the use of the "_linus" tag isn't terribly
> clear (to me). I think what you're saying here is that this enables
> kernel-developer-only features, yes?
Yes.
> If so, perhaps we could come up with an identifier which expresses that
> more clearly.
>
> But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
> for _some_ reason, so what's the point?
Good point! I can confirm my Debian kernel has CONFIG_DEBUG_KERNEL=y!
> It is preferable that we always implement the same interface for all
> Kconfig settings. If this exposes information which is confusing or
> not useful to end-users then so be it - we should be able to cover that
> in supporting documentation.
My original patch took that straightforward approach - and I still like it.
I would be very glad to move the filtering code from kernel to user space.
The use of more obscure flags could be discouraged by _not_ documenting
them. A really curious user is encouraged to refer to the code for the
exact meaning (and perhaps become a kernel developer ;-)
> Also, as mentioned in the other email, it would be good if we were to
> publish a little userspace app which people can use to access this raw
> data. We could give that application an `--i-am-a-kernel-developer'
> option!
OK. I'll include page-types.c in the next take.
> > +#else
> > +static inline int genuine_linus(void) { return 0; }
> > +#endif
>
> This isn't an appropriate use of CONFIG_DEBUG_KERNEL.
>
> DEBUG_KERNEL is a Kconfig-only construct which is used to enable _other_
> debugging features. The way you've used it here, if the person who is
> configuring the kernel wants to enable any other completely-unrelated
> debug feature, they have to enable DEBUG_KERNEL first. But when they
> do that, they unexpectedly alter the behaviour of pagemap!
>
> There are two other places where CONFIG_DEBUG_KERNEL affects code
> generation in .c files: arch/parisc/mm/init.c and
> arch/powerpc/kernel/sysfs.c. These are both wrong, and need slapping ;)
(add cc to related maintainers)
CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means
#ifdef CONFIG_DEBUG_KERNEL == #if 1
as the following patch demonstrates. Now it becomes obviously silly.
diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index 4356ceb..59fb910 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -368,19 +368,19 @@ static void __init setup_bootmem(void)
request_resource(&sysram_resources[0], &pdcdata_resource);
}
void free_initmem(void)
{
unsigned long addr, init_begin, init_end;
printk(KERN_INFO "Freeing unused kernel memory: ");
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
/* Attempt to catch anyone trying to execute code here
* by filling the page with BRK insns.
*
* If we disable interrupts for all CPUs, then IPI stops working.
* Kinda breaks the global cache flushing.
*/
local_irq_disable();
memset(__init_begin, 0x00,
@@ -519,19 +519,19 @@ void __init mem_init(void)
printk(KERN_INFO "Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data, %dk init)\n",
(unsigned long)nr_free_pages() << (PAGE_SHIFT-10),
num_physpages << (PAGE_SHIFT-10),
codesize >> 10,
reservedpages << (PAGE_SHIFT-10),
datasize >> 10,
initsize >> 10
);
-#ifdef CONFIG_DEBUG_KERNEL /* double-sanity-check paranoia */
+#if 1 /* double-sanity-check paranoia */
printk("virtual kernel memory layout:\n"
" vmalloc : 0x%p - 0x%p (%4ld MB)\n"
" memory : 0x%p - 0x%p (%4ld MB)\n"
" .init : 0x%p - 0x%p (%4ld kB)\n"
" .data : 0x%p - 0x%p (%4ld kB)\n"
" .text : 0x%p - 0x%p (%4ld kB)\n",
(void*)VMALLOC_START, (void*)VMALLOC_END,
(VMALLOC_END - VMALLOC_START) >> 20,
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index f41aec8..0d54c6b 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -212,19 +212,19 @@ static SYSDEV_ATTR(purr, 0600, show_purr, store_purr);
#endif /* CONFIG_PPC64 */
#ifdef HAS_PPC_PMC_PA6T
SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
SYSFS_PMCSETUP(hid0, SPRN_HID0);
SYSFS_PMCSETUP(hid1, SPRN_HID1);
SYSFS_PMCSETUP(hid4, SPRN_HID4);
SYSFS_PMCSETUP(hid5, SPRN_HID5);
SYSFS_PMCSETUP(ima0, SPRN_PA6T_IMA0);
SYSFS_PMCSETUP(ima1, SPRN_PA6T_IMA1);
SYSFS_PMCSETUP(ima2, SPRN_PA6T_IMA2);
SYSFS_PMCSETUP(ima3, SPRN_PA6T_IMA3);
SYSFS_PMCSETUP(ima4, SPRN_PA6T_IMA4);
@@ -282,19 +282,19 @@ static struct sysdev_attribute classic_pmc_attrs[] = {
static struct sysdev_attribute pa6t_attrs[] = {
_SYSDEV_ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
_SYSDEV_ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
_SYSDEV_ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
_SYSDEV_ATTR(pmc1, 0600, show_pa6t_pmc1, store_pa6t_pmc1),
_SYSDEV_ATTR(pmc2, 0600, show_pa6t_pmc2, store_pa6t_pmc2),
_SYSDEV_ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
_SYSDEV_ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
_SYSDEV_ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
_SYSDEV_ATTR(hid0, 0600, show_hid0, store_hid0),
_SYSDEV_ATTR(hid1, 0600, show_hid1, store_hid1),
_SYSDEV_ATTR(hid4, 0600, show_hid4, store_hid4),
_SYSDEV_ATTR(hid5, 0600, show_hid5, store_hid5),
_SYSDEV_ATTR(ima0, 0600, show_ima0, store_ima0),
_SYSDEV_ATTR(ima1, 0600, show_ima1, store_ima1),
_SYSDEV_ATTR(ima2, 0600, show_ima2, store_ima2),
_SYSDEV_ATTR(ima3, 0600, show_ima3, store_ima3),
_SYSDEV_ATTR(ima4, 0600, show_ima4, store_ima4),
> > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
> > + do { \
> > + if (visible || genuine_linus()) \
> > + uflags |= ((kflags >> kbit) & 1) << ubit; \
> > + } while (0);
>
> Did this have to be implemented as a macro?
>
> It's bad, because it might or might not reference its argument, so if
> someone passes it an expression-with-side-effects, the end result is
> unpredictable. A C function is almost always preferable if possible.
Just tried an inline function; the code size increases slightly:
text data bss dec hex filename
macro 1804 128 0 1932 78c fs/proc/page.o
inline 1828 128 0 1956 7a4 fs/proc/page.o
So I'll keep the macro, but add brackets to make it a bit safer.
Thanks,
Fengguang
^ permalink raw reply related [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-29 2:38 ` Wu Fengguang
@ 2009-04-29 2:55 ` Andrew Morton
-1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-29 2:55 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm,
Stephen Rothwell, Chandra Seetharaman, Nathan Lynch,
Olof Johansson, Helge Deller
On Wed, 29 Apr 2009 10:38:42 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
> > > + do { \
> > > + if (visible || genuine_linus()) \
> > > + uflags |= ((kflags >> kbit) & 1) << ubit; \
> > > + } while (0);
> >
> > Did this have to be implemented as a macro?
> >
> > It's bad, because it might or might not reference its argument, so if
> > someone passes it an expression-with-side-effects, the end result is
> > unpredictable. A C function is almost always preferable if possible.
>
> Just tried inline function, the code size is increased slightly:
>
> text data bss dec hex filename
> macro 1804 128 0 1932 78c fs/proc/page.o
> inline 1828 128 0 1956 7a4 fs/proc/page.o
>
hm, I wonder why. Maybe it fixed a bug ;)
The code is effectively doing

	if (expr1)
		something();
	if (expr1)
		something_else();
	if (expr1)
		something_else2();

etc. Obviously we _hope_ that the compiler turns that into

	if (expr1) {
		something();
		something_else();
		something_else2();
	}

for us, but it would be good to check...
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 23:55 ` Matt Mackall
@ 2009-04-29 3:33 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29 3:33 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, kosaki.motohiro, andi, adobriyan, linux-mm
On Wed, Apr 29, 2009 at 07:55:10AM +0800, Matt Mackall wrote:
> On Tue, 2009-04-28 at 16:42 -0700, Andrew Morton wrote:
> > On Tue, 28 Apr 2009 18:31:09 -0500
> > Matt Mackall <mpm@selenic.com> wrote:
> >
> > > On Tue, 2009-04-28 at 16:02 -0700, Andrew Morton wrote:
> > > > On Tue, 28 Apr 2009 17:46:34 -0500
> > > > Matt Mackall <mpm@selenic.com> wrote:
> > > >
> > > > > > > +/* a helper function _not_ intended for more general uses */
> > > > > > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > > > > > +{
> > > > > > > + struct address_space *mapping;
> > > > > > > +
> > > > > > > + if (!PageSlab(page))
> > > > > > > + mapping = page_mapping(page);
> > > > > > > + else
> > > > > > > + mapping = NULL;
> > > > > > > +
> > > > > > > + return mapping && mapping_cap_writeback_dirty(mapping);
> > > > > > > +}
> > > > > >
> > > > > > If the page isn't locked then page->mapping can be concurrently removed
> > > > > > and freed. This actually happened to me in real-life testing several
> > > > > > years ago.
> > > > >
> > > > > We certainly don't want to be taking locks per page to build the flags
> > > > > data here. As we don't have any pretense of being atomic, it's ok if we
> > > > > can find a way to do the test that's inaccurate when a race occurs, so
> > > > > long as it doesn't dereference null.
> > > > >
> > > > > But if there's not an obvious way to do that, we should probably just
> > > > > drop this flag bit for this iteration.
> > > >
> > > > trylock_page() could be used here, perhaps.
> > > >
> > > > Then again, why _not_ just do lock_page()? After all, few pages are
> > > > ever locked. There will be latency if the caller stumbles across a
> > > > page which is under read I/O, but so be it?
> > >
> > > As I mentioned just a bit ago, it's really not an unreasonable use case
> > > to want to do this on every page in the system back to back. So per page
> > > overhead matters. And the odds of stalling on a locked page when
> > > visiting 1M pages while under load are probably not negligible.
> >
> > The chances of stalling on a locked page are pretty good, and the
> > duration of the stall might be long indeed. Perhaps a trylock is a
> > decent compromise - it depends on the value of this metric, and I've
> > forgotten what we're talking about ;)
> >
> > umm, seems that this flag is needed to enable PG_error, PG_dirty,
> > PG_uptodate and PG_writeback reporting. So simply removing this code
> > would put a huge hole in the patchset, no?
>
> We can report those bits anyway. But this patchset does something
> clever: it filters irrelevant (and possibly overloaded) bits in various
> contexts.
>
> > > Our lock primitives are pretty low overhead in the fast path, but every
> > > cycle counts. The new tests and branches this code already adds are a
> > > bit worrisome, but on balance probably worth it.
> >
> > That should be easy to quantify (hint).
>
> I'll let Fengguang address both these points.
A quick micro bench: 100 runs on another T7300@2GHz 2GB laptop:
	            user   system    total
	no lock    0.270   22.850   23.607
	trylock    0.310   25.890   26.484
	                  +13.3%   +12.2%
But anyway, the plan is to move filtering to user space and eliminate
the complex kernel logic.
The IO filtering is no longer possible in user space, but I didn't see
the error/dirty/writeback bits on this testing system. So I guess it
won't be a big loss.
The huge/gigantic page filtering is also not possible in user space.
So I tend to add a KPF_HUGE flag to distinguish (hardware supported)
huge pages from normal (software) compound pages. Any objections?
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-29 2:55 ` Andrew Morton
@ 2009-04-29 3:48 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29 3:48 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm,
Stephen Rothwell, Chandra Seetharaman, Nathan Lynch,
Olof Johansson, Helge Deller
On Wed, Apr 29, 2009 at 10:55:27AM +0800, Andrew Morton wrote:
> On Wed, 29 Apr 2009 10:38:42 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > > > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
> > > > + do { \
> > > > + if (visible || genuine_linus()) \
> > > > + uflags |= ((kflags >> kbit) & 1) << ubit; \
> > > > + } while (0);
> > >
> > > Did this have to be implemented as a macro?
> > >
> > > It's bad, because it might or might not reference its argument, so if
> > > someone passes it an expression-with-side-effects, the end result is
> > > unpredictable. A C function is almost always preferable if possible.
> >
> > Just tried inline function, the code size is increased slightly:
> >
> > text data bss dec hex filename
> > macro 1804 128 0 1932 78c fs/proc/page.o
> > inline 1828 128 0 1956 7a4 fs/proc/page.o
> >
>
> hm, I wonder why. Maybe it fixed a bug ;)
>
> The code is effectively doing
>
> if (expr1)
> something();
> if (expr1)
> something_else();
> if (expr1)
> something_else2();
>
> etc. Obviously we _hope_ that the compiler turns that into
>
> if (expr1) {
> something();
> something_else();
> something_else2();
> }
>
> for us, but it would be good to check...
By 'expr1', you mean (visible || genuine_linus())?
No, I can confirm the inefficiency does not lie there.
I simplified the kpf_copy_bit() to
#define kpf_copy_bit(uflags, kflags, ubit, kbit) \
uflags |= (((kflags) >> (kbit)) & 1) << (ubit);
or
static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
{
return (((kflags) >> (kbit)) & 1) << (ubit);
}
and double-checked the difference: the gap grows unexpectedly!
text data bss dec hex filename
macro 1829 168 0 1997 7cd fs/proc/page.o
inline 1893 168 0 2061 80d fs/proc/page.o
+3.5%
(note: the larger absolute text size is due to some experimental code elsewhere.)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-29 2:38 ` Wu Fengguang
(?)
@ 2009-04-29 4:41 ` Nathan Lynch
-1 siblings, 0 replies; 137+ messages in thread
From: Nathan Lynch @ 2009-04-29 4:41 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, linux-kernel, kosaki.motohiro, andi, mpm,
adobriyan, linux-mm, Stephen Rothwell, Chandra Seetharaman,
Olof Johansson, Helge Deller, linuxppc-dev
Wu Fengguang <fengguang.wu@intel.com> writes:
> On Wed, Apr 29, 2009 at 05:32:44AM +0800, Andrew Morton wrote:
>> On Tue, 28 Apr 2009 09:09:12 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>>
>> > +/*
>> > + * Kernel flags are exported faithfully to Linus and his fellow hackers.
>> > + * Otherwise some details are masked to avoid confusing the end user:
>> > + * - some kernel flags are completely invisible
>> > + * - some kernel flags are conditionally invisible on their odd usages
>> > + */
>> > +#ifdef CONFIG_DEBUG_KERNEL
>> > +static inline int genuine_linus(void) { return 1; }
>>
>> Although he's a fine chap, the use of the "_linus" tag isn't terribly
>> clear (to me). I think what you're saying here is that this enables
>> kernel-developer-only features, yes?
>
> Yes.
>
>> If so, perhaps we could come up with an identifier which expresses that
>> more clearly.
>>
>> But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
>> for _some_ reason, so what's the point?
At the least, it has not always been so...
>
> Good point! I can confirm my debian has CONFIG_DEBUG_KERNEL=Y!
I can confirm mine does not.
etch-i386:~# uname -a
Linux etch-i386 2.6.18-6-686 #1 SMP Fri Dec 12 16:48:28 UTC 2008 i686 GNU/Linux
etch-i386:~# grep DEBUG_KERNEL /boot/config-2.6.18-6-686
# CONFIG_DEBUG_KERNEL is not set
For what that's worth.
>> It is preferable that we always implement the same interface for all
>> Kconfig settings. If this exposes information which is confusing or
>> not useful to end-users then so be it - we should be able to cover that
>> in supporting documentation.
>
> My original patch takes that straightforward manner - and I still like it.
> I would be very glad to move the filtering code from kernel to user space.
>
> The use of more obscure flags could be discouraged by _not_ documenting
> them. A really curious user is encouraged to refer to the code for the
> exact meaning (and perhaps become a kernel developer ;-)
>
>> Also, as mentioned in the other email, it would be good if we were to
>> publish a little userspace app which people can use to access this raw
>> data. We could give that application an `--i-am-a-kernel-developer'
>> option!
>
> OK. I'll include page-types.c in the next take.
>
>> > +#else
>> > +static inline int genuine_linus(void) { return 0; }
>> > +#endif
>>
>> This isn't an appropriate use of CONFIG_DEBUG_KERNEL.
>>
>> DEBUG_KERNEL is a Kconfig-only construct which is use to enable _other_
>> debugging features. The way you've used it here, if the person who is
>> configuring the kernel wants to enable any other completely-unrelated
>> debug feature, they have to enable DEBUG_KERNEL first. But when they
>> do that, they unexpectedly alter the behaviour of pagemap!
>>
>> There are two other places where CONFIG_DEBUG_KERNEL affects code
>> generation in .c files: arch/parisc/mm/init.c and
>> arch/powerpc/kernel/sysfs.c. These are both wrong, and need slapping ;)
>
> (add cc to related maintainers)
I assume I was cc'd because I've changed arch/powerpc/kernel/sysfs.c a
couple of times in the last year, but I can't claim to maintain that
code. I'm pretty sure I haven't touched the code in question in this
discussion. I've cc'd linuxppc-dev.
> CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means
>
> #ifdef CONFIG_DEBUG_KERNEL == #if 1
>
> as the following patch demos. Now it becomes obviously silly.
Sure, #if 1 is usually silly. But if the point is that DEBUG_KERNEL is
not supposed to directly affect code generation, then I see two options
for powerpc:
- remove the #ifdef CONFIG_DEBUG_KERNEL guards from
arch/powerpc/kernel/sysfs.c, unconditionally enabling the hid/ima
sysfs attributes, or
- define a new config symbol which governs whether those attributes are
enabled, and make it depend on DEBUG_KERNEL
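The second option might look like the following Kconfig sketch (the
symbol name and its exact dependencies are made up for illustration;
it would replace the `#ifdef CONFIG_DEBUG_KERNEL` guards in
arch/powerpc/kernel/sysfs.c):

```kconfig
# Hypothetical symbol; name and placement are illustrative only.
config PA6T_SYSFS_ATTRS
	bool "Expose PA6T HID/IMA SPRs via sysfs"
	depends on DEBUG_KERNEL
	default n
	help
	  Export the hid0-hid5 and ima0-ima4 special-purpose registers
	  as sysfs attributes. Intended for kernel developers only.
	  If unsure, say N.
```

That keeps DEBUG_KERNEL as a pure gate for other options while giving
the attributes a dedicated, visible switch.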
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -212,19 +212,19 @@ static SYSDEV_ATTR(purr, 0600, show_purr, store_purr);
> #endif /* CONFIG_PPC64 */
>
> #ifdef HAS_PPC_PMC_PA6T
> SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
> SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
> SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
> SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
> SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
> SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
> -#ifdef CONFIG_DEBUG_KERNEL
> +#if 1
> SYSFS_PMCSETUP(hid0, SPRN_HID0);
> SYSFS_PMCSETUP(hid1, SPRN_HID1);
> SYSFS_PMCSETUP(hid4, SPRN_HID4);
> SYSFS_PMCSETUP(hid5, SPRN_HID5);
> SYSFS_PMCSETUP(ima0, SPRN_PA6T_IMA0);
> SYSFS_PMCSETUP(ima1, SPRN_PA6T_IMA1);
> SYSFS_PMCSETUP(ima2, SPRN_PA6T_IMA2);
> SYSFS_PMCSETUP(ima3, SPRN_PA6T_IMA3);
> SYSFS_PMCSETUP(ima4, SPRN_PA6T_IMA4);
> @@ -282,19 +282,19 @@ static struct sysdev_attribute classic_pmc_attrs[] = {
> static struct sysdev_attribute pa6t_attrs[] = {
> _SYSDEV_ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
> _SYSDEV_ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
> _SYSDEV_ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
> _SYSDEV_ATTR(pmc1, 0600, show_pa6t_pmc1, store_pa6t_pmc1),
> _SYSDEV_ATTR(pmc2, 0600, show_pa6t_pmc2, store_pa6t_pmc2),
> _SYSDEV_ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
> _SYSDEV_ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
> _SYSDEV_ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
> -#ifdef CONFIG_DEBUG_KERNEL
> +#if 1
> _SYSDEV_ATTR(hid0, 0600, show_hid0, store_hid0),
> _SYSDEV_ATTR(hid1, 0600, show_hid1, store_hid1),
> _SYSDEV_ATTR(hid4, 0600, show_hid4, store_hid4),
> _SYSDEV_ATTR(hid5, 0600, show_hid5, store_hid5),
> _SYSDEV_ATTR(ima0, 0600, show_ima0, store_ima0),
> _SYSDEV_ATTR(ima1, 0600, show_ima1, store_ima1),
> _SYSDEV_ATTR(ima2, 0600, show_ima2, store_ima2),
> _SYSDEV_ATTR(ima3, 0600, show_ima3, store_ima3),
> _SYSDEV_ATTR(ima4, 0600, show_ima4, store_ima4),
>
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-29 4:41 ` Nathan Lynch
(?)
@ 2009-04-29 4:50 ` Andrew Morton
-1 siblings, 0 replies; 137+ messages in thread
From: Andrew Morton @ 2009-04-29 4:50 UTC (permalink / raw)
To: Nathan Lynch
Cc: Wu Fengguang, linux-kernel, kosaki.motohiro, andi, mpm,
adobriyan, linux-mm, Stephen Rothwell, Chandra Seetharaman,
Olof Johansson, Helge Deller, linuxppc-dev
On Tue, 28 Apr 2009 23:41:52 -0500 Nathan Lynch <ntl@pobox.com> wrote:
> > CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means
> >
> > #ifdef CONFIG_DEBUG_KERNEL == #if 1
> >
> > as the following patch demos. Now it becomes obviously silly.
>
> Sure, #if 1 is usually silly. But if the point is that DEBUG_KERNEL is
> not supposed to directly affect code generation, then I see two options
> for powerpc:
>
> - remove the #ifdef CONFIG_DEBUG_KERNEL guards from
> arch/powerpc/kernel/sysfs.c, unconditionally enabling the hid/ima
> sysfs attributes, or
>
> - define a new config symbol which governs whether those attributes are
> enabled, and make it depend on DEBUG_KERNEL
yup.
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-29 3:48 ` Wu Fengguang
@ 2009-04-29 5:09 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29 5:09 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, kosaki.motohiro, andi, mpm, adobriyan, linux-mm,
Stephen Rothwell, Chandra Seetharaman, Nathan Lynch,
Olof Johansson, Helge Deller
On Wed, Apr 29, 2009 at 11:48:29AM +0800, Wu Fengguang wrote:
> On Wed, Apr 29, 2009 at 10:55:27AM +0800, Andrew Morton wrote:
> > On Wed, 29 Apr 2009 10:38:42 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > > > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
> > > > > + do { \
> > > > > + if (visible || genuine_linus()) \
> > > > > + uflags |= ((kflags >> kbit) & 1) << ubit; \
> > > > > + } while (0);
> > > >
> > > > Did this have to be implemented as a macro?
> > > >
> > > > It's bad, because it might or might not reference its argument, so if
> > > > someone passes it an expression-with-side-effects, the end result is
> > > > unpredictable. A C function is almost always preferable if possible.
> > >
> > > Just tried inline function, the code size is increased slightly:
> > >
> > > text data bss dec hex filename
> > > macro 1804 128 0 1932 78c fs/proc/page.o
> > > inline 1828 128 0 1956 7a4 fs/proc/page.o
> > >
> >
> > hm, I wonder why. Maybe it fixed a bug ;)
> >
> > The code is effectively doing
> >
> > if (expr1)
> > something();
> > if (expr1)
> > something_else();
> > if (expr1)
> > something_else2();
> >
> > etc. Obviously we _hope_ that the compiler turns that into
> >
> > if (expr1) {
> > something();
> > something_else();
> > something_else2();
> > }
> >
> > for us, but it would be good to check...
>
> By 'expr1', you mean (visible || genuine_linus())?
>
> No, I can confirm the inefficiency does not lie here.
>
> I simplified the kpf_copy_bit() to
>
> #define kpf_copy_bit(uflags, kflags, ubit, kbit) \
> uflags |= (((kflags) >> (kbit)) & 1) << (ubit);
>
> or
>
> static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
> {
> return (((kflags) >> (kbit)) & 1) << (ubit);
> }
>
> and double checked the differences: the gap grows unexpectedly!
>
> text data bss dec hex filename
> macro 1829 168 0 1997 7cd fs/proc/page.o
> inline 1893 168 0 2061 80d fs/proc/page.o
> +3.5%
>
> (note: the larger absolute text size is due to some experimental code elsewhere.)
Wow, after simplifications the text size goes down by 13.2%:
text data bss dec hex filename
macro 1644 8 0 1652 674 fs/proc/page.o
inline 1644 8 0 1652 674 fs/proc/page.o
Amazingly we can now use an inline function without any code-size penalty!
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-28 17:49 ` Matt Mackall
@ 2009-04-29 8:05 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-29 8:05 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
Alexey Dobriyan, linux-mm
On Wed, Apr 29, 2009 at 01:49:21AM +0800, Matt Mackall wrote:
> On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> > plain text document attachment (kpageflags-extending.patch)
> > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
>
> My only concern with this patch is it knows a bit too much about SLUB
> internals (and perhaps not enough about SLOB, which also overloads
> flags).
Yup. PG_private=PG_slob_free is not masked because SLOB actually does
not set PG_slab at all. I wonder if it's safe to do this change:
/* SLOB */
- PG_slob_page = PG_active,
+ PG_slob_page = PG_slab,
PG_slob_free = PG_private,
In the page-types output:
flags page-count MB symbolic-flags long-symbolic-flags
0x000800000040 7113 27 ______A_________________P____ active,private
0x000000000040 66 0 ______A______________________ active
The above two lines are obviously for SLOB pages. It indicates lots of
free SLOB pages. So my question is:
- Do you have other means to get the nr_free_slobs info? (I found none in the code)
or
- Will exporting the SL*B overloaded flags help?
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-29 8:05 ` Wu Fengguang
@ 2009-04-29 19:13 ` Matt Mackall
-1 siblings, 0 replies; 137+ messages in thread
From: Matt Mackall @ 2009-04-29 19:13 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
Alexey Dobriyan, linux-mm
On Wed, 2009-04-29 at 16:05 +0800, Wu Fengguang wrote:
> On Wed, Apr 29, 2009 at 01:49:21AM +0800, Matt Mackall wrote:
> > On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> > > plain text document attachment (kpageflags-extending.patch)
> > > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> >
> > My only concern with this patch is it knows a bit too much about SLUB
> > internals (and perhaps not enough about SLOB, which also overloads
> > flags).
>
> Yup. PG_private=PG_slob_free is not masked because SLOB actually does
> not set PG_slab at all. I wonder if it's safe to do this change:
>
> /* SLOB */
> - PG_slob_page = PG_active,
> + PG_slob_page = PG_slab,
> PG_slob_free = PG_private,
Yep.
> In the page-types output:
>
> flags page-count MB symbolic-flags long-symbolic-flags
> 0x000800000040 7113 27 ______A_________________P____ active,private
> 0x000000000040 66 0 ______A______________________ active
>
> The above two lines are obviously for SLOB pages. It indicates lots of
> free SLOB pages. So my question is:
Free here just means partially allocated.
> - Do you have other means to get the nr_free_slobs info? (I found none in the code)
> or
> - Will exporting the SL*B overloaded flags help?
Yes, it's useful.
--
http://selenic.com : development and support for Mercurial and Linux
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
2009-04-29 19:13 ` Matt Mackall
@ 2009-04-30 1:00 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-04-30 1:00 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, LKML, KOSAKI Motohiro, Andi Kleen,
Alexey Dobriyan, linux-mm
On Thu, Apr 30, 2009 at 03:13:56AM +0800, Matt Mackall wrote:
> On Wed, 2009-04-29 at 16:05 +0800, Wu Fengguang wrote:
> > On Wed, Apr 29, 2009 at 01:49:21AM +0800, Matt Mackall wrote:
> > > On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> > > > plain text document attachment (kpageflags-extending.patch)
> > > > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> > >
> > > My only concern with this patch is it knows a bit too much about SLUB
> > > internals (and perhaps not enough about SLOB, which also overloads
> > > flags).
> >
> > Yup. PG_private=PG_slob_free is not masked because SLOB actually does
> > not set PG_slab at all. I wonder if it's safe to do this change:
> >
> > /* SLOB */
> > - PG_slob_page = PG_active,
> > + PG_slob_page = PG_slab,
> > PG_slob_free = PG_private,
>
> Yep.
OK. I'll do it - for consistency.
> > In the page-types output:
> >
> > flags page-count MB symbolic-flags long-symbolic-flags
> > 0x000800000040 7113 27 ______A_________________P____ active,private
> > 0x000000000040 66 0 ______A______________________ active
> >
> > The above two lines are obviously for SLOB pages. It indicates lots of
> > free SLOB pages. So my question is:
>
> Free here just means partially allocated.
Yes, I realized this when lying in bed ;-)
> > - Do you have other means to get the nr_free_slobs info? (I found none in the code)
> > or
> > - Will exporting the SL*B overloaded flags help?
>
> Yes, it's useful.
Thank you. SLUB/SLOB overload different page flags, so it's possible
for user space tools to restore their real meanings - ugly but useful.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
2009-04-28 13:31 ` Wu Fengguang
@ 2009-05-12 13:01 ` Frederic Weisbecker
-1 siblings, 0 replies; 137+ messages in thread
From: Frederic Weisbecker @ 2009-05-12 13:01 UTC (permalink / raw)
To: Wu Fengguang
Cc: Ingo Molnar, Li Zefan, Tom Zanussi, KOSAKI Motohiro,
Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> >
> >
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > > The above 'get object state' interface (which allows passive
> > > > sampling) - integrated into the tracing framework - would serve
> > > > that goal, agreed?
> > >
> > > Agreed. That could in theory be a good complement to dynamic
> > > tracing.
> > >
> > > Then what will be the canonical form for all the 'get object
> > > state' interfaces - "object.attr=value", or whatever? [...]
> >
> > Lemme outline what i'm thinking of.
> >
> > I'd call the feature "object collection tracing", which would live
> > in /debug/tracing, accessed via such files:
> >
> > /debug/tracing/objects/mm/pages/
> > /debug/tracing/objects/mm/pages/format
> > /debug/tracing/objects/mm/pages/filter
> > /debug/tracing/objects/mm/pages/trace_pipe
> > /debug/tracing/objects/mm/pages/stats
> > /debug/tracing/objects/mm/pages/events/
> >
> > here's the (proposed) semantics of those files:
> >
> > 1) /debug/tracing/objects/mm/pages/
> >
> > There's a subsystem / object basic directory structure to make it
> > easy and intuitive to find our way around there.
> >
> > 2) /debug/tracing/objects/mm/pages/format
> >
> > the format file:
> >
> > /debug/tracing/objects/mm/pages/format
> >
> > Would reuse the existing dynamic-tracepoint structured-logging
> > descriptor format and code (this is upstream already):
> >
> > [root@phoenix sched_signal_send]# pwd
> > /debug/tracing/events/sched/sched_signal_send
> >
> > [root@phoenix sched_signal_send]# cat format
> > name: sched_signal_send
> > ID: 24
> > format:
> > field:unsigned short common_type; offset:0; size:2;
> > field:unsigned char common_flags; offset:2; size:1;
> > field:unsigned char common_preempt_count; offset:3; size:1;
> > field:int common_pid; offset:4; size:4;
> > field:int common_tgid; offset:8; size:4;
> >
> > field:int sig; offset:12; size:4;
> > field:char comm[TASK_COMM_LEN]; offset:16; size:16;
> > field:pid_t pid; offset:32; size:4;
> >
> > print fmt: "sig: %d task %s:%d", REC->sig, REC->comm, REC->pid
> >
> > These format descriptors enumerate fields, types and sizes, in a
> > structured way that user-space tools can parse easily. (The binary
> > records that come from the trace_pipe file follow this format
> > description.)
> >
> > 3) /debug/tracing/objects/mm/pages/filter
> >
> > This is the tracing filter that can be set based on the 'format'
> > descriptor. So with the above (signal-send tracepoint) you can
> > define such filter expressions:
> >
> > echo "(sig == 10 && comm == bash) || sig == 13" > filter
> >
> > To restrict the 'scope' of the object collection along pretty much
> > any key or combination of keys. (Or you can leave it as it is and
> > dump all objects and do keying in user-space.)
> >
> > [ Using in-kernel filtering is obviously faster than streaming it
> > out to user-space - but there might be details and types of
> > visualization you want to do in user-space - so we don't want to
> > restrict things here. ]
> >
> > For the mm object collection tracepoint i could imagine such filter
> > expressions:
> >
> > echo "type == shared && file == /sbin/init" > filter
> >
> > To dump all shared pages that are mapped to /sbin/init.
> >
> > 4) /debug/tracing/objects/mm/pages/trace_pipe
> >
> > The 'trace_pipe' file can be used to dump all objects in the
> > collection, which match the filter ('all objects' by default). The
> > record format is described in 'format'.
> >
> > trace_pipe would be a reuse of the existing trace_pipe code: it is a
> > modern, poll()-able, read()-able, splice()-able pipe abstraction.
> >
> > 5) /debug/tracing/objects/mm/pages/stats
> >
> > The 'stats' file would be a reuse of the existing histogram code of
> > the tracing code. We already make use of it for the branch tracers
> > and for the workqueue tracer - it could be extended to be applicable
> > to object collections as well.
> >
> > The advantage there would be that there's no dumping at all - all
> > the integration is done straight in the kernel. ( The 'filter'
> > condition is listened to - increasing flexibility. The filter file
> > could perhaps also act as a default histogram key. )
> >
> > 6) /debug/tracing/objects/mm/pages/events/
> >
> > The 'events' directory offers links back to existing dynamic
> > tracepoints that are under /debug/tracing/events/. This would serve
> > as an additional coherent force that keeps dynamic tracepoints
> > collected by subsystem and by object type as well. (Tools could make
> > use of this information as well - without being aware of actual
> > object semantics.)
> >
> >
> > There would be a number of other object collections we could
> > enumerate:
> >
> > tasks:
> >
> > /debug/tracing/objects/sched/tasks/
> >
> > active inodes known to the kernel:
> >
> > /debug/tracing/objects/fs/inodes/
> >
> > interrupts:
> >
> > /debug/tracing/objects/hw/irqs/
> >
> > etc.
> >
> > These would use the same 'object collection' framework. Once done we
> > can use it for many other things too.
> >
> > Note how organically integrated it all is with the tracing
> > framework. You could start from an 'object view' to get an overview
> > and then go towards a more dynamic view of specific object
> > attributes (or specific objects), as you drill down on a specific
> > problem you want to analyze.
> >
> > How does this all sound to you?
>
> Great! I saw much opportunity to adapt the not yet submitted
> /proc/filecache interface to the proposed framework.
>
> Its basic form is:
>
> # ino size cached cached% refcnt state age accessed process dev file
> [snip]
> 320 1 4 100 1 D- 50443 1085 udevd 00:11(tmpfs) /.udev/uevent_seqnum
> 460725 123 124 100 35 -- 50444 6795 touch 08:02(sda2) /lib/libpthread-2.9.so
> 460727 31 32 100 14 -- 50444 2007 touch 08:02(sda2) /lib/librt-2.9.so
> 458865 97 80 82 1 -- 50444 49 mount 08:02(sda2) /lib/libdevmapper.so.1.02.1
> 460090 15 16 100 1 -- 50444 48 mount 08:02(sda2) /lib/libuuid.so.1.2
> 458866 46 48 100 1 -- 50444 47 mount 08:02(sda2) /lib/libblkid.so.1.0
> 460732 43 44 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_nis-2.9.so
> 460739 87 88 100 73 -- 50444 3597 rcS 08:02(sda2) /lib/libnsl-2.9.so
> 460726 31 32 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_compat-2.9.so
> 458804 250 252 100 11 -- 50445 8175 rcS 08:02(sda2) /lib/libncurses.so.5.6
> 229540 780 752 96 3 -- 50445 7594 init 08:02(sda2) /bin/bash
> 460735 15 16 100 89 -- 50445 17581 init 08:02(sda2) /lib/libdl-2.9.so
> 460721 1344 1340 99 117 -- 50445 48732 init 08:02(sda2) /lib/libc-2.9.so
> 458801 107 104 97 24 -- 50445 3586 init 08:02(sda2) /lib/libselinux.so.1
> 671870 37 24 65 1 -- 50446 1 swapper 08:02(sda2) /sbin/init
> 175 1 24412 100 1 -- 50446 0 swapper 00:01(rootfs) /dev/root
>
> The patch basically does a traversal through one or more of the inode
> lists to produce the output:
> inode_in_use
> inode_unused
> sb->s_dirty
> sb->s_io
> sb->s_more_io
> sb->s_inodes
>
> The filtering feature is a necessity for this interface - or it will
> take considerable time to do a full listing. It supports the following
> filters:
> { LS_OPT_DIRTY, "dirty" },
> { LS_OPT_CLEAN, "clean" },
> { LS_OPT_INUSE, "inuse" },
> { LS_OPT_EMPTY, "empty" },
> { LS_OPT_ALL, "all" },
> { LS_OPT_DEV, "dev=%s" },
>
> There are two possible challenges for the conversion:
>
> - One trick it does is to select different lists to traverse on
> different filter options. Will this be possible in the object
> tracing framework?
Yeah, I guess.
> - The file name lookup(last field) is the performance killer. Is it
> possible to skip the file name lookup when the filter failed on the
> leading fields?
Object collection is built on trace events, where filters basically
ignore a whole entry when it doesn't match. Not sure if we can easily
ignore only one field.
But I guess we can do something about the performance...
Could you send us the (sob'ed) patch you made which implements this?
I could try to adapt it to object collection.
Thanks,
Frederic.
> Will the object tracing interface allow such flexibility?
> (Sorry I'm not yet familiar with the tracing framework.)
>
> > Can you see any conceptual holes in the scheme, any use-case that
> > /proc/kpageflags supports but the object collection approach does
> > not?
>
> kpageflags is simply a big (perhaps sparse) binary array.
> I'd still prefer to retain its current form - the kernel patches and
> user space tools are all ready made, and I see no benefits in
> converting to the tracing framework.
>
> > Would you be interested in seeing something like this, if we tried
> > to implement it in the tracing tree? The majority of the code
> > already exists, we just need interest from the MM side and we have
> > to hook it all up. (it is by no means trivial to do - but looks like
> > a very exciting feature.)
>
> Definitely! /proc/filecache has another 'page view':
>
> # head /proc/filecache
> # file /bin/bash
> # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback X:readahead P:private O:owner b:buffer d:dirty w:writeback
> # idx len state refcnt
> 0 1 RAMU________ 4
> 3 8 RAMU________ 4
> 12 1 RAMU________ 4
> 14 5 RAMU________ 4
> 20 7 RAMU________ 4
> 27 2 RAMU________ 5
> 29 1 RAMU________ 4
>
> Which is also a good candidate. However I still need to investigate
> whether it offers considerable margins over the mincore() syscall.
>
> Thanks and Regards,
> Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
@ 2009-05-12 13:01 ` Frederic Weisbecker
0 siblings, 0 replies; 137+ messages in thread
From: Frederic Weisbecker @ 2009-05-12 13:01 UTC (permalink / raw)
To: Wu Fengguang
Cc: Ingo Molnar, Li Zefan, Tom Zanussi, KOSAKI Motohiro,
Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> > tent-Transfer-Encoding: quoted-printable
> > Status: RO
> > Content-Length: 5480
> > Lines: 161
> >
> >
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > > The above 'get object state' interface (which allows passive
> > > > sampling) - integrated into the tracing framework - would serve
> > > > that goal, agreed?
> > >
> > > Agreed. That could in theory a good complement to dynamic
> > > tracings.
> > >
> > > Then what will be the canonical form for all the 'get object
> > > state' interfaces - "object.attr=value", or whatever? [...]
> >
> > Lemme outline what i'm thinking of.
> >
> > I'd call the feature "object collection tracing", which would live
> > in /debug/tracing, accessed via such files:
> >
> > /debug/tracing/objects/mm/pages/
> > /debug/tracing/objects/mm/pages/format
> > /debug/tracing/objects/mm/pages/filter
> > /debug/tracing/objects/mm/pages/trace_pipe
> > /debug/tracing/objects/mm/pages/stats
> > /debug/tracing/objects/mm/pages/events/
> >
> > here's the (proposed) semantics of those files:
> >
> > 1) /debug/tracing/objects/mm/pages/
> >
> > There's a subsystem / object basic directory structure to make it
> > easy and intuitive to find our way around there.
> >
> > 2) /debug/tracing/objects/mm/pages/format
> >
> > the format file:
> >
> > /debug/tracing/objects/mm/pages/format
> >
> > Would reuse the existing dynamic-tracepoint structured-logging
> > descriptor format and code (this is upstream already):
> >
> > [root@phoenix sched_signal_send]# pwd
> > /debug/tracing/events/sched/sched_signal_send
> >
> > [root@phoenix sched_signal_send]# cat format
> > name: sched_signal_send
> > ID: 24
> > format:
> > field:unsigned short common_type; offset:0; size:2;
> > field:unsigned char common_flags; offset:2; size:1;
> > field:unsigned char common_preempt_count; offset:3; size:1;
> > field:int common_pid; offset:4; size:4;
> > field:int common_tgid; offset:8; size:4;
> >
> > field:int sig; offset:12; size:4;
> > field:char comm[TASK_COMM_LEN]; offset:16; size:16;
> > field:pid_t pid; offset:32; size:4;
> >
> > print fmt: "sig: %d task %s:%d", REC->sig, REC->comm, REC->pid
> >
> > These format descriptors enumerate fields, types and sizes, in a
> > structured way that user-space tools can parse easily. (The binary
> > records that come from the trace_pipe file follow this format
> > description.)
> >
> > 3) /debug/tracing/objects/mm/pages/filter
> >
> > This is the tracing filter that can be set based on the 'format'
> > descriptor. So with the above (signal-send tracepoint) you can
> > define such filter expressions:
> >
> > echo "(sig == 10 && comm == bash) || sig == 13" > filter
> >
> > To restrict the 'scope' of the object collection along pretty much
> > any key or combination of keys. (Or you can leave it as it is and
> > dump all objects and do keying in user-space.)
> >
> > [ Using in-kernel filtering is obviously faster that streaming it
> > out to user-space - but there might be details and types of
> > visualization you want to do in user-space - so we dont want to
> > restrict things here. ]
> >
> > For the mm object collection tracepoint i could imagine such filter
> > expressions:
> >
> > echo "type == shared && file == /sbin/init" > filter
> >
> > To dump all shared pages that are mapped to /sbin/init.
> >
> > 4) /debug/tracing/objects/mm/pages/trace_pipe
> >
> > The 'trace_pipe' file can be used to dump all objects in the
> > collection, which match the filter ('all objects' by default). The
> > record format is described in 'format'.
> >
> > trace_pipe would be a reuse of the existing trace_pipe code: it is a
> > modern, poll()-able, read()-able, splice()-able pipe abstraction.
> >
> > 5) /debug/tracing/objects/mm/pages/stats
> >
> > The 'stats' file would be a reuse of the existing histogram code of
> > the tracing code. We already make use of it for the branch tracers
> > and for the workqueue tracer - it could be extended to be applicable
> > to object collections as well.
> >
> > The advantage there would be that there's no dumping at all - all
> > the integration is done straight in the kernel. ( The 'filter'
> > condition is listened to - increasing flexibility. The filter file
> > could perhaps also act as a default histogram key. )
> >
> > 6) /debug/tracing/objects/mm/pages/events/
> >
> > The 'events' directory offers links back to existing dynamic
> > tracepoints that are under /debug/tracing/events/. This would serve
> > as an additional coherent force that keeps dynamic tracepoints
> > collected by subsystem and by object type as well. (Tools could make
> > use of this information as well - without being aware of actual
> > object semantics.)
> >
> >
> > There would be a number of other object collections we could
> > enumerate:
> >
> > tasks:
> >
> > /debug/tracing/objects/sched/tasks/
> >
> > active inodes known to the kernel:
> >
> > /debug/tracing/objects/fs/inodes/
> >
> > interrupts:
> >
> > /debug/tracing/objects/hw/irqs/
> >
> > etc.
> >
> > These would use the same 'object collection' framework. Once done we
> > can use it for many other thing too.
> >
> > Note how organically integrated it all is with the tracing
> > framework. You could start from an 'object view' to get an overview
> > and then go towards a more dynamic view of specific object
> > attributes (or specific objects), as you drill down on a specific
> > problem you want to analyze.
> >
> > How does this all sound to you?
>
> Great! I saw much opportunity to adapt the not yet submitted
> /proc/filecache interface to the proposed framework.
>
> Its basic form is:
>
> # ino size cached cached% refcnt state age accessed process dev file
> [snip]
> 320 1 4 100 1 D- 50443 1085 udevd 00:11(tmpfs) /.udev/uevent_seqnum
> 460725 123 124 100 35 -- 50444 6795 touch 08:02(sda2) /lib/libpthread-2.9.so
> 460727 31 32 100 14 -- 50444 2007 touch 08:02(sda2) /lib/librt-2.9.so
> 458865 97 80 82 1 -- 50444 49 mount 08:02(sda2) /lib/libdevmapper.so.1.02.1
> 460090 15 16 100 1 -- 50444 48 mount 08:02(sda2) /lib/libuuid.so.1.2
> 458866 46 48 100 1 -- 50444 47 mount 08:02(sda2) /lib/libblkid.so.1.0
> 460732 43 44 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_nis-2.9.so
> 460739 87 88 100 73 -- 50444 3597 rcS 08:02(sda2) /lib/libnsl-2.9.so
> 460726 31 32 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_compat-2.9.so
> 458804 250 252 100 11 -- 50445 8175 rcS 08:02(sda2) /lib/libncurses.so.5.6
> 229540 780 752 96 3 -- 50445 7594 init 08:02(sda2) /bin/bash
> 460735 15 16 100 89 -- 50445 17581 init 08:02(sda2) /lib/libdl-2.9.so
> 460721 1344 1340 99 117 -- 50445 48732 init 08:02(sda2) /lib/libc-2.9.so
> 458801 107 104 97 24 -- 50445 3586 init 08:02(sda2) /lib/libselinux.so.1
> 671870 37 24 65 1 -- 50446 1 swapper 08:02(sda2) /sbin/init
> 175 1 24412 100 1 -- 50446 0 swapper 00:01(rootfs) /dev/root
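(Side note: a userspace consumer could split each index line above with
sscanf(). A minimal sketch only; the column layout is taken from the sample
output above and is version-dependent, and header lines starting with '#'
are assumed to be skipped by the caller:)

```c
#include <stdio.h>

/*
 * One parsed /proc/filecache index line. The field order follows the
 * sample listing above (ino size cached cached% refcnt state age
 * accessed process dev file); treat it as illustrative.
 */
struct fc_line {
	unsigned long ino, size_kb, cached_kb;
	int cached_pct, refcnt;
	char state[4];
	unsigned long age, accessed;
	char process[32], dev[32], file[256];
};

/* Returns 1 if all eleven fields were parsed, 0 otherwise. */
int fc_parse(const char *line, struct fc_line *fc)
{
	return sscanf(line,
		      "%lu %lu %lu %d %d %3s %lu %lu %31s %31s %255s",
		      &fc->ino, &fc->size_kb, &fc->cached_kb,
		      &fc->cached_pct, &fc->refcnt, fc->state,
		      &fc->age, &fc->accessed,
		      fc->process, fc->dev, fc->file) == 11;
}
```

File names are escaped by seq_path(), so a plain whitespace split like this
does not break on embedded spaces.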
>
> The patch basically does a traversal through one or more of the inode
> lists to produce the output:
> inode_in_use
> inode_unused
> sb->s_dirty
> sb->s_io
> sb->s_more_io
> sb->s_inodes
>
> The filtering feature is a necessity for this interface - or it will
> take considerable time to do a full listing. It supports the following
> filters:
> { LS_OPT_DIRTY, "dirty" },
> { LS_OPT_CLEAN, "clean" },
> { LS_OPT_INUSE, "inuse" },
> { LS_OPT_EMPTY, "empty" },
> { LS_OPT_ALL, "all" },
> { LS_OPT_DEV, "dev=%s" },
>
> There are two possible challenges for the conversion:
>
> - One trick it does is to select different lists to traverse on
> different filter options. Will this be possible in the object
> tracing framework?
Yeah, I guess.
> - The file name lookup (last field) is the performance killer. Is it
> possible to skip the file name lookup when the filter has already
> failed on the leading fields?
Object collections are built on trace events, where a filter that does
not match basically drops the whole entry. Not sure if we can easily
skip just one field.
But I guess we can do something about the performance...
Could you send us the (sob'ed) patch you made which implements this?
I could try to adapt it to object collections.
Thanks,
Frederic.
> Will the object tracing interface allow such flexibilities?
> (Sorry I'm not yet familiar with the tracing framework.)
>
> > Can you see any conceptual holes in the scheme, any use-case that
> > /proc/kpageflags supports but the object collection approach does
> > not?
>
> kpageflags is simply a big (perhaps sparse) binary array.
> I'd still prefer to retain its current form - the kernel patches and
> user space tools are all ready-made, and I see no benefit in
> converting to the tracing framework.
>
> > Would you be interested in seeing something like this, if we tried
> > to implement it in the tracing tree? The majority of the code
> > already exists, we just need interest from the MM side and we have
> > to hook it all up. (it is by no means trivial to do - but looks like
> > a very exciting feature.)
>
> Definitely! /proc/filecache has another 'page view':
>
> # head /proc/filecache
> # file /bin/bash
> # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback X:readahead P:private O:owner b:buffer d:dirty w:writeback
> # idx len state refcnt
> 0 1 RAMU________ 4
> 3 8 RAMU________ 4
> 12 1 RAMU________ 4
> 14 5 RAMU________ 4
> 20 7 RAMU________ 4
> 27 2 RAMU________ 5
> 29 1 RAMU________ 4
>
> Which is also a good candidate. However, I still need to investigate
> whether it offers a considerable margin over the mincore() syscall.
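(For reference, the mincore() path mentioned above can be sketched in a few
lines of userspace C. This only illustrates the syscall, with no claim that
it matches /proc/filecache's page accounting; large files would need
chunked mappings:)

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * count_resident - count how many pages of a file are currently in
 * the page cache, as seen from userspace via mincore(2).
 * Returns the resident page count, or -1 on error.
 */
long count_resident(const char *path)
{
	struct stat st;
	long pagesize = sysconf(_SC_PAGESIZE);
	long npages, resident = 0, i;
	unsigned char *vec;
	void *map;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	if (st.st_size == 0) {
		close(fd);
		return 0;
	}
	npages = (st.st_size + pagesize - 1) / pagesize;
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	/* mincore() fills one status byte per page; bit 0 = resident. */
	vec = malloc(npages);
	if (vec && mincore(map, st.st_size, vec) == 0) {
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;
	} else {
		resident = -1;
	}
	free(vec);
	munmap(map, st.st_size);
	return resident;
}
```

The obvious limitation versus /proc/filecache is that mincore() needs the
file name up front and a mapping per file, while the index listing covers
all cached inodes in one pass.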
>
> Thanks and Regards,
> Fengguang
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
2009-05-12 13:01 ` Frederic Weisbecker
@ 2009-05-17 13:36 ` Wu Fengguang
2009-05-17 13:55 ` Frederic Weisbecker
2009-05-18 11:44 ` KOSAKI Motohiro
-1 siblings, 2 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-05-17 13:36 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Ingo Molnar, Li Zefan, Tom Zanussi, KOSAKI Motohiro,
Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
[-- Attachment #1: Type: text/plain, Size: 3473 bytes --]
On Tue, May 12, 2009 at 09:01:12PM +0800, Frederic Weisbecker wrote:
> On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> > On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> >
> > There are two possible challenges for the conversion:
> >
> > - One trick it does is to select different lists to traverse on
> > different filter options. Will this be possible in the object
> > tracing framework?
>
> Yeah, I guess.
Great.
>
> > - The file name lookup (last field) is the performance killer. Is it
> > possible to skip the file name lookup when the filter has already
> > failed on the leading fields?
>
> Object collections are built on trace events, where a filter that does
> not match basically drops the whole entry. Not sure if we can easily
> skip just one field.
>
> But I guess we can do something about the performances...
OK, but it's not as important as the previous requirement, so it could
be the last thing to work on :)
> Could you send us the (sob'ed) patch you made which implements this.
> I could try to adapt it to object collection.
Attached for your reference. Be aware that I still plan to change it
in non-trivial ways, and there is ongoing work by Nick (on inode_lock)
and Jens (on s_dirty) that could create merge conflicts.
So basically it is not the right time to do the adaptation.
However, we can still do something to polish up the page object
collection under /debug/tracing/objects/mm/pages/. For example,
the timestamp and function name columns could be removed from the
following listing :)
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<...>-3743 [001] 3035.649769: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0
<...>-3743 [001] 3044.176403: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0
<...>-3743 [001] 3044.176407: dump_pages: pfn=2 flags=400 count=1 mapcount=0 index=0
<...>-3743 [001] 3044.176408: dump_pages: pfn=3 flags=400 count=1 mapcount=0 index=0
<...>-3743 [001] 3044.176409: dump_pages: pfn=4 flags=400 count=1 mapcount=0 index=0
<...>-3743 [001] 3044.176409: dump_pages: pfn=5 flags=400 count=1 mapcount=0 index=0
<...>-3743 [001] 3044.176410: dump_pages: pfn=6 flags=400 count=1 mapcount=0 index=0
<...>-3743 [001] 3044.176410: dump_pages: pfn=7 flags=400 count=1 mapcount=0 index=0
<...>-3743 [001] 3044.176411: dump_pages: pfn=8 flags=400 count=1 mapcount=0 index=0
<...>-3743 [001] 3044.176411: dump_pages: pfn=9 flags=400 count=1 mapcount=0 index=0
<...>-3743 [001] 3044.176412: dump_pages: pfn=10 flags=400 count=1 mapcount=0 index=0
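For comparison, tools reading the stable /proc/kpageflags interface (the
subject of this series) can decode each per-pfn word with a small table.
This is a sketch only: bits 0-10 are the pre-existing exported flags and
bits 11-19 the new ones listed at the head of this series, and the table
must match the running kernel. Note the flags=400 values above are raw
page->flags, not these stable KPF_* bits.

```c
#include <stdint.h>
#include <string.h>

/*
 * KPF_* bit names as exported via /proc/kpageflags. Bits 11-19 are
 * the ones added by this patch series; keep in sync with the kernel.
 */
static const char * const kpf_name_tab[] = {
	[0]  = "LOCKED",        [1]  = "ERROR",       [2]  = "REFERENCED",
	[3]  = "UPTODATE",      [4]  = "DIRTY",       [5]  = "LRU",
	[6]  = "ACTIVE",        [7]  = "SLAB",        [8]  = "WRITEBACK",
	[9]  = "RECLAIM",       [10] = "BUDDY",       [11] = "MMAP",
	[12] = "ANON",          [13] = "SWAPCACHE",   [14] = "SWAPBACKED",
	[15] = "COMPOUND_HEAD", [16] = "COMPOUND_TAIL",
	[17] = "UNEVICTABLE",   [18] = "HWPOISON",    [19] = "NOPAGE",
};

/* Write comma-separated names of the set bits into buf; returns buf. */
char *kpf_decode(uint64_t flags, char *buf, size_t size)
{
	unsigned int i;

	buf[0] = '\0';
	for (i = 0; i < sizeof(kpf_name_tab) / sizeof(kpf_name_tab[0]); i++) {
		if (!(flags & (1ULL << i)) || !kpf_name_tab[i])
			continue;
		if (buf[0])
			strncat(buf, ",", size - strlen(buf) - 1);
		strncat(buf, kpf_name_tab[i], size - strlen(buf) - 1);
	}
	return buf;
}
```

A reader would pread() a little-endian u64 at offset pfn * 8 from
/proc/kpageflags and feed it to kpf_decode().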
Thanks,
Fengguang
[-- Attachment #2: filecache-2.6.30.patch --]
[-- Type: text/x-diff, Size: 33820 bytes --]
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -27,6 +27,7 @@ extern unsigned long max_mapnr;
extern unsigned long num_physpages;
extern void * high_memory;
extern int page_cluster;
+extern char * const zone_names[];
#ifdef CONFIG_SYSCTL
extern int sysctl_legacy_va_layout;
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -104,7 +104,7 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
EXPORT_SYMBOL(totalram_pages);
-static char * const zone_names[MAX_NR_ZONES] = {
+char * const zone_names[MAX_NR_ZONES] = {
#ifdef CONFIG_ZONE_DMA
"DMA",
#endif
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1925,7 +1925,10 @@ char *__d_path(const struct path *path,
if (dentry == root->dentry && vfsmnt == root->mnt)
break;
- if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
+ if (unlikely(!vfsmnt)) {
+ if (IS_ROOT(dentry))
+ break;
+ } else if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
/* Global root? */
if (vfsmnt->mnt_parent == vfsmnt) {
goto global_root;
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -564,7 +564,6 @@ out:
}
EXPORT_SYMBOL(radix_tree_tag_clear);
-#ifndef __KERNEL__ /* Only the test harness uses this at present */
/**
* radix_tree_tag_get - get a tag on a radix tree node
* @root: radix tree root
@@ -627,7 +626,6 @@ int radix_tree_tag_get(struct radix_tree
}
}
EXPORT_SYMBOL(radix_tree_tag_get);
-#endif
/**
* radix_tree_next_hole - find the next hole (not-present entry)
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -84,6 +84,10 @@ static struct hlist_head *inode_hashtabl
*/
DEFINE_SPINLOCK(inode_lock);
+EXPORT_SYMBOL(inode_in_use);
+EXPORT_SYMBOL(inode_unused);
+EXPORT_SYMBOL(inode_lock);
+
/*
* iprune_mutex provides exclusion between the kswapd or try_to_free_pages
* icache shrinking path, and the umount path. Without this exclusion,
@@ -110,6 +114,13 @@ static void wake_up_inode(struct inode *
wake_up_bit(&inode->i_state, __I_LOCK);
}
+static inline void inode_created_by(struct inode *inode, struct task_struct *task)
+{
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+ memcpy(inode->i_comm, task->comm, sizeof(task->comm));
+#endif
+}
+
/**
+ * inode_init_always - perform inode structure initialisation
* @sb: superblock inode belongs to
@@ -147,7 +158,7 @@ struct inode *inode_init_always(struct s
inode->i_bdev = NULL;
inode->i_cdev = NULL;
inode->i_rdev = 0;
- inode->dirtied_when = 0;
+ inode->dirtied_when = jiffies;
if (security_inode_alloc(inode))
goto out_free_inode;
@@ -188,6 +199,7 @@ struct inode *inode_init_always(struct s
}
inode->i_private = NULL;
inode->i_mapping = mapping;
+ inode_created_by(inode, current);
return inode;
@@ -276,6 +288,8 @@ void __iget(struct inode *inode)
inodes_stat.nr_unused--;
}
+EXPORT_SYMBOL(__iget);
+
/**
* clear_inode - clear an inode
* @inode: inode to clear
@@ -1459,6 +1473,16 @@ static void __wait_on_freeing_inode(stru
spin_lock(&inode_lock);
}
+
+struct hlist_head * get_inode_hash_budget(unsigned long index)
+{
+ if (index >= (1 << i_hash_shift))
+ return NULL;
+
+ return inode_hashtable + index;
+}
+EXPORT_SYMBOL_GPL(get_inode_hash_budget);
+
static __initdata unsigned long ihash_entries;
static int __init set_ihash_entries(char *str)
{
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -46,6 +46,9 @@
LIST_HEAD(super_blocks);
DEFINE_SPINLOCK(sb_lock);
+EXPORT_SYMBOL(super_blocks);
+EXPORT_SYMBOL(sb_lock);
+
/**
* alloc_super - create new superblock
* @type: filesystem type superblock should belong to
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -262,6 +262,7 @@ unsigned long shrink_slab(unsigned long
up_read(&shrinker_rwsem);
return ret;
}
+EXPORT_SYMBOL(shrink_slab);
/* Called without lock on whether page is mapped, so answer is unstable */
static inline int page_mapping_inuse(struct page *page)
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -45,6 +45,7 @@ struct address_space swapper_space = {
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
};
+EXPORT_SYMBOL_GPL(swapper_space);
#define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
--- linux-2.6.orig/Documentation/filesystems/proc.txt
+++ linux-2.6/Documentation/filesystems/proc.txt
@@ -260,6 +260,7 @@ Table 1-4: Kernel info in /proc
driver Various drivers grouped here, currently rtc (2.4)
execdomains Execdomains, related to security (2.4)
fb Frame Buffer devices (2.4)
+ filecache Query/drop in-memory file cache
fs File system parameters, currently nfs/exports (2.4)
ide Directory containing info about the IDE subsystem
interrupts Interrupt usage
@@ -450,6 +451,88 @@ varies by architecture and compile optio
> cat /proc/meminfo
+..............................................................................
+
+filecache:
+
+Provides access to the in-memory file cache.
+
+To list an index of all cached files:
+
+ echo ls > /proc/filecache
+ cat /proc/filecache
+
+The output looks like:
+
+ # filecache 1.0
+ # ino size cached cached% state refcnt dev file
+ 1026334 91 92 100 -- 66 03:02(hda2) /lib/ld-2.3.6.so
+ 233608 1242 972 78 -- 66 03:02(hda2) /lib/tls/libc-2.3.6.so
+ 65203 651 476 73 -- 1 03:02(hda2) /bin/bash
+ 1026445 261 160 61 -- 10 03:02(hda2) /lib/libncurses.so.5.5
+ 235427 10 12 100 -- 44 03:02(hda2) /lib/tls/libdl-2.3.6.so
+
+FIELD INTRO
+---------------------------------------------------------------------------
+ino inode number
+size inode size in KB
+cached cached size in KB
+cached% percent of file data cached
+state1 '-' clean; 'd' metadata dirty; 'D' data dirty
+state2 '-' unlocked; 'L' locked, normally indicates file being written out
+refcnt file reference count, it's an in-kernel one, not exactly open count
+dev major:minor numbers in hex, followed by a descriptive device name
+file file path _inside_ the filesystem. There are several special names:
+ '(noname)': the file name is not available
+ '(03:02)': the file is a block device file of major:minor
+ '...(deleted)': the named file has been deleted from the disk
+
+To list the cached pages of a particular file:
+
+ echo /bin/bash > /proc/filecache
+ cat /proc/filecache
+
+ # file /bin/bash
+ # flags R:referenced A:active U:uptodate D:dirty W:writeback M:mmap
+ # idx len state refcnt
+ 0 36 RAU__M 3
+ 36 1 RAU__M 2
+ 37 8 RAU__M 3
+ 45 2 RAU___ 1
+ 47 6 RAU__M 3
+ 53 3 RAU__M 2
+ 56 2 RAU__M 3
+
+FIELD INTRO
+----------------------------------------------------------------------------
+idx page index
+len number of pages which are cached and share the same state
+state page state of the flags listed in line two
+refcnt page reference count
+
+Careful users may notice that the file name to be queried is remembered between
+commands. Internally, the module has a global variable that stores the file name
+parameter, so that it can be inherited by newly opened /proc/filecache files.
+However, this can lead to interference between multiple queriers. The rule here
+is that only root may interactively change the file name parameter; normal
+users must use scripts to access the interface. Scripts should do it by
+following the code example below:
+
+ filecache = open("/proc/filecache", "rw");
+ # avoid polluting the global parameter filename
+ filecache.write("set private");
+
+To instruct the kernel to drop clean caches, dentries and inodes from memory,
+causing that memory to become free:
+
+ # drop clean file data cache (i.e. file backed pagecache)
+ echo drop pagecache > /proc/filecache
+
+ # drop clean file metadata cache (i.e. dentries and inodes)
+ echo drop slabcache > /proc/filecache
+
+Note that the drop commands are non-destructive operations and dirty objects
+are not freeable; the user should run `sync' first.
MemTotal: 16344972 kB
MemFree: 13634064 kB
--- /dev/null
+++ linux-2.6/fs/proc/filecache.c
@@ -0,0 +1,1045 @@
+/*
+ * fs/proc/filecache.c
+ *
+ * Copyright (C) 2006, 2007 Fengguang Wu <wfg@mail.ustc.edu.cn>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/radix-tree.h>
+#include <linux/page-flags.h>
+#include <linux/pagevec.h>
+#include <linux/pagemap.h>
+#include <linux/vmalloc.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+#include <linux/parser.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <asm/uaccess.h>
+
+/*
+ * Increase minor version when new columns are added;
+ * Increase major version when existing columns are changed.
+ */
+#define FILECACHE_VERSION "1.0"
+
+/* Internal buffer sizes. The larger the more efficient. */
+#define SBUF_SIZE (128<<10)
+#define IWIN_PAGE_ORDER 3
+#define IWIN_SIZE ((PAGE_SIZE<<IWIN_PAGE_ORDER) / sizeof(struct inode *))
+
+/*
+ * Session management.
+ *
+ * Each opened /proc/filecache file is associated with a session object.
+ * Also there is a global_session that maintains status across open()/close()
+ * (i.e. the lifetime of an opened file), so that a casual user can query the
+ * filecache via _multiple_ simple shell commands like
+ * 'echo cat /bin/bash > /proc/filecache; cat /proc/filecache'.
+ *
+ * session.query_file is the file whose cache info is to be queried.
+ * Its value determines what we get on read():
+ * - NULL: ii_*() called to show the inode index
+ * - filp: pg_*() called to show the page groups of a filp
+ *
+ * session.query_file is
+ * - cloned from global_session.query_file on open();
+ * - updated on write("cat filename");
+ * note that the new file will also be saved in global_session.query_file if
+ * session.private_session is false.
+ */
+
+struct session {
+ /* options */
+ int private_session;
+ unsigned long ls_options;
+ dev_t ls_dev;
+
+ /* parameters */
+ struct file *query_file;
+
+ /* seqfile pos */
+ pgoff_t start_offset;
+ pgoff_t next_offset;
+
+ /* inode at last pos */
+ struct {
+ unsigned long pos;
+ unsigned long state;
+ struct inode *inode;
+ struct inode *pinned_inode;
+ } ipos;
+
+ /* inode window */
+ struct {
+ unsigned long cursor;
+ unsigned long origin;
+ unsigned long size;
+ struct inode **inodes;
+ } iwin;
+};
+
+static struct session global_session;
+
+/*
+ * Session address is stored in proc_file->f_ra.start:
+ * we assume that there will be no readahead for proc_file.
+ */
+static struct session *get_session(struct file *proc_file)
+{
+ return (struct session *)proc_file->f_ra.start;
+}
+
+static void set_session(struct file *proc_file, struct session *s)
+{
+ BUG_ON(proc_file->f_ra.start);
+ proc_file->f_ra.start = (unsigned long)s;
+}
+
+static void update_global_file(struct session *s)
+{
+ if (s->private_session)
+ return;
+
+ if (global_session.query_file)
+ fput(global_session.query_file);
+
+ global_session.query_file = s->query_file;
+
+ if (global_session.query_file)
+ get_file(global_session.query_file);
+}
+
+/*
+ * Cases of the name:
+ * 1) NULL (new session)
+ * s->query_file = global_session.query_file = 0;
+ * 2) "" (ls/la)
+ * s->query_file = global_session.query_file;
+ * 3) a regular file name (cat newfile)
+ * s->query_file = global_session.query_file = newfile;
+ */
+static int session_update_file(struct session *s, char *name)
+{
+ static DEFINE_MUTEX(mutex); /* protects global_session.query_file */
+ int err = 0;
+
+ mutex_lock(&mutex);
+
+ /*
+ * We are to quit, or to list the cached files.
+ * Reset *.query_file.
+ */
+ if (!name) {
+ if (s->query_file) {
+ fput(s->query_file);
+ s->query_file = NULL;
+ }
+ update_global_file(s);
+ goto out;
+ }
+
+ /*
+ * This is a new session.
+ * Inherit options/parameters from global ones.
+ */
+ if (name[0] == '\0') {
+ *s = global_session;
+ if (s->query_file)
+ get_file(s->query_file);
+ goto out;
+ }
+
+ /*
+ * Open the named file.
+ */
+ if (s->query_file)
+ fput(s->query_file);
+ s->query_file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
+ if (IS_ERR(s->query_file)) {
+ err = PTR_ERR(s->query_file);
+ s->query_file = NULL;
+ } else
+ update_global_file(s);
+
+out:
+ mutex_unlock(&mutex);
+
+ return err;
+}
+
+static struct session *session_create(void)
+{
+ struct session *s;
+ int err = 0;
+
+ s = kmalloc(sizeof(*s), GFP_KERNEL);
+ if (s)
+ err = session_update_file(s, "");
+ else
+ err = -ENOMEM;
+
+ return err ? ERR_PTR(err) : s;
+}
+
+static void session_release(struct session *s)
+{
+ if (s->ipos.pinned_inode)
+ iput(s->ipos.pinned_inode);
+ if (s->query_file)
+ fput(s->query_file);
+ kfree(s);
+}
+
+
+/*
+ * Listing of cached files.
+ *
+ * Usage:
+ * echo > /proc/filecache # enter listing mode
+ * cat /proc/filecache # get the file listing
+ */
+
+/* code style borrowed from ib_srp.c */
+enum {
+ LS_OPT_ERR = 0,
+ LS_OPT_DIRTY = 1 << 0,
+ LS_OPT_CLEAN = 1 << 1,
+ LS_OPT_INUSE = 1 << 2,
+ LS_OPT_EMPTY = 1 << 3,
+ LS_OPT_ALL = 1 << 4,
+ LS_OPT_DEV = 1 << 5,
+};
+
+static match_table_t ls_opt_tokens = {
+ { LS_OPT_DIRTY, "dirty" },
+ { LS_OPT_CLEAN, "clean" },
+ { LS_OPT_INUSE, "inuse" },
+ { LS_OPT_EMPTY, "empty" },
+ { LS_OPT_ALL, "all" },
+ { LS_OPT_DEV, "dev=%s" },
+ { LS_OPT_ERR, NULL }
+};
+
+static int ls_parse_options(const char *buf, struct session *s)
+{
+ substring_t args[MAX_OPT_ARGS];
+ char *options, *sep_opt;
+ char *p;
+ int token;
+ int ret = 0;
+
+ if (!buf)
+ return 0;
+ options = kstrdup(buf, GFP_KERNEL);
+ if (!options)
+ return -ENOMEM;
+
+ s->ls_options = 0;
+ sep_opt = options;
+ while ((p = strsep(&sep_opt, " ")) != NULL) {
+ if (!*p)
+ continue;
+
+ token = match_token(p, ls_opt_tokens, args);
+
+ switch (token) {
+ case LS_OPT_DIRTY:
+ case LS_OPT_CLEAN:
+ case LS_OPT_INUSE:
+ case LS_OPT_EMPTY:
+ case LS_OPT_ALL:
+ s->ls_options |= token;
+ break;
+ case LS_OPT_DEV:
+ p = match_strdup(args);
+ if (!p) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ if (*p == '/') {
+ struct kstat stat;
+ struct nameidata nd;
+ ret = path_lookup(p, LOOKUP_FOLLOW, &nd);
+ if (!ret)
+ ret = vfs_getattr(nd.path.mnt,
+ nd.path.dentry, &stat);
+ if (!ret)
+ s->ls_dev = stat.rdev;
+ } else
+ s->ls_dev = simple_strtoul(p, NULL, 0);
+ /* printk("%lx %s\n", (long)s->ls_dev, p); */
+ kfree(p);
+ break;
+
+ default:
+ printk(KERN_WARNING "unknown parameter or missing value "
+ "'%s' in ls command\n", p);
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+out:
+ kfree(options);
+ return ret;
+}
+
+/*
+ * Add possible filters here.
+ * No permission check: we cannot verify the path's permission anyway.
+ * We simply require root privilege for accessing /proc/filecache.
+ */
+static int may_show_inode(struct session *s, struct inode *inode)
+{
+ if (!atomic_read(&inode->i_count))
+ return 0;
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+ return 0;
+ if (!inode->i_mapping)
+ return 0;
+
+ if (s->ls_dev && s->ls_dev != inode->i_sb->s_dev)
+ return 0;
+
+ if (s->ls_options & LS_OPT_ALL)
+ return 1;
+
+ if (!(s->ls_options & LS_OPT_EMPTY) && !inode->i_mapping->nrpages)
+ return 0;
+
+ if ((s->ls_options & LS_OPT_DIRTY) && !(inode->i_state & I_DIRTY))
+ return 0;
+
+ if ((s->ls_options & LS_OPT_CLEAN) && (inode->i_state & I_DIRTY))
+ return 0;
+
+ if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+ S_ISLNK(inode->i_mode) || S_ISBLK(inode->i_mode)))
+ return 0;
+
+ return 1;
+}
+
+/*
+ * Full: there are more data following.
+ */
+static int iwin_full(struct session *s)
+{
+ return !s->iwin.cursor ||
+ s->iwin.cursor > s->iwin.origin + s->iwin.size;
+}
+
+static int iwin_push(struct session *s, struct inode *inode)
+{
+ if (!may_show_inode(s, inode))
+ return 0;
+
+ s->iwin.cursor++;
+
+ if (s->iwin.size >= IWIN_SIZE)
+ return 1;
+
+ if (s->iwin.cursor > s->iwin.origin)
+ s->iwin.inodes[s->iwin.size++] = inode;
+ return 0;
+}
+
+/*
+ * Traverse the inode lists in order - newest first.
+ * And fill @s->iwin.inodes with inodes positioned in [@pos, @pos+IWIN_SIZE).
+ */
+static int iwin_fill(struct session *s, unsigned long pos)
+{
+ struct inode *inode;
+ struct super_block *sb;
+
+ s->iwin.origin = pos;
+ s->iwin.cursor = 0;
+ s->iwin.size = 0;
+
+ /*
+ * We have a cursor inode, clean and expected to be unchanged.
+ */
+ if (s->ipos.inode && pos >= s->ipos.pos &&
+ !(s->ipos.state & I_DIRTY) &&
+ s->ipos.state == s->ipos.inode->i_state) {
+ inode = s->ipos.inode;
+ s->iwin.cursor = s->ipos.pos;
+ goto continue_from_saved;
+ }
+
+ if (s->ls_options & LS_OPT_CLEAN)
+ goto clean_inodes;
+
+ spin_lock(&sb_lock);
+ list_for_each_entry(sb, &super_blocks, s_list) {
+ if (s->ls_dev && s->ls_dev != sb->s_dev)
+ continue;
+
+ list_for_each_entry(inode, &sb->s_dirty, i_list) {
+ if (iwin_push(s, inode))
+ goto out_full_unlock;
+ }
+ list_for_each_entry(inode, &sb->s_io, i_list) {
+ if (iwin_push(s, inode))
+ goto out_full_unlock;
+ }
+ }
+ spin_unlock(&sb_lock);
+
+clean_inodes:
+ list_for_each_entry(inode, &inode_in_use, i_list) {
+ if (iwin_push(s, inode))
+ goto out_full;
+continue_from_saved:
+ ;
+ }
+
+ if (s->ls_options & LS_OPT_INUSE)
+ return 0;
+
+ list_for_each_entry(inode, &inode_unused, i_list) {
+ if (iwin_push(s, inode))
+ goto out_full;
+ }
+
+ return 0;
+
+out_full_unlock:
+ spin_unlock(&sb_lock);
+out_full:
+ return 1;
+}
+
+static struct inode *iwin_inode(struct session *s, unsigned long pos)
+{
+ if ((iwin_full(s) && pos >= s->iwin.origin + s->iwin.size)
+ || pos < s->iwin.origin)
+ iwin_fill(s, pos);
+
+ if (pos >= s->iwin.cursor)
+ return NULL;
+
+ s->ipos.pos = pos;
+ s->ipos.inode = s->iwin.inodes[pos - s->iwin.origin];
+ BUG_ON(!s->ipos.inode);
+ return s->ipos.inode;
+}
+
+static void show_inode(struct seq_file *m, struct inode *inode)
+{
+ char state[] = "--"; /* dirty, locked */
+ struct dentry *dentry;
+ loff_t size = i_size_read(inode);
+ unsigned long nrpages;
+ int percent;
+ int refcnt;
+ int shift;
+
+ if (!size)
+ size++;
+
+ if (inode->i_mapping)
+ nrpages = inode->i_mapping->nrpages;
+ else {
+ nrpages = 0;
+ WARN_ON(1);
+ }
+
+ for (shift = 0; (size >> shift) > ULONG_MAX / 128; shift += 12)
+ ;
+ percent = min(100UL, (((100 * nrpages) >> shift) << PAGE_CACHE_SHIFT) /
+ (unsigned long)(size >> shift));
+
+ if (inode->i_state & (I_DIRTY_DATASYNC|I_DIRTY_PAGES))
+ state[0] = 'D';
+ else if (inode->i_state & I_DIRTY_SYNC)
+ state[0] = 'd';
+
+ if (inode->i_state & I_LOCK)
+ state[1] = 'L';
+
+ refcnt = 0;
+ list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+ refcnt += atomic_read(&dentry->d_count);
+ }
+
+ seq_printf(m, "%10lu %10llu %8lu %7d ",
+ inode->i_ino,
+ DIV_ROUND_UP(size, 1024),
+ nrpages << (PAGE_CACHE_SHIFT - 10),
+ percent);
+
+ seq_printf(m, "%6d %5s %9lu ",
+ refcnt,
+ state,
+ (jiffies - inode->dirtied_when) / HZ);
+
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+ seq_printf(m, "%8u %-16s",
+ inode->i_access_count,
+ inode->i_comm);
+#endif
+
+ seq_printf(m, "%02x:%02x(%s)\t",
+ MAJOR(inode->i_sb->s_dev),
+ MINOR(inode->i_sb->s_dev),
+ inode->i_sb->s_id);
+
+ if (list_empty(&inode->i_dentry)) {
+ if (!atomic_read(&inode->i_count))
+ seq_puts(m, "(noname)\n");
+ else
+ seq_printf(m, "(%02x:%02x)\n",
+ imajor(inode), iminor(inode));
+ } else {
+ struct path path = {
+ .mnt = NULL,
+ .dentry = list_entry(inode->i_dentry.next,
+ struct dentry, d_alias)
+ };
+
+ seq_path(m, &path, " \t\n\\");
+ seq_putc(m, '\n');
+ }
+}
+
+static int ii_show(struct seq_file *m, void *v)
+{
+ unsigned long index = *(loff_t *) v;
+ struct session *s = m->private;
+ struct inode *inode;
+
+ if (index == 0) {
+ seq_puts(m, "# filecache " FILECACHE_VERSION "\n");
+ seq_puts(m, "# ino size cached cached% "
+ "refcnt state age "
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+ "accessed process "
+#endif
+ "dev\t\tfile\n");
+ }
+
+ inode = iwin_inode(s, index);
+ show_inode(m, inode);
+
+ return 0;
+}
+
+static void *ii_start(struct seq_file *m, loff_t *pos)
+{
+ struct session *s = m->private;
+
+ s->iwin.size = 0;
+ s->iwin.inodes = (struct inode **)
+ __get_free_pages( GFP_KERNEL, IWIN_PAGE_ORDER);
+ if (!s->iwin.inodes)
+ return NULL;
+
+ spin_lock(&inode_lock);
+
+ return iwin_inode(s, *pos) ? pos : NULL;
+}
+
+static void *ii_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct session *s = m->private;
+
+ (*pos)++;
+ return iwin_inode(s, *pos) ? pos : NULL;
+}
+
+static void ii_stop(struct seq_file *m, void *v)
+{
+ struct session *s = m->private;
+ struct inode *inode = s->ipos.inode;
+
+ if (!s->iwin.inodes)
+ return;
+
+ if (inode) {
+ __iget(inode);
+ s->ipos.state = inode->i_state;
+ }
+ spin_unlock(&inode_lock);
+
+ free_pages((unsigned long) s->iwin.inodes, IWIN_PAGE_ORDER);
+ if (s->ipos.pinned_inode)
+ iput(s->ipos.pinned_inode);
+ s->ipos.pinned_inode = inode;
+}
+
+/*
+ * Listing of cached page ranges of a file.
+ *
+ * Usage:
+ * echo 'file name' > /proc/filecache
+ * cat /proc/filecache
+ */
+
+unsigned long page_mask;
+#define PG_MMAP PG_lru /* reuse any non-relevant flag */
+#define PG_BUFFER PG_swapcache /* ditto */
+#define PG_DIRTY PG_error /* ditto */
+#define PG_WRITEBACK PG_buddy /* ditto */
+
+/*
+ * Page state names, prefixed by their abbreviations.
+ */
+struct {
+ unsigned long mask;
+ const char *name;
+ int faked;
+} page_flag [] = {
+ {1 << PG_referenced, "R:referenced", 0},
+ {1 << PG_active, "A:active", 0},
+ {1 << PG_MMAP, "M:mmap", 1},
+
+ {1 << PG_uptodate, "U:uptodate", 0},
+ {1 << PG_dirty, "D:dirty", 0},
+ {1 << PG_writeback, "W:writeback", 0},
+ {1 << PG_reclaim, "X:readahead", 0},
+
+ {1 << PG_private, "P:private", 0},
+ {1 << PG_owner_priv_1, "O:owner", 0},
+
+ {1 << PG_BUFFER, "b:buffer", 1},
+ {1 << PG_DIRTY, "d:dirty", 1},
+ {1 << PG_WRITEBACK, "w:writeback", 1},
+};
+
+static unsigned long page_flags(struct page* page)
+{
+ unsigned long flags;
+ struct address_space *mapping = page_mapping(page);
+
+ flags = page->flags & page_mask;
+
+ if (page_mapped(page))
+ flags |= (1 << PG_MMAP);
+
+ if (page_has_buffers(page))
+ flags |= (1 << PG_BUFFER);
+
+ if (mapping) {
+ if (radix_tree_tag_get(&mapping->page_tree,
+ page_index(page),
+ PAGECACHE_TAG_WRITEBACK))
+ flags |= (1 << PG_WRITEBACK);
+
+ if (radix_tree_tag_get(&mapping->page_tree,
+ page_index(page),
+ PAGECACHE_TAG_DIRTY))
+ flags |= (1 << PG_DIRTY);
+ }
+
+ return flags;
+}
+
+static int pages_similiar(struct page* page0, struct page* page)
+{
+ if (page_count(page0) != page_count(page))
+ return 0;
+
+ if (page_flags(page0) != page_flags(page))
+ return 0;
+
+ return 1;
+}
+
+static void show_range(struct seq_file *m, struct page* page, unsigned long len)
+{
+ int i;
+ unsigned long flags;
+
+ if (!m || !page)
+ return;
+
+ seq_printf(m, "%lu\t%lu\t", page->index, len);
+
+ flags = page_flags(page);
+ for (i = 0; i < ARRAY_SIZE(page_flag); i++)
+ seq_putc(m, (flags & page_flag[i].mask) ?
+ page_flag[i].name[0] : '_');
+
+ seq_printf(m, "\t%d\n", page_count(page));
+}
+
+#define BATCH_LINES 100
+static pgoff_t show_file_cache(struct seq_file *m,
+ struct address_space *mapping, pgoff_t start)
+{
+ int i;
+ int lines = 0;
+ pgoff_t len = 0;
+ struct pagevec pvec;
+ struct page *page;
+ struct page *page0 = NULL;
+
+ for (;;) {
+ pagevec_init(&pvec, 0);
+ pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
+ (void **)pvec.pages, start + len, PAGEVEC_SIZE);
+
+ if (pvec.nr == 0) {
+ show_range(m, page0, len);
+ start = ULONG_MAX;
+ goto out;
+ }
+
+ if (!page0)
+ page0 = pvec.pages[0];
+
+ for (i = 0; i < pvec.nr; i++) {
+ page = pvec.pages[i];
+
+ if (page->index == start + len &&
+ pages_similiar(page0, page))
+ len++;
+ else {
+ show_range(m, page0, len);
+ page0 = page;
+ start = page->index;
+ len = 1;
+ if (++lines > BATCH_LINES)
+ goto out;
+ }
+ }
+ }
+
+out:
+ return start;
+}
+
+static int pg_show(struct seq_file *m, void *v)
+{
+ struct session *s = m->private;
+ struct file *file = s->query_file;
+ pgoff_t offset;
+
+ if (!file)
+ return ii_show(m, v);
+
+ offset = *(loff_t *) v;
+
+ if (!offset) { /* print header */
+ int i;
+
+ seq_puts(m, "# file ");
+ seq_path(m, &file->f_path, " \t\n\\");
+
+ seq_puts(m, "\n# flags");
+ for (i = 0; i < ARRAY_SIZE(page_flag); i++)
+ seq_printf(m, " %s", page_flag[i].name);
+
+ seq_puts(m, "\n# idx\tlen\tstate\t\trefcnt\n");
+ }
+
+ s->start_offset = offset;
+ s->next_offset = show_file_cache(m, file->f_mapping, offset);
+
+ return 0;
+}
+
+static void *file_pos(struct file *file, loff_t *pos)
+{
+ loff_t size = i_size_read(file->f_mapping->host);
+ pgoff_t end = DIV_ROUND_UP(size, PAGE_CACHE_SIZE);
+ pgoff_t offset = *pos;
+
+ return offset < end ? pos : NULL;
+}
+
+static void *pg_start(struct seq_file *m, loff_t *pos)
+{
+ struct session *s = m->private;
+ struct file *file = s->query_file;
+ pgoff_t offset = *pos;
+
+ if (!file)
+ return ii_start(m, pos);
+
+ rcu_read_lock();
+
+ if (offset - s->start_offset == 1)
+ *pos = s->next_offset;
+ return file_pos(file, pos);
+}
+
+static void *pg_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct session *s = m->private;
+ struct file *file = s->query_file;
+
+ if (!file)
+ return ii_next(m, v, pos);
+
+ *pos = s->next_offset;
+ return file_pos(file, pos);
+}
+
+static void pg_stop(struct seq_file *m, void *v)
+{
+ struct session *s = m->private;
+ struct file *file = s->query_file;
+
+ if (!file)
+ return ii_stop(m, v);
+
+ rcu_read_unlock();
+}
+
+struct seq_operations seq_filecache_op = {
+ .start = pg_start,
+ .next = pg_next,
+ .stop = pg_stop,
+ .show = pg_show,
+};
+
+/*
+ * Implement the manual drop-all-pagecache function
+ */
+
+#define MAX_INODES (PAGE_SIZE / sizeof(struct inode *))
+static int drop_pagecache(void)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct inode *inode;
+ struct inode **inodes;
+ unsigned long i, j, k;
+ int err = 0;
+
+ inodes = (struct inode **)__get_free_pages(GFP_KERNEL, IWIN_PAGE_ORDER);
+ if (!inodes)
+ return -ENOMEM;
+
+ for (i = 0; (head = get_inode_hash_budget(i)); i++) {
+ if (hlist_empty(head))
+ continue;
+
+ j = 0;
+ cond_resched();
+
+ /*
+ * Grab some inodes.
+ */
+ spin_lock(&inode_lock);
+ hlist_for_each (node, head) {
+ inode = hlist_entry(node, struct inode, i_hash);
+ if (!atomic_read(&inode->i_count))
+ continue;
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+ continue;
+ if (!inode->i_mapping || !inode->i_mapping->nrpages)
+ continue;
+ __iget(inode);
+ inodes[j++] = inode;
+ if (j >= MAX_INODES)
+ break;
+ }
+ spin_unlock(&inode_lock);
+
+ /*
+ * Free clean pages.
+ */
+ for (k = 0; k < j; k++) {
+ inode = inodes[k];
+ invalidate_mapping_pages(inode->i_mapping, 0, ~1);
+ iput(inode);
+ }
+
+ /*
+ * Simply ignore the remaining inodes.
+ */
+ if (j >= MAX_INODES && !err) {
+ printk(KERN_WARNING
+ "Too many collisions in inode hash table.\n"
+ "Please boot with a larger ihash_entries=XXX.\n");
+ err = -EAGAIN;
+ }
+ }
+
+ free_pages((unsigned long) inodes, IWIN_PAGE_ORDER);
+ return err;
+}
+
+static void drop_slabcache(void)
+{
+ int nr_objects;
+
+ do {
+ nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+ } while (nr_objects > 10);
+}
+
+/*
+ * Proc file operations.
+ */
+
+static int filecache_open(struct inode *inode, struct file *proc_file)
+{
+ struct seq_file *m;
+ struct session *s;
+ unsigned size;
+ char *buf = NULL;
+ int ret;
+
+ if (!try_module_get(THIS_MODULE))
+ return -ENOENT;
+
+ s = session_create();
+ if (IS_ERR(s)) {
+ ret = PTR_ERR(s);
+ s = NULL; /* avoid kfree() of an ERR_PTR in the error path below */
+ goto out;
+ }
+ set_session(proc_file, s);
+
+ size = SBUF_SIZE;
+ buf = kmalloc(size, GFP_KERNEL);
+ if (!buf) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = seq_open(proc_file, &seq_filecache_op);
+ if (!ret) {
+ m = proc_file->private_data;
+ m->private = s;
+ m->buf = buf;
+ m->size = size;
+ }
+
+out:
+ if (ret) {
+ kfree(s);
+ kfree(buf);
+ module_put(THIS_MODULE);
+ }
+ return ret;
+}
+
+static int filecache_release(struct inode *inode, struct file *proc_file)
+{
+ struct session *s = get_session(proc_file);
+ int ret;
+
+ session_release(s);
+ ret = seq_release(inode, proc_file);
+ module_put(THIS_MODULE);
+ return ret;
+}
+
+ssize_t filecache_write(struct file *proc_file, const char __user * buffer,
+ size_t count, loff_t *ppos)
+{
+ struct session *s;
+ char *name;
+ int err = 0;
+
+ if (count >= PATH_MAX + 5)
+ return -ENAMETOOLONG;
+
+ name = kmalloc(count+1, GFP_KERNEL);
+ if (!name)
+ return -ENOMEM;
+
+ if (copy_from_user(name, buffer, count)) {
+ err = -EFAULT;
+ goto out;
+ }
+
+ /* strip the optional newline */
+ if (count && name[count-1] == '\n')
+ name[count-1] = '\0';
+ else
+ name[count] = '\0';
+
+ s = get_session(proc_file);
+ if (!strcmp(name, "set private")) {
+ s->private_session = 1;
+ goto out;
+ }
+
+ if (!strncmp(name, "cat ", 4)) {
+ err = session_update_file(s, name+4);
+ goto out;
+ }
+
+ if (!strncmp(name, "ls", 2)) {
+ err = session_update_file(s, NULL);
+ if (!err)
+ err = ls_parse_options(name+2, s);
+ if (!err && !s->private_session) {
+ global_session.ls_dev = s->ls_dev;
+ global_session.ls_options = s->ls_options;
+ }
+ goto out;
+ }
+
+ if (!strncmp(name, "drop pagecache", 14)) {
+ err = drop_pagecache();
+ goto out;
+ }
+
+ if (!strncmp(name, "drop slabcache", 14)) {
+ drop_slabcache();
+ goto out;
+ }
+
+ /* err = -EINVAL; */
+ err = session_update_file(s, name);
+
+out:
+ kfree(name);
+
+ return err ? err : count;
+}
+
+static struct file_operations proc_filecache_fops = {
+ .owner = THIS_MODULE,
+ .open = filecache_open,
+ .release = filecache_release,
+ .write = filecache_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+};
+
+
+static __init int filecache_init(void)
+{
+ int i;
+ struct proc_dir_entry *entry;
+
+ entry = create_proc_entry("filecache", 0600, NULL);
+ if (entry)
+ entry->proc_fops = &proc_filecache_fops;
+
+ for (page_mask = i = 0; i < ARRAY_SIZE(page_flag); i++)
+ if (!page_flag[i].faked)
+ page_mask |= page_flag[i].mask;
+
+ return 0;
+}
+
+static void filecache_exit(void)
+{
+ remove_proc_entry("filecache", NULL);
+ if (global_session.query_file)
+ fput(global_session.query_file);
+}
+
+MODULE_AUTHOR("Fengguang Wu <wfg@mail.ustc.edu.cn>");
+MODULE_LICENSE("GPL");
+
+module_init(filecache_init);
+module_exit(filecache_exit);
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -775,6 +775,11 @@ struct inode {
void *i_security;
#endif
void *i_private; /* fs or device private pointer */
+
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+ unsigned int i_access_count; /* opened how many times? */
+ char i_comm[16]; /* opened first by which app? */
+#endif
};
/*
@@ -860,6 +865,13 @@ static inline unsigned imajor(const stru
return MAJOR(inode->i_rdev);
}
+static inline void inode_accessed(struct inode *inode)
+{
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+ inode->i_access_count++;
+#endif
+}
+
extern struct block_device *I_BDEV(struct inode *inode);
struct fown_struct {
@@ -2171,6 +2183,7 @@ extern void remove_inode_hash(struct ino
static inline void insert_inode_hash(struct inode *inode) {
__insert_inode_hash(inode, inode->i_ino);
}
+struct hlist_head * get_inode_hash_budget(unsigned long index);
extern struct file * get_empty_filp(void);
extern void file_move(struct file *f, struct list_head *list);
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -842,6 +842,7 @@ static struct file *__dentry_open(struct
goto cleanup_all;
}
+ inode_accessed(inode);
f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
--- linux-2.6.orig/fs/Kconfig
+++ linux-2.6/fs/Kconfig
@@ -265,4 +265,34 @@ endif
source "fs/nls/Kconfig"
source "fs/dlm/Kconfig"
+config PROC_FILECACHE
+ tristate "/proc/filecache support"
+ default m
+ depends on PROC_FS
+ help
+ This option creates a file /proc/filecache which enables one to
+ query/drop the cached files in memory.
+
+ A quick start guide:
+
+ # echo 'ls' > /proc/filecache
+ # head /proc/filecache
+
+ # echo 'cat /bin/bash' > /proc/filecache
+ # head /proc/filecache
+
+ # echo 'drop pagecache' > /proc/filecache
+ # echo 'drop slabcache' > /proc/filecache
+
+ For more details, please check Documentation/filesystems/proc.txt.
+
+ It can be a handy tool for sysadmins and desktop users.
+
+config PROC_FILECACHE_EXTRAS
+ bool "track extra states"
+ default y
+ depends on PROC_FILECACHE
+ help
+ Track extra states that cost a little more time/space.
+
endmenu
--- linux-2.6.orig/fs/proc/Makefile
+++ linux-2.6/fs/proc/Makefile
@@ -2,7 +2,8 @@
# Makefile for the Linux proc filesystem routines.
#
-obj-$(CONFIG_PROC_FS) += proc.o
+obj-$(CONFIG_PROC_FS) += proc.o
+obj-$(CONFIG_PROC_FILECACHE) += filecache.o
proc-y := nommu.o task_nommu.o
proc-$(CONFIG_MMU) := mmu.o task_mmu.o
* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
2009-05-17 13:36 ` Wu Fengguang
@ 2009-05-17 13:55 ` Frederic Weisbecker
2009-05-18 11:44 ` KOSAKI Motohiro
1 sibling, 0 replies; 137+ messages in thread
From: Frederic Weisbecker @ 2009-05-17 13:55 UTC (permalink / raw)
To: Wu Fengguang
Cc: Ingo Molnar, Li Zefan, Tom Zanussi, KOSAKI Motohiro,
Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
On Sun, May 17, 2009 at 09:36:59PM +0800, Wu Fengguang wrote:
> On Tue, May 12, 2009 at 09:01:12PM +0800, Frederic Weisbecker wrote:
> > On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> > > On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> > >
> > > There are two possible challenges for the conversion:
> > >
> > > - One trick it does is to select different lists to traverse on
> > > different filter options. Will this be possible in the object
> > > tracing framework?
> >
> > Yeah, I guess.
>
> Great.
>
> >
> > > - The file name lookup(last field) is the performance killer. Is it
> > > possible to skip the file name lookup when the filter failed on the
> > > leading fields?
> >
> > objects collection lays on trace events where filters basically ignore
> > a whole entry in case of non-matching. Not sure if we can easily only
> > ignore one field.
> >
> > But I guess we can do something about the performances...
>
> OK, but it's not as important as the previous requirement, so it could
> be the last thing to work on :)
>
> > Could you send us the (sob'ed) patch you made which implements this.
> > I could try to adapt it to object collection.
>
> Attached for your reference. Be aware that I still have plans to
> change it in non-trivial ways, and there is ongoing work by Nick (on
> inode_lock) and Jens (on s_dirty) that can create merge conflicts.
> So basically it is not the right time to do the adaptation.
Ah ok, so I will wait a bit :-)
> However we can still do something to polish up the page object
> collection under /debug/tracing/objects/mm/pages/. For example,
> the timestamps and function name could be removed from the following
> list :)
>
> # tracer: nop
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> <...>-3743 [001] 3035.649769: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176403: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176407: dump_pages: pfn=2 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176408: dump_pages: pfn=3 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176409: dump_pages: pfn=4 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176409: dump_pages: pfn=5 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176410: dump_pages: pfn=6 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176410: dump_pages: pfn=7 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176411: dump_pages: pfn=8 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176411: dump_pages: pfn=9 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176412: dump_pages: pfn=10 flags=400 count=1 mapcount=0 index=0
echo nocontext-info > /debug/tracing/trace_options :-)
But you'll have only the function and the page specifics. It's not really the
function but more specifically the name of the event. It's useful for
distinguishing multiple events in a trace.
Hmm, maybe it's not that useful in an object dump...
Thanks.
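For reference, the verbose dump_pages lines shown in the sample above are easy to reduce to bare pfn/flags pairs with a small awk filter. This is only an illustrative sketch assuming the `pfn=N flags=X` field layout of that sample output, not part of the patches under discussion:

```shell
# Reduce "dump_pages" trace lines to "pfn flags" pairs.
# Assumes the "pfn=N flags=X" field layout of the sample output above.
awk '/dump_pages:/ {
	for (i = 1; i <= NF; i++) {
		if ($i ~ /^pfn=/)   pfn   = substr($i, 5)
		if ($i ~ /^flags=/) flags = substr($i, 7)
	}
	print pfn, flags
}' <<'EOF'
 <...>-3743  [001]  3044.176403: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0
 <...>-3743  [001]  3044.176407: dump_pages: pfn=2 flags=400 count=1 mapcount=0 index=0
EOF
```

This prints one `pfn flags` pair per trace line (here `1 400` and `2 400`), which is easier to diff or aggregate than the full trace format.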
> Thanks,
> Fengguang
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -27,6 +27,7 @@ extern unsigned long max_mapnr;
> extern unsigned long num_physpages;
> extern void * high_memory;
> extern int page_cluster;
> +extern char * const zone_names[];
>
> #ifdef CONFIG_SYSCTL
> extern int sysctl_legacy_va_layout;
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -104,7 +104,7 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
>
> EXPORT_SYMBOL(totalram_pages);
>
> -static char * const zone_names[MAX_NR_ZONES] = {
> +char * const zone_names[MAX_NR_ZONES] = {
> #ifdef CONFIG_ZONE_DMA
> "DMA",
> #endif
> --- linux-2.6.orig/fs/dcache.c
> +++ linux-2.6/fs/dcache.c
> @@ -1925,7 +1925,10 @@ char *__d_path(const struct path *path,
>
> if (dentry == root->dentry && vfsmnt == root->mnt)
> break;
> - if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
> + if (unlikely(!vfsmnt)) {
> + if (IS_ROOT(dentry))
> + break;
> + } else if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
> /* Global root? */
> if (vfsmnt->mnt_parent == vfsmnt) {
> goto global_root;
> --- linux-2.6.orig/lib/radix-tree.c
> +++ linux-2.6/lib/radix-tree.c
> @@ -564,7 +564,6 @@ out:
> }
> EXPORT_SYMBOL(radix_tree_tag_clear);
>
> -#ifndef __KERNEL__ /* Only the test harness uses this at present */
> /**
> * radix_tree_tag_get - get a tag on a radix tree node
> * @root: radix tree root
> @@ -627,7 +626,6 @@ int radix_tree_tag_get(struct radix_tree
> }
> }
> EXPORT_SYMBOL(radix_tree_tag_get);
> -#endif
>
> /**
> * radix_tree_next_hole - find the next hole (not-present entry)
> --- linux-2.6.orig/fs/inode.c
> +++ linux-2.6/fs/inode.c
> @@ -84,6 +84,10 @@ static struct hlist_head *inode_hashtabl
> */
> DEFINE_SPINLOCK(inode_lock);
>
> +EXPORT_SYMBOL(inode_in_use);
> +EXPORT_SYMBOL(inode_unused);
> +EXPORT_SYMBOL(inode_lock);
> +
> /*
> * iprune_mutex provides exclusion between the kswapd or try_to_free_pages
> * icache shrinking path, and the umount path. Without this exclusion,
> @@ -110,6 +114,13 @@ static void wake_up_inode(struct inode *
> wake_up_bit(&inode->i_state, __I_LOCK);
> }
>
> +static inline void inode_created_by(struct inode *inode, struct task_struct *task)
> +{
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> + memcpy(inode->i_comm, task->comm, sizeof(task->comm));
> +#endif
> +}
> +
> /**
> + * inode_init_always - perform inode structure initialisation
> * @sb: superblock inode belongs to
> @@ -147,7 +158,7 @@ struct inode *inode_init_always(struct s
> inode->i_bdev = NULL;
> inode->i_cdev = NULL;
> inode->i_rdev = 0;
> - inode->dirtied_when = 0;
> + inode->dirtied_when = jiffies;
>
> if (security_inode_alloc(inode))
> goto out_free_inode;
> @@ -188,6 +199,7 @@ struct inode *inode_init_always(struct s
> }
> inode->i_private = NULL;
> inode->i_mapping = mapping;
> + inode_created_by(inode, current);
>
> return inode;
>
> @@ -276,6 +288,8 @@ void __iget(struct inode *inode)
> inodes_stat.nr_unused--;
> }
>
> +EXPORT_SYMBOL(__iget);
> +
> /**
> * clear_inode - clear an inode
> * @inode: inode to clear
> @@ -1459,6 +1473,16 @@ static void __wait_on_freeing_inode(stru
> spin_lock(&inode_lock);
> }
>
> +
> +struct hlist_head * get_inode_hash_budget(unsigned long index)
> +{
> + if (index >= (1 << i_hash_shift))
> + return NULL;
> +
> + return inode_hashtable + index;
> +}
> +EXPORT_SYMBOL_GPL(get_inode_hash_budget);
> +
> static __initdata unsigned long ihash_entries;
> static int __init set_ihash_entries(char *str)
> {
> --- linux-2.6.orig/fs/super.c
> +++ linux-2.6/fs/super.c
> @@ -46,6 +46,9 @@
> LIST_HEAD(super_blocks);
> DEFINE_SPINLOCK(sb_lock);
>
> +EXPORT_SYMBOL(super_blocks);
> +EXPORT_SYMBOL(sb_lock);
> +
> /**
> * alloc_super - create new superblock
> * @type: filesystem type superblock should belong to
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -262,6 +262,7 @@ unsigned long shrink_slab(unsigned long
> up_read(&shrinker_rwsem);
> return ret;
> }
> +EXPORT_SYMBOL(shrink_slab);
>
> /* Called without lock on whether page is mapped, so answer is unstable */
> static inline int page_mapping_inuse(struct page *page)
> --- linux-2.6.orig/mm/swap_state.c
> +++ linux-2.6/mm/swap_state.c
> @@ -45,6 +45,7 @@ struct address_space swapper_space = {
> .i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
> .backing_dev_info = &swap_backing_dev_info,
> };
> +EXPORT_SYMBOL_GPL(swapper_space);
>
> #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
>
> --- linux-2.6.orig/Documentation/filesystems/proc.txt
> +++ linux-2.6/Documentation/filesystems/proc.txt
> @@ -260,6 +260,7 @@ Table 1-4: Kernel info in /proc
> driver Various drivers grouped here, currently rtc (2.4)
> execdomains Execdomains, related to security (2.4)
> fb Frame Buffer devices (2.4)
> + filecache Query/drop in-memory file cache
> fs File system parameters, currently nfs/exports (2.4)
> ide Directory containing info about the IDE subsystem
> interrupts Interrupt usage
> @@ -450,6 +451,88 @@ varies by architecture and compile optio
>
> > cat /proc/meminfo
>
> +..............................................................................
> +
> +filecache:
> +
> +Provides access to the in-memory file cache.
> +
> +To list an index of all cached files:
> +
> + echo ls > /proc/filecache
> + cat /proc/filecache
> +
> +The output looks like:
> +
> + # filecache 1.0
> + # ino size cached cached% state refcnt dev file
> + 1026334 91 92 100 -- 66 03:02(hda2) /lib/ld-2.3.6.so
> + 233608 1242 972 78 -- 66 03:02(hda2) /lib/tls/libc-2.3.6.so
> + 65203 651 476 73 -- 1 03:02(hda2) /bin/bash
> + 1026445 261 160 61 -- 10 03:02(hda2) /lib/libncurses.so.5.5
> + 235427 10 12 100 -- 44 03:02(hda2) /lib/tls/libdl-2.3.6.so
> +
> +FIELD INTRO
> +---------------------------------------------------------------------------
> +ino inode number
> +size inode size in KB
> +cached cached size in KB
> +cached% percent of file data cached
> +state1 '-' clean; 'd' metadata dirty; 'D' data dirty
> +state2 '-' unlocked; 'L' locked, normally indicates file being written out
> +refcnt file reference count, it's an in-kernel one, not exactly open count
> +dev major:minor numbers in hex, followed by a descriptive device name
> +file file path _inside_ the filesystem. There are several special names:
> + '(noname)': the file name is not available
> + '(03:02)': the file is a block device file of major:minor
> + '...(deleted)': the named file has been deleted from the disk
> +
> +To list the cached pages of a particular file:
> +
> + echo /bin/bash > /proc/filecache
> + cat /proc/filecache
> +
> + # file /bin/bash
> + # flags R:referenced A:active U:uptodate D:dirty W:writeback M:mmap
> + # idx len state refcnt
> + 0 36 RAU__M 3
> + 36 1 RAU__M 2
> + 37 8 RAU__M 3
> + 45 2 RAU___ 1
> + 47 6 RAU__M 3
> + 53 3 RAU__M 2
> + 56 2 RAU__M 3
> +
> +FIELD INTRO
> +----------------------------------------------------------------------------
> +idx page index
> +len number of pages which are cached and share the same state
> +state page state of the flags listed in line two
> +refcnt page reference count
> +
> +Careful users may notice that the file name to be queried is remembered between
> +commands. Internally, the module has a global variable to store the file name
> +parameter, so that it can be inherited by newly opened /proc/filecache files.
> +However, this can lead to interference between multiple queriers. The solution
> +is to obey a rule: only root may interactively change the file name parameter;
> +normal users should access the interface through scripts, following the code
> +example below:
> +
> + filecache = open("/proc/filecache", "rw");
> + # avoid polluting the global parameter filename
> + filecache.write("set private");
> +
> +To instruct the kernel to drop clean caches, dentries and inodes from memory,
> +causing that memory to become free:
> +
> + # drop clean file data cache (i.e. file backed pagecache)
> + echo drop pagecache > /proc/filecache
> +
> + # drop clean file metadata cache (i.e. dentries and inodes)
> + echo drop slabcache > /proc/filecache
> +
> +Note that the drop commands are non-destructive operations. Since dirty
> +objects are not freeable, the user should run `sync' first.
>
> MemTotal: 16344972 kB
> MemFree: 13634064 kB
> --- /dev/null
> +++ linux-2.6/fs/proc/filecache.c
> @@ -0,0 +1,1045 @@
> +/*
> + * fs/proc/filecache.c
> + *
> + * Copyright (C) 2006, 2007 Fengguang Wu <wfg@mail.ustc.edu.cn>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/radix-tree.h>
> +#include <linux/page-flags.h>
> +#include <linux/pagevec.h>
> +#include <linux/pagemap.h>
> +#include <linux/vmalloc.h>
> +#include <linux/writeback.h>
> +#include <linux/buffer_head.h>
> +#include <linux/parser.h>
> +#include <linux/proc_fs.h>
> +#include <linux/seq_file.h>
> +#include <linux/file.h>
> +#include <linux/namei.h>
> +#include <linux/module.h>
> +#include <asm/uaccess.h>
> +
> +/*
> + * Increase minor version when new columns are added;
> + * Increase major version when existing columns are changed.
> + */
> +#define FILECACHE_VERSION "1.0"
> +
> +/* Internal buffer sizes. The larger, the more efficient. */
> +#define SBUF_SIZE (128<<10)
> +#define IWIN_PAGE_ORDER 3
> +#define IWIN_SIZE ((PAGE_SIZE<<IWIN_PAGE_ORDER) / sizeof(struct inode *))
> +
> +/*
> + * Session management.
> + *
> + * Each opened /proc/filecache file is associated with a session object.
> + * Also there is a global_session that maintains status across open()/close()
> + * (i.e. the lifetime of an opened file), so that a casual user can query the
> + * filecache via _multiple_ simple shell commands like
> + * 'echo cat /bin/bash > /proc/filecache; cat /proc/filecache'.
> + *
> + * session.query_file is the file whose cache info is to be queried.
> + * Its value determines what we get on read():
> + * - NULL: ii_*() called to show the inode index
> + * - filp: pg_*() called to show the page groups of a filp
> + *
> + * session.query_file is
> + * - cloned from global_session.query_file on open();
> + * - updated on write("cat filename");
> + * note that the new file will also be saved in global_session.query_file if
> + * session.private_session is false.
> + */
> +
> +struct session {
> + /* options */
> + int private_session;
> + unsigned long ls_options;
> + dev_t ls_dev;
> +
> + /* parameters */
> + struct file *query_file;
> +
> + /* seqfile pos */
> + pgoff_t start_offset;
> + pgoff_t next_offset;
> +
> + /* inode at last pos */
> + struct {
> + unsigned long pos;
> + unsigned long state;
> + struct inode *inode;
> + struct inode *pinned_inode;
> + } ipos;
> +
> + /* inode window */
> + struct {
> + unsigned long cursor;
> + unsigned long origin;
> + unsigned long size;
> + struct inode **inodes;
> + } iwin;
> +};
> +
> +static struct session global_session;
> +
> +/*
> + * Session address is stored in proc_file->f_ra.start:
> + * we assume that there will be no readahead for proc_file.
> + */
> +static struct session *get_session(struct file *proc_file)
> +{
> + return (struct session *)proc_file->f_ra.start;
> +}
> +
> +static void set_session(struct file *proc_file, struct session *s)
> +{
> + BUG_ON(proc_file->f_ra.start);
> + proc_file->f_ra.start = (unsigned long)s;
> +}
> +
> +static void update_global_file(struct session *s)
> +{
> + if (s->private_session)
> + return;
> +
> + if (global_session.query_file)
> + fput(global_session.query_file);
> +
> + global_session.query_file = s->query_file;
> +
> + if (global_session.query_file)
> + get_file(global_session.query_file);
> +}
> +
> +/*
> + * Cases of the name:
> + * 1) NULL (new session)
> + * s->query_file = global_session.query_file = 0;
> + * 2) "" (ls/la)
> + * s->query_file = global_session.query_file;
> + * 3) a regular file name (cat newfile)
> + * s->query_file = global_session.query_file = newfile;
> + */
> +static int session_update_file(struct session *s, char *name)
> +{
> + static DEFINE_MUTEX(mutex); /* protects global_session.query_file */
> + int err = 0;
> +
> + mutex_lock(&mutex);
> +
> + /*
> + * We are to quit, or to list the cached files.
> + * Reset *.query_file.
> + */
> + if (!name) {
> + if (s->query_file) {
> + fput(s->query_file);
> + s->query_file = NULL;
> + }
> + update_global_file(s);
> + goto out;
> + }
> +
> + /*
> + * This is a new session.
> + * Inherit options/parameters from global ones.
> + */
> + if (name[0] == '\0') {
> + *s = global_session;
> + if (s->query_file)
> + get_file(s->query_file);
> + goto out;
> + }
> +
> + /*
> + * Open the named file.
> + */
> + if (s->query_file)
> + fput(s->query_file);
> + s->query_file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
> + if (IS_ERR(s->query_file)) {
> + err = PTR_ERR(s->query_file);
> + s->query_file = NULL;
> + } else
> + update_global_file(s);
> +
> +out:
> + mutex_unlock(&mutex);
> +
> + return err;
> +}
> +
> +static struct session *session_create(void)
> +{
> + struct session *s;
> + int err = 0;
> +
> + s = kmalloc(sizeof(*s), GFP_KERNEL);
> + if (s)
> + err = session_update_file(s, "");
> + else
> + err = -ENOMEM;
> +
> + return err ? ERR_PTR(err) : s;
> +}
> +
> +static void session_release(struct session *s)
> +{
> + if (s->ipos.pinned_inode)
> + iput(s->ipos.pinned_inode);
> + if (s->query_file)
> + fput(s->query_file);
> + kfree(s);
> +}
> +
> +
> +/*
> + * Listing of cached files.
> + *
> + * Usage:
> + * echo > /proc/filecache # enter listing mode
> + * cat /proc/filecache # get the file listing
> + */
> +
> +/* code style borrowed from ib_srp.c */
> +enum {
> + LS_OPT_ERR = 0,
> + LS_OPT_DIRTY = 1 << 0,
> + LS_OPT_CLEAN = 1 << 1,
> + LS_OPT_INUSE = 1 << 2,
> + LS_OPT_EMPTY = 1 << 3,
> + LS_OPT_ALL = 1 << 4,
> + LS_OPT_DEV = 1 << 5,
> +};
> +
> +static match_table_t ls_opt_tokens = {
> + { LS_OPT_DIRTY, "dirty" },
> + { LS_OPT_CLEAN, "clean" },
> + { LS_OPT_INUSE, "inuse" },
> + { LS_OPT_EMPTY, "empty" },
> + { LS_OPT_ALL, "all" },
> + { LS_OPT_DEV, "dev=%s" },
> + { LS_OPT_ERR, NULL }
> +};
> +
> +static int ls_parse_options(const char *buf, struct session *s)
> +{
> + substring_t args[MAX_OPT_ARGS];
> + char *options, *sep_opt;
> + char *p;
> + int token;
> + int ret = 0;
> +
> + if (!buf)
> + return 0;
> + options = kstrdup(buf, GFP_KERNEL);
> + if (!options)
> + return -ENOMEM;
> +
> + s->ls_options = 0;
> + sep_opt = options;
> + while ((p = strsep(&sep_opt, " ")) != NULL) {
> + if (!*p)
> + continue;
> +
> + token = match_token(p, ls_opt_tokens, args);
> +
> + switch (token) {
> + case LS_OPT_DIRTY:
> + case LS_OPT_CLEAN:
> + case LS_OPT_INUSE:
> + case LS_OPT_EMPTY:
> + case LS_OPT_ALL:
> + s->ls_options |= token;
> + break;
> + case LS_OPT_DEV:
> + p = match_strdup(args);
> + if (!p) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + if (*p == '/') {
> + struct kstat stat;
> + struct nameidata nd;
> + ret = path_lookup(p, LOOKUP_FOLLOW, &nd);
> + if (!ret)
> + ret = vfs_getattr(nd.path.mnt,
> + nd.path.dentry, &stat);
> + if (!ret)
> + s->ls_dev = stat.rdev;
> + } else
> + s->ls_dev = simple_strtoul(p, NULL, 0);
> + /* printk("%lx %s\n", (long)s->ls_dev, p); */
> + kfree(p);
> + break;
> +
> + default:
> + printk(KERN_WARNING "unknown parameter or missing value "
> + "'%s' in ls command\n", p);
> + ret = -EINVAL;
> + goto out;
> + }
> + }
> +
> +out:
> + kfree(options);
> + return ret;
> +}
> +
> +/*
> + * Add possible filters here.
> + * No permission check: we cannot verify the path's permission anyway.
> + * We simply demand root privilege for accessing /proc/filecache.
> + */
> +static int may_show_inode(struct session *s, struct inode *inode)
> +{
> + if (!atomic_read(&inode->i_count))
> + return 0;
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> + return 0;
> + if (!inode->i_mapping)
> + return 0;
> +
> + if (s->ls_dev && s->ls_dev != inode->i_sb->s_dev)
> + return 0;
> +
> + if (s->ls_options & LS_OPT_ALL)
> + return 1;
> +
> + if (!(s->ls_options & LS_OPT_EMPTY) && !inode->i_mapping->nrpages)
> + return 0;
> +
> + if ((s->ls_options & LS_OPT_DIRTY) && !(inode->i_state & I_DIRTY))
> + return 0;
> +
> + if ((s->ls_options & LS_OPT_CLEAN) && (inode->i_state & I_DIRTY))
> + return 0;
> +
> + if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> + S_ISLNK(inode->i_mode) || S_ISBLK(inode->i_mode)))
> + return 0;
> +
> + return 1;
> +}
> +
> +/*
> + * Full: there are more data following.
> + */
> +static int iwin_full(struct session *s)
> +{
> + return !s->iwin.cursor ||
> + s->iwin.cursor > s->iwin.origin + s->iwin.size;
> +}
> +
> +static int iwin_push(struct session *s, struct inode *inode)
> +{
> + if (!may_show_inode(s, inode))
> + return 0;
> +
> + s->iwin.cursor++;
> +
> + if (s->iwin.size >= IWIN_SIZE)
> + return 1;
> +
> + if (s->iwin.cursor > s->iwin.origin)
> + s->iwin.inodes[s->iwin.size++] = inode;
> + return 0;
> +}
> +
> +/*
> + * Traverse the inode lists in order - newest first.
> + * And fill @s->iwin.inodes with inodes positioned in [@pos, @pos+IWIN_SIZE).
> + */
> +static int iwin_fill(struct session *s, unsigned long pos)
> +{
> + struct inode *inode;
> + struct super_block *sb;
> +
> + s->iwin.origin = pos;
> + s->iwin.cursor = 0;
> + s->iwin.size = 0;
> +
> + /*
> + * We have a cursor inode, clean and expected to be unchanged.
> + */
> + if (s->ipos.inode && pos >= s->ipos.pos &&
> + !(s->ipos.state & I_DIRTY) &&
> + s->ipos.state == s->ipos.inode->i_state) {
> + inode = s->ipos.inode;
> + s->iwin.cursor = s->ipos.pos;
> + goto continue_from_saved;
> + }
> +
> + if (s->ls_options & LS_OPT_CLEAN)
> + goto clean_inodes;
> +
> + spin_lock(&sb_lock);
> + list_for_each_entry(sb, &super_blocks, s_list) {
> + if (s->ls_dev && s->ls_dev != sb->s_dev)
> + continue;
> +
> + list_for_each_entry(inode, &sb->s_dirty, i_list) {
> + if (iwin_push(s, inode))
> + goto out_full_unlock;
> + }
> + list_for_each_entry(inode, &sb->s_io, i_list) {
> + if (iwin_push(s, inode))
> + goto out_full_unlock;
> + }
> + }
> + spin_unlock(&sb_lock);
> +
> +clean_inodes:
> + list_for_each_entry(inode, &inode_in_use, i_list) {
> + if (iwin_push(s, inode))
> + goto out_full;
> +continue_from_saved:
> + ;
> + }
> +
> + if (s->ls_options & LS_OPT_INUSE)
> + return 0;
> +
> + list_for_each_entry(inode, &inode_unused, i_list) {
> + if (iwin_push(s, inode))
> + goto out_full;
> + }
> +
> + return 0;
> +
> +out_full_unlock:
> + spin_unlock(&sb_lock);
> +out_full:
> + return 1;
> +}
> +
> +static struct inode *iwin_inode(struct session *s, unsigned long pos)
> +{
> + if ((iwin_full(s) && pos >= s->iwin.origin + s->iwin.size)
> + || pos < s->iwin.origin)
> + iwin_fill(s, pos);
> +
> + if (pos >= s->iwin.cursor)
> + return NULL;
> +
> + s->ipos.pos = pos;
> + s->ipos.inode = s->iwin.inodes[pos - s->iwin.origin];
> + BUG_ON(!s->ipos.inode);
> + return s->ipos.inode;
> +}
> +
> +static void show_inode(struct seq_file *m, struct inode *inode)
> +{
> + char state[] = "--"; /* dirty, locked */
> + struct dentry *dentry;
> + loff_t size = i_size_read(inode);
> + unsigned long nrpages;
> + int percent;
> + int refcnt;
> + int shift;
> +
> + if (!size)
> + size++;
> +
> + if (inode->i_mapping)
> + nrpages = inode->i_mapping->nrpages;
> + else {
> + nrpages = 0;
> + WARN_ON(1);
> + }
> +
> + for (shift = 0; (size >> shift) > ULONG_MAX / 128; shift += 12)
> + ;
> + percent = min(100UL, (((100 * nrpages) >> shift) << PAGE_CACHE_SHIFT) /
> + (unsigned long)(size >> shift));
> +
> + if (inode->i_state & (I_DIRTY_DATASYNC|I_DIRTY_PAGES))
> + state[0] = 'D';
> + else if (inode->i_state & I_DIRTY_SYNC)
> + state[0] = 'd';
> +
> + if (inode->i_state & I_LOCK)
> + state[0] = 'L';
> +
> + refcnt = 0;
> + list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
> + refcnt += atomic_read(&dentry->d_count);
> + }
> +
> + seq_printf(m, "%10lu %10llu %8lu %7d ",
> + inode->i_ino,
> + DIV_ROUND_UP(size, 1024),
> + nrpages << (PAGE_CACHE_SHIFT - 10),
> + percent);
> +
> + seq_printf(m, "%6d %5s %9lu ",
> + refcnt,
> + state,
> + (jiffies - inode->dirtied_when) / HZ);
> +
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> + seq_printf(m, "%8u %-16s",
> + inode->i_access_count,
> + inode->i_comm);
> +#endif
> +
> + seq_printf(m, "%02x:%02x(%s)\t",
> + MAJOR(inode->i_sb->s_dev),
> + MINOR(inode->i_sb->s_dev),
> + inode->i_sb->s_id);
> +
> + if (list_empty(&inode->i_dentry)) {
> + if (!atomic_read(&inode->i_count))
> + seq_puts(m, "(noname)\n");
> + else
> + seq_printf(m, "(%02x:%02x)\n",
> + imajor(inode), iminor(inode));
> + } else {
> + struct path path = {
> + .mnt = NULL,
> + .dentry = list_entry(inode->i_dentry.next,
> + struct dentry, d_alias)
> + };
> +
> + seq_path(m, &path, " \t\n\\");
> + seq_putc(m, '\n');
> + }
> +}
> +
> +static int ii_show(struct seq_file *m, void *v)
> +{
> + unsigned long index = *(loff_t *) v;
> + struct session *s = m->private;
> + struct inode *inode;
> +
> + if (index == 0) {
> + seq_puts(m, "# filecache " FILECACHE_VERSION "\n");
> + seq_puts(m, "# ino size cached cached% "
> + "refcnt state age "
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> + "accessed process "
> +#endif
> + "dev\t\tfile\n");
> + }
> +
> + inode = iwin_inode(s, index);
> + show_inode(m, inode);
> +
> + return 0;
> +}
> +
> +static void *ii_start(struct seq_file *m, loff_t *pos)
> +{
> + struct session *s = m->private;
> +
> + s->iwin.size = 0;
> + s->iwin.inodes = (struct inode **)
> + __get_free_pages(GFP_KERNEL, IWIN_PAGE_ORDER);
> + if (!s->iwin.inodes)
> + return NULL;
> +
> + spin_lock(&inode_lock);
> +
> + return iwin_inode(s, *pos) ? pos : NULL;
> +}
> +
> +static void *ii_next(struct seq_file *m, void *v, loff_t *pos)
> +{
> + struct session *s = m->private;
> +
> + (*pos)++;
> + return iwin_inode(s, *pos) ? pos : NULL;
> +}
> +
> +static void ii_stop(struct seq_file *m, void *v)
> +{
> + struct session *s = m->private;
> + struct inode *inode = s->ipos.inode;
> +
> + if (!s->iwin.inodes)
> + return;
> +
> + if (inode) {
> + __iget(inode);
> + s->ipos.state = inode->i_state;
> + }
> + spin_unlock(&inode_lock);
> +
> + free_pages((unsigned long) s->iwin.inodes, IWIN_PAGE_ORDER);
> + if (s->ipos.pinned_inode)
> + iput(s->ipos.pinned_inode);
> + s->ipos.pinned_inode = inode;
> +}
> +
> +/*
> + * Listing of cached page ranges of a file.
> + *
> + * Usage:
> + * echo 'file name' > /proc/filecache
> + * cat /proc/filecache
> + */
> +
> +unsigned long page_mask;
> +#define PG_MMAP PG_lru /* reuse any non-relevant flag */
> +#define PG_BUFFER PG_swapcache /* ditto */
> +#define PG_DIRTY PG_error /* ditto */
> +#define PG_WRITEBACK PG_buddy /* ditto */
> +
> +/*
> + * Page state names, prefixed by their abbreviations.
> + */
> +struct {
> + unsigned long mask;
> + const char *name;
> + int faked;
> +} page_flag [] = {
> + {1 << PG_referenced, "R:referenced", 0},
> + {1 << PG_active, "A:active", 0},
> + {1 << PG_MMAP, "M:mmap", 1},
> +
> + {1 << PG_uptodate, "U:uptodate", 0},
> + {1 << PG_dirty, "D:dirty", 0},
> + {1 << PG_writeback, "W:writeback", 0},
> + {1 << PG_reclaim, "X:readahead", 0},
> +
> + {1 << PG_private, "P:private", 0},
> + {1 << PG_owner_priv_1, "O:owner", 0},
> +
> + {1 << PG_BUFFER, "b:buffer", 1},
> + {1 << PG_DIRTY, "d:dirty", 1},
> + {1 << PG_WRITEBACK, "w:writeback", 1},
> +};
> +
> +static unsigned long page_flags(struct page* page)
> +{
> + unsigned long flags;
> + struct address_space *mapping = page_mapping(page);
> +
> + flags = page->flags & page_mask;
> +
> + if (page_mapped(page))
> + flags |= (1 << PG_MMAP);
> +
> + if (page_has_buffers(page))
> + flags |= (1 << PG_BUFFER);
> +
> + if (mapping) {
> + if (radix_tree_tag_get(&mapping->page_tree,
> + page_index(page),
> + PAGECACHE_TAG_WRITEBACK))
> + flags |= (1 << PG_WRITEBACK);
> +
> + if (radix_tree_tag_get(&mapping->page_tree,
> + page_index(page),
> + PAGECACHE_TAG_DIRTY))
> + flags |= (1 << PG_DIRTY);
> + }
> +
> + return flags;
> +}
> +
> +static int pages_similiar(struct page* page0, struct page* page)
> +{
> + if (page_count(page0) != page_count(page))
> + return 0;
> +
> + if (page_flags(page0) != page_flags(page))
> + return 0;
> +
> + return 1;
> +}
> +
> +static void show_range(struct seq_file *m, struct page* page, unsigned long len)
> +{
> + int i;
> + unsigned long flags;
> +
> + if (!m || !page)
> + return;
> +
> + seq_printf(m, "%lu\t%lu\t", page->index, len);
> +
> + flags = page_flags(page);
> + for (i = 0; i < ARRAY_SIZE(page_flag); i++)
> + seq_putc(m, (flags & page_flag[i].mask) ?
> + page_flag[i].name[0] : '_');
> +
> + seq_printf(m, "\t%d\n", page_count(page));
> +}
> +
> +#define BATCH_LINES 100
> +static pgoff_t show_file_cache(struct seq_file *m,
> + struct address_space *mapping, pgoff_t start)
> +{
> + int i;
> + int lines = 0;
> + pgoff_t len = 0;
> + struct pagevec pvec;
> + struct page *page;
> + struct page *page0 = NULL;
> +
> + for (;;) {
> + pagevec_init(&pvec, 0);
> + pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> + (void **)pvec.pages, start + len, PAGEVEC_SIZE);
> +
> + if (pvec.nr == 0) {
> + show_range(m, page0, len);
> + start = ULONG_MAX;
> + goto out;
> + }
> +
> + if (!page0)
> + page0 = pvec.pages[0];
> +
> + for (i = 0; i < pvec.nr; i++) {
> + page = pvec.pages[i];
> +
> + if (page->index == start + len &&
> + pages_similiar(page0, page))
> + len++;
> + else {
> + show_range(m, page0, len);
> + page0 = page;
> + start = page->index;
> + len = 1;
> + if (++lines > BATCH_LINES)
> + goto out;
> + }
> + }
> + }
> +
> +out:
> + return start;
> +}
> +
> +static int pg_show(struct seq_file *m, void *v)
> +{
> + struct session *s = m->private;
> + struct file *file = s->query_file;
> + pgoff_t offset;
> +
> + if (!file)
> + return ii_show(m, v);
> +
> + offset = *(loff_t *) v;
> +
> + if (!offset) { /* print header */
> + int i;
> +
> + seq_puts(m, "# file ");
> + seq_path(m, &file->f_path, " \t\n\\");
> +
> + seq_puts(m, "\n# flags");
> + for (i = 0; i < ARRAY_SIZE(page_flag); i++)
> + seq_printf(m, " %s", page_flag[i].name);
> +
> + seq_puts(m, "\n# idx\tlen\tstate\t\trefcnt\n");
> + }
> +
> + s->start_offset = offset;
> + s->next_offset = show_file_cache(m, file->f_mapping, offset);
> +
> + return 0;
> +}
> +
> +static void *file_pos(struct file *file, loff_t *pos)
> +{
> + loff_t size = i_size_read(file->f_mapping->host);
> + pgoff_t end = DIV_ROUND_UP(size, PAGE_CACHE_SIZE);
> + pgoff_t offset = *pos;
> +
> + return offset < end ? pos : NULL;
> +}
> +
> +static void *pg_start(struct seq_file *m, loff_t *pos)
> +{
> + struct session *s = m->private;
> + struct file *file = s->query_file;
> + pgoff_t offset = *pos;
> +
> + if (!file)
> + return ii_start(m, pos);
> +
> + rcu_read_lock();
> +
> + if (offset - s->start_offset == 1)
> + *pos = s->next_offset;
> + return file_pos(file, pos);
> +}
> +
> +static void *pg_next(struct seq_file *m, void *v, loff_t *pos)
> +{
> + struct session *s = m->private;
> + struct file *file = s->query_file;
> +
> + if (!file)
> + return ii_next(m, v, pos);
> +
> + *pos = s->next_offset;
> + return file_pos(file, pos);
> +}
> +
> +static void pg_stop(struct seq_file *m, void *v)
> +{
> + struct session *s = m->private;
> + struct file *file = s->query_file;
> +
> + if (!file)
> + return ii_stop(m, v);
> +
> + rcu_read_unlock();
> +}
> +
> +struct seq_operations seq_filecache_op = {
> + .start = pg_start,
> + .next = pg_next,
> + .stop = pg_stop,
> + .show = pg_show,
> +};
> +
> +/*
> + * Implement the manual drop-all-pagecache function
> + */
> +
> +#define MAX_INODES (PAGE_SIZE / sizeof(struct inode *))
> +static int drop_pagecache(void)
> +{
> + struct hlist_head *head;
> + struct hlist_node *node;
> + struct inode *inode;
> + struct inode **inodes;
> + unsigned long i, j, k;
> + int err = 0;
> +
> + inodes = (struct inode **)__get_free_pages(GFP_KERNEL, IWIN_PAGE_ORDER);
> + if (!inodes)
> + return -ENOMEM;
> +
> + for (i = 0; (head = get_inode_hash_budget(i)); i++) {
> + if (hlist_empty(head))
> + continue;
> +
> + j = 0;
> + cond_resched();
> +
> + /*
> + * Grab some inodes.
> + */
> + spin_lock(&inode_lock);
> + hlist_for_each (node, head) {
> + inode = hlist_entry(node, struct inode, i_hash);
> + if (!atomic_read(&inode->i_count))
> + continue;
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> + continue;
> + if (!inode->i_mapping || !inode->i_mapping->nrpages)
> + continue;
> + __iget(inode);
> + inodes[j++] = inode;
> + if (j >= MAX_INODES)
> + break;
> + }
> + spin_unlock(&inode_lock);
> +
> + /*
> + * Free clean pages.
> + */
> + for (k = 0; k < j; k++) {
> + inode = inodes[k];
> + invalidate_mapping_pages(inode->i_mapping, 0, ~1);
> + iput(inode);
> + }
> +
> + /*
> + * Simply ignore the remaining inodes.
> + */
> + if (j >= MAX_INODES && !err) {
> + printk(KERN_WARNING
> + "Too many collisions in inode hash table.\n"
> + "Please boot with a larger ihash_entries=XXX.\n");
> + err = -EAGAIN;
> + }
> + }
> +
> + free_pages((unsigned long) inodes, IWIN_PAGE_ORDER);
> + return err;
> +}
> +
> +static void drop_slabcache(void)
> +{
> + int nr_objects;
> +
> + do {
> + nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
> + } while (nr_objects > 10);
> +}
> +
> +/*
> + * Proc file operations.
> + */
> +
> +static int filecache_open(struct inode *inode, struct file *proc_file)
> +{
> + struct seq_file *m;
> + struct session *s;
> + unsigned size;
> + char *buf = NULL;
> + int ret;
> +
> + if (!try_module_get(THIS_MODULE))
> + return -ENOENT;
> +
> + s = session_create();
> + if (IS_ERR(s)) {
> + ret = PTR_ERR(s);
> + s = NULL; /* don't kfree() an ERR_PTR below */
> + goto out;
> + }
> + set_session(proc_file, s);
> +
> + size = SBUF_SIZE;
> + buf = kmalloc(size, GFP_KERNEL);
> + if (!buf) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ret = seq_open(proc_file, &seq_filecache_op);
> + if (!ret) {
> + m = proc_file->private_data;
> + m->private = s;
> + m->buf = buf;
> + m->size = size;
> + }
> +
> +out:
> + if (ret) {
> + kfree(s);
> + kfree(buf);
> + module_put(THIS_MODULE);
> + }
> + return ret;
> +}
> +
> +static int filecache_release(struct inode *inode, struct file *proc_file)
> +{
> + struct session *s = get_session(proc_file);
> + int ret;
> +
> + session_release(s);
> + ret = seq_release(inode, proc_file);
> + module_put(THIS_MODULE);
> + return ret;
> +}
> +
> +ssize_t filecache_write(struct file *proc_file, const char __user * buffer,
> + size_t count, loff_t *ppos)
> +{
> + struct session *s;
> + char *name;
> + int err = 0;
> +
> + if (count >= PATH_MAX + 5)
> + return -ENAMETOOLONG;
> +
> + name = kmalloc(count+1, GFP_KERNEL);
> + if (!name)
> + return -ENOMEM;
> +
> + if (copy_from_user(name, buffer, count)) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + /* strip the optional newline */
> + if (count && name[count-1] == '\n')
> + name[count-1] = '\0';
> + else
> + name[count] = '\0';
> +
> + s = get_session(proc_file);
> + if (!strcmp(name, "set private")) {
> + s->private_session = 1;
> + goto out;
> + }
> +
> + if (!strncmp(name, "cat ", 4)) {
> + err = session_update_file(s, name+4);
> + goto out;
> + }
> +
> + if (!strncmp(name, "ls", 2)) {
> + err = session_update_file(s, NULL);
> + if (!err)
> + err = ls_parse_options(name+2, s);
> + if (!err && !s->private_session) {
> + global_session.ls_dev = s->ls_dev;
> + global_session.ls_options = s->ls_options;
> + }
> + goto out;
> + }
> +
> + if (!strncmp(name, "drop pagecache", 14)) {
> + err = drop_pagecache();
> + goto out;
> + }
> +
> + if (!strncmp(name, "drop slabcache", 14)) {
> + drop_slabcache();
> + goto out;
> + }
> +
> + /* err = -EINVAL; */
> + err = session_update_file(s, name);
> +
> +out:
> + kfree(name);
> +
> + return err ? err : count;
> +}
> +
> +static struct file_operations proc_filecache_fops = {
> + .owner = THIS_MODULE,
> + .open = filecache_open,
> + .release = filecache_release,
> + .write = filecache_write,
> + .read = seq_read,
> + .llseek = seq_lseek,
> +};
> +
> +
> +static __init int filecache_init(void)
> +{
> + int i;
> + struct proc_dir_entry *entry;
> +
> + entry = create_proc_entry("filecache", 0600, NULL);
> + if (entry)
> + entry->proc_fops = &proc_filecache_fops;
> +
> + for (page_mask = i = 0; i < ARRAY_SIZE(page_flag); i++)
> + if (!page_flag[i].faked)
> + page_mask |= page_flag[i].mask;
> +
> + return 0;
> +}
> +
> +static void filecache_exit(void)
> +{
> + remove_proc_entry("filecache", NULL);
> + if (global_session.query_file)
> + fput(global_session.query_file);
> +}
> +
> +MODULE_AUTHOR("Fengguang Wu <wfg@mail.ustc.edu.cn>");
> +MODULE_LICENSE("GPL");
> +
> +module_init(filecache_init);
> +module_exit(filecache_exit);
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -775,6 +775,11 @@ struct inode {
> void *i_security;
> #endif
> void *i_private; /* fs or device private pointer */
> +
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> + unsigned int i_access_count; /* opened how many times? */
> + char i_comm[16]; /* opened first by which app? */
> +#endif
> };
>
> /*
> @@ -860,6 +865,13 @@ static inline unsigned imajor(const stru
> return MAJOR(inode->i_rdev);
> }
>
> +static inline void inode_accessed(struct inode *inode)
> +{
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> + inode->i_access_count++;
> +#endif
> +}
> +
> extern struct block_device *I_BDEV(struct inode *inode);
>
> struct fown_struct {
> @@ -2171,6 +2183,7 @@ extern void remove_inode_hash(struct ino
> static inline void insert_inode_hash(struct inode *inode) {
> __insert_inode_hash(inode, inode->i_ino);
> }
> +struct hlist_head * get_inode_hash_budget(unsigned long index);
>
> extern struct file * get_empty_filp(void);
> extern void file_move(struct file *f, struct list_head *list);
> --- linux-2.6.orig/fs/open.c
> +++ linux-2.6/fs/open.c
> @@ -842,6 +842,7 @@ static struct file *__dentry_open(struct
> goto cleanup_all;
> }
>
> + inode_accessed(inode);
> f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
>
> file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
> --- linux-2.6.orig/fs/Kconfig
> +++ linux-2.6/fs/Kconfig
> @@ -265,4 +265,34 @@ endif
> source "fs/nls/Kconfig"
> source "fs/dlm/Kconfig"
>
> +config PROC_FILECACHE
> + tristate "/proc/filecache support"
> + default m
> + depends on PROC_FS
> + help
> + This option creates a file /proc/filecache which enables one to
> + query/drop the cached files in memory.
> +
> + A quick start guide:
> +
> + # echo 'ls' > /proc/filecache
> + # head /proc/filecache
> +
> + # echo 'cat /bin/bash' > /proc/filecache
> + # head /proc/filecache
> +
> + # echo 'drop pagecache' > /proc/filecache
> + # echo 'drop slabcache' > /proc/filecache
> +
> + For more details, please check Documentation/filesystems/proc.txt .
> +
> + It can be a handy tool for sysadmins and desktop users.
> +
> +config PROC_FILECACHE_EXTRAS
> + bool "track extra states"
> + default y
> + depends on PROC_FILECACHE
> + help
> + Track extra states that cost a little more time/space.
> +
> endmenu
> --- linux-2.6.orig/fs/proc/Makefile
> +++ linux-2.6/fs/proc/Makefile
> @@ -2,7 +2,8 @@
> # Makefile for the Linux proc filesystem routines.
> #
>
> -obj-$(CONFIG_PROC_FS) += proc.o
> +obj-$(CONFIG_PROC_FS) += proc.o
> +obj-$(CONFIG_PROC_FILECACHE) += filecache.o
>
> proc-y := nommu.o task_nommu.o
> proc-$(CONFIG_MMU) := mmu.o task_mmu.o
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
@ 2009-05-17 13:55 ` Frederic Weisbecker
0 siblings, 0 replies; 137+ messages in thread
From: Frederic Weisbecker @ 2009-05-17 13:55 UTC (permalink / raw)
To: Wu Fengguang
Cc: Ingo Molnar, Li Zefan, Tom Zanussi, KOSAKI Motohiro,
Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
On Sun, May 17, 2009 at 09:36:59PM +0800, Wu Fengguang wrote:
> On Tue, May 12, 2009 at 09:01:12PM +0800, Frederic Weisbecker wrote:
> > On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> > > On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> > >
> > > There are two possible challenges for the conversion:
> > >
> > > - One trick it does is to select different lists to traverse on
> > > different filter options. Will this be possible in the object
> > > tracing framework?
> >
> > Yeah, I guess.
>
> Great.
>
> >
> > > - The file name lookup(last field) is the performance killer. Is it
> > > possible to skip the file name lookup when the filter failed on the
> > > leading fields?
> >
> > objects collection lays on trace events where filters basically ignore
> > a whole entry in case of non-matching. Not sure if we can easily only
> > ignore one field.
> >
> > But I guess we can do something about the performances...
>
> OK, but it's not as important as the previous requirement, so it could
> be the last thing to work on :)
>
> > Could you send us the (sob'ed) patch you made which implements this.
> > I could try to adapt it to object collection.
>
> Attached for your reference. Be aware that I still have plans to
> change it in a non-trivial way, and there is ongoing work by Nick (on
> inode_lock) and Jens (on s_dirty) that can create merge conflicts.
> So basically it is not the right time to do the adaptation.
Ah ok, so I will wait a bit :-)
> However, we can still do something to polish up the page object
> collection under /debug/tracing/objects/mm/pages/. For example,
> the timestamps and function name could be removed from the following
> list :)
>
> # tracer: nop
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> <...>-3743 [001] 3035.649769: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176403: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176407: dump_pages: pfn=2 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176408: dump_pages: pfn=3 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176409: dump_pages: pfn=4 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176409: dump_pages: pfn=5 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176410: dump_pages: pfn=6 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176410: dump_pages: pfn=7 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176411: dump_pages: pfn=8 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176411: dump_pages: pfn=9 flags=400 count=1 mapcount=0 index=0
> <...>-3743 [001] 3044.176412: dump_pages: pfn=10 flags=400 count=1 mapcount=0 index=0
echo nocontext-info > /debug/tracing/trace_options :-)
But you'll have only the function and the page specifics. It's not really the
function but rather the name of the event, which is useful for distinguishing
multiple events in a trace.
Hmm, maybe it's not that useful in an object dump...
Thanks.
> Thanks,
> Fengguang
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -27,6 +27,7 @@ extern unsigned long max_mapnr;
> extern unsigned long num_physpages;
> extern void * high_memory;
> extern int page_cluster;
> +extern char * const zone_names[];
>
> #ifdef CONFIG_SYSCTL
> extern int sysctl_legacy_va_layout;
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -104,7 +104,7 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
>
> EXPORT_SYMBOL(totalram_pages);
>
> -static char * const zone_names[MAX_NR_ZONES] = {
> +char * const zone_names[MAX_NR_ZONES] = {
> #ifdef CONFIG_ZONE_DMA
> "DMA",
> #endif
> --- linux-2.6.orig/fs/dcache.c
> +++ linux-2.6/fs/dcache.c
> @@ -1925,7 +1925,10 @@ char *__d_path(const struct path *path,
>
> if (dentry == root->dentry && vfsmnt == root->mnt)
> break;
> - if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
> + if (unlikely(!vfsmnt)) {
> + if (IS_ROOT(dentry))
> + break;
> + } else if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
> /* Global root? */
> if (vfsmnt->mnt_parent == vfsmnt) {
> goto global_root;
> --- linux-2.6.orig/lib/radix-tree.c
> +++ linux-2.6/lib/radix-tree.c
> @@ -564,7 +564,6 @@ out:
> }
> EXPORT_SYMBOL(radix_tree_tag_clear);
>
> -#ifndef __KERNEL__ /* Only the test harness uses this at present */
> /**
> * radix_tree_tag_get - get a tag on a radix tree node
> * @root: radix tree root
> @@ -627,7 +626,6 @@ int radix_tree_tag_get(struct radix_tree
> }
> }
> EXPORT_SYMBOL(radix_tree_tag_get);
> -#endif
>
> /**
> * radix_tree_next_hole - find the next hole (not-present entry)
> --- linux-2.6.orig/fs/inode.c
> +++ linux-2.6/fs/inode.c
> @@ -84,6 +84,10 @@ static struct hlist_head *inode_hashtabl
> */
> DEFINE_SPINLOCK(inode_lock);
>
> +EXPORT_SYMBOL(inode_in_use);
> +EXPORT_SYMBOL(inode_unused);
> +EXPORT_SYMBOL(inode_lock);
> +
> /*
> * iprune_mutex provides exclusion between the kswapd or try_to_free_pages
> * icache shrinking path, and the umount path. Without this exclusion,
> @@ -110,6 +114,13 @@ static void wake_up_inode(struct inode *
> wake_up_bit(&inode->i_state, __I_LOCK);
> }
>
> +static inline void inode_created_by(struct inode *inode, struct task_struct *task)
> +{
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> + memcpy(inode->i_comm, task->comm, sizeof(task->comm));
> +#endif
> +}
> +
> /**
> * inode_init_always - perform inode structure intialisation
> * @sb: superblock inode belongs to
> @@ -147,7 +158,7 @@ struct inode *inode_init_always(struct s
> inode->i_bdev = NULL;
> inode->i_cdev = NULL;
> inode->i_rdev = 0;
> - inode->dirtied_when = 0;
> + inode->dirtied_when = jiffies;
>
> if (security_inode_alloc(inode))
> goto out_free_inode;
> @@ -188,6 +199,7 @@ struct inode *inode_init_always(struct s
> }
> inode->i_private = NULL;
> inode->i_mapping = mapping;
> + inode_created_by(inode, current);
>
> return inode;
>
> @@ -276,6 +288,8 @@ void __iget(struct inode *inode)
> inodes_stat.nr_unused--;
> }
>
> +EXPORT_SYMBOL(__iget);
> +
> /**
> * clear_inode - clear an inode
> * @inode: inode to clear
> @@ -1459,6 +1473,16 @@ static void __wait_on_freeing_inode(stru
> spin_lock(&inode_lock);
> }
>
> +
> +struct hlist_head * get_inode_hash_budget(unsigned long index)
> +{
> + if (index >= (1 << i_hash_shift))
> + return NULL;
> +
> + return inode_hashtable + index;
> +}
> +EXPORT_SYMBOL_GPL(get_inode_hash_budget);
> +
> static __initdata unsigned long ihash_entries;
> static int __init set_ihash_entries(char *str)
> {
> --- linux-2.6.orig/fs/super.c
> +++ linux-2.6/fs/super.c
> @@ -46,6 +46,9 @@
> LIST_HEAD(super_blocks);
> DEFINE_SPINLOCK(sb_lock);
>
> +EXPORT_SYMBOL(super_blocks);
> +EXPORT_SYMBOL(sb_lock);
> +
> /**
> * alloc_super - create new superblock
> * @type: filesystem type superblock should belong to
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -262,6 +262,7 @@ unsigned long shrink_slab(unsigned long
> up_read(&shrinker_rwsem);
> return ret;
> }
> +EXPORT_SYMBOL(shrink_slab);
>
> /* Called without lock on whether page is mapped, so answer is unstable */
> static inline int page_mapping_inuse(struct page *page)
> --- linux-2.6.orig/mm/swap_state.c
> +++ linux-2.6/mm/swap_state.c
> @@ -45,6 +45,7 @@ struct address_space swapper_space = {
> .i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
> .backing_dev_info = &swap_backing_dev_info,
> };
> +EXPORT_SYMBOL_GPL(swapper_space);
>
> #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
>
> --- linux-2.6.orig/Documentation/filesystems/proc.txt
> +++ linux-2.6/Documentation/filesystems/proc.txt
> @@ -260,6 +260,7 @@ Table 1-4: Kernel info in /proc
> driver Various drivers grouped here, currently rtc (2.4)
> execdomains Execdomains, related to security (2.4)
> fb Frame Buffer devices (2.4)
> + filecache Query/drop in-memory file cache
> fs File system parameters, currently nfs/exports (2.4)
> ide Directory containing info about the IDE subsystem
> interrupts Interrupt usage
> @@ -450,6 +451,88 @@ varies by architecture and compile optio
>
> > cat /proc/meminfo
>
> +..............................................................................
> +
> +filecache:
> +
> +Provides access to the in-memory file cache.
> +
> +To list an index of all cached files:
> +
> + echo ls > /proc/filecache
> + cat /proc/filecache
> +
> +The output looks like:
> +
> + # filecache 1.0
> + # ino size cached cached% state refcnt dev file
> + 1026334 91 92 100 -- 66 03:02(hda2) /lib/ld-2.3.6.so
> + 233608 1242 972 78 -- 66 03:02(hda2) /lib/tls/libc-2.3.6.so
> + 65203 651 476 73 -- 1 03:02(hda2) /bin/bash
> + 1026445 261 160 61 -- 10 03:02(hda2) /lib/libncurses.so.5.5
> + 235427 10 12 100 -- 44 03:02(hda2) /lib/tls/libdl-2.3.6.so
> +
> +FIELD INTRO
> +---------------------------------------------------------------------------
> +ino inode number
> +size inode size in KB
> +cached cached size in KB
> +cached% percent of file data cached
> +state1 '-' clean; 'd' metadata dirty; 'D' data dirty
> +state2 '-' unlocked; 'L' locked, normally indicates file being written out
> +refcnt file reference count, it's an in-kernel one, not exactly open count
> +dev major:minor numbers in hex, followed by a descriptive device name
> +file file path _inside_ the filesystem. There are several special names:
> + '(noname)': the file name is not available
> + '(03:02)': the file is a block device file of major:minor
> + '...(deleted)': the named file has been deleted from the disk
> +
> +To list the cached pages of a particular file:
> +
> + echo /bin/bash > /proc/filecache
> + cat /proc/filecache
> +
> + # file /bin/bash
> + # flags R:referenced A:active U:uptodate D:dirty W:writeback M:mmap
> + # idx len state refcnt
> + 0 36 RAU__M 3
> + 36 1 RAU__M 2
> + 37 8 RAU__M 3
> + 45 2 RAU___ 1
> + 47 6 RAU__M 3
> + 53 3 RAU__M 2
> + 56 2 RAU__M 3
> +
> +FIELD INTRO
> +----------------------------------------------------------------------------
> +idx page index
> +len number of pages which are cached and share the same state
> +state page state of the flags listed in line two
> +refcnt page reference count
> +
> +Careful users may notice that the file name to be queried is remembered between
> +commands. Internally, the module has a global variable to store the file name
> +parameter, so that it can be inherited by a newly opened /proc/filecache file.
> +However, it can lead to interference between multiple queriers. The solution is
> +to obey a rule: only root may interactively change the file name parameter;
> +normal users must use scripts to access the interface. Scripts should follow
> +the code example below:
> +
> + filecache = open("/proc/filecache", "rw");
> + # avoid polluting the global parameter filename
> + filecache.write("set private");
> +
> +To instruct the kernel to drop clean caches, dentries and inodes from memory,
> +causing that memory to become free:
> +
> + # drop clean file data cache (i.e. file backed pagecache)
> + echo drop pagecache > /proc/filecache
> +
> + # drop clean file metadata cache (i.e. dentries and inodes)
> + echo drop slabcache > /proc/filecache
> +
> +Note that the drop commands are non-destructive operations. Since dirty
> +objects are not freeable, the user should run `sync' first.
>
> MemTotal: 16344972 kB
> MemFree: 13634064 kB
> --- /dev/null
> +++ linux-2.6/fs/proc/filecache.c
> @@ -0,0 +1,1045 @@
> +/*
> + * fs/proc/filecache.c
> + *
> + * Copyright (C) 2006, 2007 Fengguang Wu <wfg@mail.ustc.edu.cn>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/radix-tree.h>
> +#include <linux/page-flags.h>
> +#include <linux/pagevec.h>
> +#include <linux/pagemap.h>
> +#include <linux/vmalloc.h>
> +#include <linux/writeback.h>
> +#include <linux/buffer_head.h>
> +#include <linux/parser.h>
> +#include <linux/proc_fs.h>
> +#include <linux/seq_file.h>
> +#include <linux/file.h>
> +#include <linux/namei.h>
> +#include <linux/module.h>
> +#include <asm/uaccess.h>
> +
> +/*
> + * Increase minor version when new columns are added;
> + * Increase major version when existing columns are changed.
> + */
> +#define FILECACHE_VERSION "1.0"
> +
> +/* Internal buffer sizes. The larger, the more efficient. */
> +#define SBUF_SIZE (128<<10)
> +#define IWIN_PAGE_ORDER 3
> +#define IWIN_SIZE ((PAGE_SIZE<<IWIN_PAGE_ORDER) / sizeof(struct inode *))
> +
> +/*
> + * Session management.
> + *
> + * Each opened /proc/filecache file is associated with a session object.
> + * Also there is a global_session that maintains status across open()/close()
> + * (i.e. the lifetime of an opened file), so that a casual user can query the
> + * filecache via _multiple_ simple shell commands like
> + * 'echo cat /bin/bash > /proc/filecache; cat /proc/filecache'.
> + *
> + * session.query_file is the file whose cache info is to be queried.
> + * Its value determines what we get on read():
> + * - NULL: ii_*() called to show the inode index
> + * - filp: pg_*() called to show the page groups of a filp
> + *
> + * session.query_file is
> + * - cloned from global_session.query_file on open();
> + * - updated on write("cat filename");
> + * note that the new file will also be saved in global_session.query_file if
> + * session.private_session is false.
> + */
> +
> +struct session {
> + /* options */
> + int private_session;
> + unsigned long ls_options;
> + dev_t ls_dev;
> +
> + /* parameters */
> + struct file *query_file;
> +
> + /* seqfile pos */
> + pgoff_t start_offset;
> + pgoff_t next_offset;
> +
> + /* inode at last pos */
> + struct {
> + unsigned long pos;
> + unsigned long state;
> + struct inode *inode;
> + struct inode *pinned_inode;
> + } ipos;
> +
> + /* inode window */
> + struct {
> + unsigned long cursor;
> + unsigned long origin;
> + unsigned long size;
> + struct inode **inodes;
> + } iwin;
> +};
> +
> +static struct session global_session;
> +
> +/*
> + * Session address is stored in proc_file->f_ra.start:
> + * we assume that there will be no readahead for proc_file.
> + */
> +static struct session *get_session(struct file *proc_file)
> +{
> + return (struct session *)proc_file->f_ra.start;
> +}
> +
> +static void set_session(struct file *proc_file, struct session *s)
> +{
> + BUG_ON(proc_file->f_ra.start);
> + proc_file->f_ra.start = (unsigned long)s;
> +}
> +
> +static void update_global_file(struct session *s)
> +{
> + if (s->private_session)
> + return;
> +
> + if (global_session.query_file)
> + fput(global_session.query_file);
> +
> + global_session.query_file = s->query_file;
> +
> + if (global_session.query_file)
> + get_file(global_session.query_file);
> +}
> +
> +/*
> + * Cases of the name:
> + * 1) NULL (new session)
> + * s->query_file = global_session.query_file = 0;
> + * 2) "" (ls/la)
> + * s->query_file = global_session.query_file;
> + * 3) a regular file name (cat newfile)
> + * s->query_file = global_session.query_file = newfile;
> + */
> +static int session_update_file(struct session *s, char *name)
> +{
> + static DEFINE_MUTEX(mutex); /* protects global_session.query_file */
> + int err = 0;
> +
> + mutex_lock(&mutex);
> +
> + /*
> + * We are to quit, or to list the cached files.
> + * Reset *.query_file.
> + */
> + if (!name) {
> + if (s->query_file) {
> + fput(s->query_file);
> + s->query_file = NULL;
> + }
> + update_global_file(s);
> + goto out;
> + }
> +
> + /*
> + * This is a new session.
> + * Inherit options/parameters from global ones.
> + */
> + if (name[0] == '\0') {
> + *s = global_session;
> + if (s->query_file)
> + get_file(s->query_file);
> + goto out;
> + }
> +
> + /*
> + * Open the named file.
> + */
> + if (s->query_file)
> + fput(s->query_file);
> + s->query_file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
> + if (IS_ERR(s->query_file)) {
> + err = PTR_ERR(s->query_file);
> + s->query_file = NULL;
> + } else
> + update_global_file(s);
> +
> +out:
> + mutex_unlock(&mutex);
> +
> + return err;
> +}
> +
> +static struct session *session_create(void)
> +{
> + struct session *s;
> + int err = 0;
> +
> + s = kmalloc(sizeof(*s), GFP_KERNEL);
> + if (s)
> + err = session_update_file(s, "");
> + else
> + err = -ENOMEM;
> +
> + return err ? ERR_PTR(err) : s;
> +}
> +
> +static void session_release(struct session *s)
> +{
> + if (s->ipos.pinned_inode)
> + iput(s->ipos.pinned_inode);
> + if (s->query_file)
> + fput(s->query_file);
> + kfree(s);
> +}
> +
> +
> +/*
> + * Listing of cached files.
> + *
> + * Usage:
> + * echo > /proc/filecache # enter listing mode
> + * cat /proc/filecache # get the file listing
> + */
> +
> +/* code style borrowed from ib_srp.c */
> +enum {
> + LS_OPT_ERR = 0,
> + LS_OPT_DIRTY = 1 << 0,
> + LS_OPT_CLEAN = 1 << 1,
> + LS_OPT_INUSE = 1 << 2,
> + LS_OPT_EMPTY = 1 << 3,
> + LS_OPT_ALL = 1 << 4,
> + LS_OPT_DEV = 1 << 5,
> +};
> +
> +static match_table_t ls_opt_tokens = {
> + { LS_OPT_DIRTY, "dirty" },
> + { LS_OPT_CLEAN, "clean" },
> + { LS_OPT_INUSE, "inuse" },
> + { LS_OPT_EMPTY, "empty" },
> + { LS_OPT_ALL, "all" },
> + { LS_OPT_DEV, "dev=%s" },
> + { LS_OPT_ERR, NULL }
> +};
> +
> +static int ls_parse_options(const char *buf, struct session *s)
> +{
> + substring_t args[MAX_OPT_ARGS];
> + char *options, *sep_opt;
> + char *p;
> + int token;
> + int ret = 0;
> +
> + if (!buf)
> + return 0;
> + options = kstrdup(buf, GFP_KERNEL);
> + if (!options)
> + return -ENOMEM;
> +
> + s->ls_options = 0;
> + sep_opt = options;
> + while ((p = strsep(&sep_opt, " ")) != NULL) {
> + if (!*p)
> + continue;
> +
> + token = match_token(p, ls_opt_tokens, args);
> +
> + switch (token) {
> + case LS_OPT_DIRTY:
> + case LS_OPT_CLEAN:
> + case LS_OPT_INUSE:
> + case LS_OPT_EMPTY:
> + case LS_OPT_ALL:
> + s->ls_options |= token;
> + break;
> + case LS_OPT_DEV:
> + p = match_strdup(args);
> + if (!p) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + if (*p == '/') {
> + struct kstat stat;
> + struct nameidata nd;
> + ret = path_lookup(p, LOOKUP_FOLLOW, &nd);
> + if (!ret)
> + ret = vfs_getattr(nd.path.mnt,
> + nd.path.dentry, &stat);
> + if (!ret)
> + s->ls_dev = stat.rdev;
> + } else
> + s->ls_dev = simple_strtoul(p, NULL, 0);
> + /* printk("%lx %s\n", (long)s->ls_dev, p); */
> + kfree(p);
> + break;
> +
> + default:
> + printk(KERN_WARNING "unknown parameter or missing value "
> + "'%s' in ls command\n", p);
> + ret = -EINVAL;
> + goto out;
> + }
> + }
> +
> +out:
> + kfree(options);
> + return ret;
> +}
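ls_parse_options() above boils down to strsep()-based whitespace tokenizing into an option bitmask. A self-contained user-space sketch of the same approach, minus the dev= handling (the token names mirror the LS_OPT_* enum, but this is an illustrative model, not the kernel code):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>

enum {
	OPT_DIRTY = 1 << 0,
	OPT_CLEAN = 1 << 1,
	OPT_INUSE = 1 << 2,
	OPT_EMPTY = 1 << 3,
	OPT_ALL   = 1 << 4,
};

/* Split `options` on spaces and OR the recognized tokens into a mask.
 * Returns -1 on an unknown token. Modifies the input string, like
 * strsep() always does. */
static int parse_ls_options(char *options)
{
	int mask = 0;
	char *p;

	while ((p = strsep(&options, " ")) != NULL) {
		if (!*p)
			continue;	/* skip runs of spaces */
		if (!strcmp(p, "dirty"))
			mask |= OPT_DIRTY;
		else if (!strcmp(p, "clean"))
			mask |= OPT_CLEAN;
		else if (!strcmp(p, "inuse"))
			mask |= OPT_INUSE;
		else if (!strcmp(p, "empty"))
			mask |= OPT_EMPTY;
		else if (!strcmp(p, "all"))
			mask |= OPT_ALL;
		else
			return -1;	/* unknown token */
	}
	return mask;
}
```

The kernel version gets the same effect table-driven via match_token(), which also handles the "dev=%s" pattern.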
> +
> +/*
> + * Add possible filters here.
> + * No permission check: we cannot verify the path's permission anyway.
> + * We simply require root privilege for accessing /proc/filecache.
> + */
> +static int may_show_inode(struct session *s, struct inode *inode)
> +{
> + if (!atomic_read(&inode->i_count))
> + return 0;
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> + return 0;
> + if (!inode->i_mapping)
> + return 0;
> +
> + if (s->ls_dev && s->ls_dev != inode->i_sb->s_dev)
> + return 0;
> +
> + if (s->ls_options & LS_OPT_ALL)
> + return 1;
> +
> + if (!(s->ls_options & LS_OPT_EMPTY) && !inode->i_mapping->nrpages)
> + return 0;
> +
> + if ((s->ls_options & LS_OPT_DIRTY) && !(inode->i_state & I_DIRTY))
> + return 0;
> +
> + if ((s->ls_options & LS_OPT_CLEAN) && (inode->i_state & I_DIRTY))
> + return 0;
> +
> + if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> + S_ISLNK(inode->i_mode) || S_ISBLK(inode->i_mode)))
> + return 0;
> +
> + return 1;
> +}
> +
> +/*
> + * Full: there is more data to follow.
> + */
> +static int iwin_full(struct session *s)
> +{
> + return !s->iwin.cursor ||
> + s->iwin.cursor > s->iwin.origin + s->iwin.size;
> +}
> +
> +static int iwin_push(struct session *s, struct inode *inode)
> +{
> + if (!may_show_inode(s, inode))
> + return 0;
> +
> + s->iwin.cursor++;
> +
> + if (s->iwin.size >= IWIN_SIZE)
> + return 1;
> +
> + if (s->iwin.cursor > s->iwin.origin)
> + s->iwin.inodes[s->iwin.size++] = inode;
> + return 0;
> +}
> +
> +/*
> + * Traverse the inode lists in order - newest first.
> + * And fill @s->iwin.inodes with inodes positioned in [@pos, @pos+IWIN_SIZE).
> + */
> +static int iwin_fill(struct session *s, unsigned long pos)
> +{
> + struct inode *inode;
> + struct super_block *sb;
> +
> + s->iwin.origin = pos;
> + s->iwin.cursor = 0;
> + s->iwin.size = 0;
> +
> + /*
> + * We have a cursor inode, clean and expected to be unchanged.
> + */
> + if (s->ipos.inode && pos >= s->ipos.pos &&
> + !(s->ipos.state & I_DIRTY) &&
> + s->ipos.state == s->ipos.inode->i_state) {
> + inode = s->ipos.inode;
> + s->iwin.cursor = s->ipos.pos;
> + goto continue_from_saved;
> + }
> +
> + if (s->ls_options & LS_OPT_CLEAN)
> + goto clean_inodes;
> +
> + spin_lock(&sb_lock);
> + list_for_each_entry(sb, &super_blocks, s_list) {
> + if (s->ls_dev && s->ls_dev != sb->s_dev)
> + continue;
> +
> + list_for_each_entry(inode, &sb->s_dirty, i_list) {
> + if (iwin_push(s, inode))
> + goto out_full_unlock;
> + }
> + list_for_each_entry(inode, &sb->s_io, i_list) {
> + if (iwin_push(s, inode))
> + goto out_full_unlock;
> + }
> + }
> + spin_unlock(&sb_lock);
> +
> +clean_inodes:
> + list_for_each_entry(inode, &inode_in_use, i_list) {
> + if (iwin_push(s, inode))
> + goto out_full;
> +continue_from_saved:
> + ;
> + }
> +
> + if (s->ls_options & LS_OPT_INUSE)
> + return 0;
> +
> + list_for_each_entry(inode, &inode_unused, i_list) {
> + if (iwin_push(s, inode))
> + goto out_full;
> + }
> +
> + return 0;
> +
> +out_full_unlock:
> + spin_unlock(&sb_lock);
> +out_full:
> + return 1;
> +}
> +
> +static struct inode *iwin_inode(struct session *s, unsigned long pos)
> +{
> + if ((iwin_full(s) && pos >= s->iwin.origin + s->iwin.size)
> + || pos < s->iwin.origin)
> + iwin_fill(s, pos);
> +
> + if (pos >= s->iwin.cursor)
> + return NULL;
> +
> + s->ipos.pos = pos;
> + s->ipos.inode = s->iwin.inodes[pos - s->iwin.origin];
> + BUG_ON(!s->ipos.inode);
> + return s->ipos.inode;
> +}
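The iwin helpers implement a small sliding window: a request inside [origin, origin+size) hits the cached inodes[] array, anything else triggers a refill scan from the head of the lists. A simplified user-space model of that windowed cursor, using a plain array as the backing list (names and the refill policy are illustrative only):

```c
#include <assert.h>
#include <stddef.h>

#define WIN_SIZE 4

struct win {
	size_t origin;		/* position of window start */
	size_t size;		/* valid entries in items[] */
	int items[WIN_SIZE];
};

/* Re-scan the backing array, caching up to WIN_SIZE items from `pos`. */
static void win_fill(struct win *w, const int *src, size_t n, size_t pos)
{
	size_t i;

	w->origin = pos;
	w->size = 0;
	for (i = pos; i < n && w->size < WIN_SIZE; i++)
		w->items[w->size++] = src[i];
}

/* Return the item at `pos`, refilling the window when `pos` falls
 * outside it; NULL once past the end of the backing list. */
static const int *win_get(struct win *w, const int *src, size_t n, size_t pos)
{
	if (pos < w->origin || pos >= w->origin + w->size)
		win_fill(w, src, n, pos);
	if (pos >= w->origin + w->size)
		return NULL;
	return &w->items[pos - w->origin];
}
```

The kernel code has to do more work on refill (it walks live inode lists that can change between reads, hence the pinned cursor inode in ii_stop()), but the windowing logic is the same.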
> +
> +static void show_inode(struct seq_file *m, struct inode *inode)
> +{
> + char state[] = "--"; /* dirty, locked */
> + struct dentry *dentry;
> + loff_t size = i_size_read(inode);
> + unsigned long nrpages;
> + int percent;
> + int refcnt;
> + int shift;
> +
> + if (!size)
> + size++;
> +
> + if (inode->i_mapping)
> + nrpages = inode->i_mapping->nrpages;
> + else {
> + nrpages = 0;
> + WARN_ON(1);
> + }
> +
> + for (shift = 0; (size >> shift) > ULONG_MAX / 128; shift += 12)
> + ;
> + percent = min(100UL, (((100 * nrpages) >> shift) << PAGE_CACHE_SHIFT) /
> + (unsigned long)(size >> shift));
> +
> + if (inode->i_state & (I_DIRTY_DATASYNC|I_DIRTY_PAGES))
> + state[0] = 'D';
> + else if (inode->i_state & I_DIRTY_SYNC)
> + state[0] = 'd';
> +
> + if (inode->i_state & I_LOCK)
> + state[0] = 'L';
> +
> + refcnt = 0;
> + list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
> + refcnt += atomic_read(&dentry->d_count);
> + }
> +
> + seq_printf(m, "%10lu %10llu %8lu %7d ",
> + inode->i_ino,
> + DIV_ROUND_UP(size, 1024),
> + nrpages << (PAGE_CACHE_SHIFT - 10),
> + percent);
> +
> + seq_printf(m, "%6d %5s %9lu ",
> + refcnt,
> + state,
> + (jiffies - inode->dirtied_when) / HZ);
> +
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> + seq_printf(m, "%8u %-16s",
> + inode->i_access_count,
> + inode->i_comm);
> +#endif
> +
> + seq_printf(m, "%02x:%02x(%s)\t",
> + MAJOR(inode->i_sb->s_dev),
> + MINOR(inode->i_sb->s_dev),
> + inode->i_sb->s_id);
> +
> + if (list_empty(&inode->i_dentry)) {
> + if (!atomic_read(&inode->i_count))
> + seq_puts(m, "(noname)\n");
> + else
> + seq_printf(m, "(%02x:%02x)\n",
> + imajor(inode), iminor(inode));
> + } else {
> + struct path path = {
> + .mnt = NULL,
> + .dentry = list_entry(inode->i_dentry.next,
> + struct dentry, d_alias)
> + };
> +
> + seq_path(m, &path, " \t\n\\");
> + seq_putc(m, '\n');
> + }
> +}
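The cached-percent computation in show_inode() avoids overflow for very large files by shifting both numerator and denominator down until the intermediate product fits in an unsigned long. A standalone sketch of the same trick (PAGE_SHIFT is hard-coded to 12 here for illustration; the kernel uses PAGE_CACHE_SHIFT):

```c
#include <assert.h>
#include <limits.h>

#define PAGE_SHIFT 12

/* Percentage of a file of `size` bytes covered by `nrpages` cached
 * pages, capped at 100. Scales both sides down by the same shift so
 * (100 * nrpages) << PAGE_SHIFT cannot overflow. Assumes size >= 1. */
static unsigned long cached_percent(unsigned long nrpages,
				    unsigned long long size)
{
	int shift;
	unsigned long pct;

	for (shift = 0; (size >> shift) > ULONG_MAX / 128; shift += PAGE_SHIFT)
		;
	pct = (((100 * nrpages) >> shift) << PAGE_SHIFT) /
	      (unsigned long)(size >> shift);
	return pct < 100 ? pct : 100;
}
```

The cap at 100 matters because a sparse or truncated file can have more cached pages than its apparent size suggests.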
> +
> +static int ii_show(struct seq_file *m, void *v)
> +{
> + unsigned long index = *(loff_t *) v;
> + struct session *s = m->private;
> + struct inode *inode;
> +
> + if (index == 0) {
> + seq_puts(m, "# filecache " FILECACHE_VERSION "\n");
> + seq_puts(m, "# ino size cached cached% "
> + "refcnt state age "
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> + "accessed process "
> +#endif
> + "dev\t\tfile\n");
> + }
> +
> + inode = iwin_inode(s, index);
> + show_inode(m, inode);
> +
> + return 0;
> +}
> +
> +static void *ii_start(struct seq_file *m, loff_t *pos)
> +{
> + struct session *s = m->private;
> +
> + s->iwin.size = 0;
> + s->iwin.inodes = (struct inode **)
> + __get_free_pages(GFP_KERNEL, IWIN_PAGE_ORDER);
> + if (!s->iwin.inodes)
> + return NULL;
> +
> + spin_lock(&inode_lock);
> +
> + return iwin_inode(s, *pos) ? pos : NULL;
> +}
> +
> +static void *ii_next(struct seq_file *m, void *v, loff_t *pos)
> +{
> + struct session *s = m->private;
> +
> + (*pos)++;
> + return iwin_inode(s, *pos) ? pos : NULL;
> +}
> +
> +static void ii_stop(struct seq_file *m, void *v)
> +{
> + struct session *s = m->private;
> + struct inode *inode = s->ipos.inode;
> +
> + if (!s->iwin.inodes)
> + return;
> +
> + if (inode) {
> + __iget(inode);
> + s->ipos.state = inode->i_state;
> + }
> + spin_unlock(&inode_lock);
> +
> + free_pages((unsigned long) s->iwin.inodes, IWIN_PAGE_ORDER);
> + if (s->ipos.pinned_inode)
> + iput(s->ipos.pinned_inode);
> + s->ipos.pinned_inode = inode;
> +}
> +
> +/*
> + * Listing of cached page ranges of a file.
> + *
> + * Usage:
> + * echo 'file name' > /proc/filecache
> + * cat /proc/filecache
> + */
> +
> +unsigned long page_mask;
> +#define PG_MMAP PG_lru /* reuse any non-relevant flag */
> +#define PG_BUFFER PG_swapcache /* ditto */
> +#define PG_DIRTY PG_error /* ditto */
> +#define PG_WRITEBACK PG_buddy /* ditto */
> +
> +/*
> + * Page state names, prefixed by their abbreviations.
> + */
> +struct {
> + unsigned long mask;
> + const char *name;
> + int faked;
> +} page_flag [] = {
> + {1 << PG_referenced, "R:referenced", 0},
> + {1 << PG_active, "A:active", 0},
> + {1 << PG_MMAP, "M:mmap", 1},
> +
> + {1 << PG_uptodate, "U:uptodate", 0},
> + {1 << PG_dirty, "D:dirty", 0},
> + {1 << PG_writeback, "W:writeback", 0},
> + {1 << PG_reclaim, "X:readahead", 0},
> +
> + {1 << PG_private, "P:private", 0},
> + {1 << PG_owner_priv_1, "O:owner", 0},
> +
> + {1 << PG_BUFFER, "b:buffer", 1},
> + {1 << PG_DIRTY, "d:dirty", 1},
> + {1 << PG_WRITEBACK, "w:writeback", 1},
> +};
> +
> +static unsigned long page_flags(struct page* page)
> +{
> + unsigned long flags;
> + struct address_space *mapping = page_mapping(page);
> +
> + flags = page->flags & page_mask;
> +
> + if (page_mapped(page))
> + flags |= (1 << PG_MMAP);
> +
> + if (page_has_buffers(page))
> + flags |= (1 << PG_BUFFER);
> +
> + if (mapping) {
> + if (radix_tree_tag_get(&mapping->page_tree,
> + page_index(page),
> + PAGECACHE_TAG_WRITEBACK))
> + flags |= (1 << PG_WRITEBACK);
> +
> + if (radix_tree_tag_get(&mapping->page_tree,
> + page_index(page),
> + PAGECACHE_TAG_DIRTY))
> + flags |= (1 << PG_DIRTY);
> + }
> +
> + return flags;
> +}
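page_flags() copies only the relevant real flag bits through page_mask, then ORs in derived pseudo flags that borrow otherwise-unused bits (PG_lru, PG_swapcache, ...). A minimal model of that mask-and-augment pattern - the bit positions here are made up for illustration, not the real PG_* values:

```c
#include <assert.h>

#define F_REFERENCED (1u << 0)
#define F_UPTODATE   (1u << 1)
#define F_MMAP       (1u << 2)	/* pseudo flag, reusing a spare bit */
#define F_BUFFER     (1u << 3)	/* pseudo flag */

/* Only these bits are copied from the raw flags word; everything else
 * (including the bits we reuse for pseudo flags) is masked off first. */
#define REAL_MASK (F_REFERENCED | F_UPTODATE)

static unsigned int page_state(unsigned int raw, int mapped, int has_buffers)
{
	unsigned int flags = raw & REAL_MASK;	/* drop unrelated bits */

	if (mapped)
		flags |= F_MMAP;
	if (has_buffers)
		flags |= F_BUFFER;
	return flags;
}
```

Masking before augmenting is what makes the bit reuse safe: a page that happens to have the borrowed bit set cannot leak it into the output.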
> +
> +static int pages_similiar(struct page* page0, struct page* page)
> +{
> + if (page_count(page0) != page_count(page))
> + return 0;
> +
> + if (page_flags(page0) != page_flags(page))
> + return 0;
> +
> + return 1;
> +}
> +
> +static void show_range(struct seq_file *m, struct page* page, unsigned long len)
> +{
> + int i;
> + unsigned long flags;
> +
> + if (!m || !page)
> + return;
> +
> + seq_printf(m, "%lu\t%lu\t", page->index, len);
> +
> + flags = page_flags(page);
> + for (i = 0; i < ARRAY_SIZE(page_flag); i++)
> + seq_putc(m, (flags & page_flag[i].mask) ?
> + page_flag[i].name[0] : '_');
> +
> + seq_printf(m, "\t%d\n", page_count(page));
> +}
> +
> +#define BATCH_LINES 100
> +static pgoff_t show_file_cache(struct seq_file *m,
> + struct address_space *mapping, pgoff_t start)
> +{
> + int i;
> + int lines = 0;
> + pgoff_t len = 0;
> + struct pagevec pvec;
> + struct page *page;
> + struct page *page0 = NULL;
> +
> + for (;;) {
> + pagevec_init(&pvec, 0);
> + pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> + (void **)pvec.pages, start + len, PAGEVEC_SIZE);
> +
> + if (pvec.nr == 0) {
> + show_range(m, page0, len);
> + start = ULONG_MAX;
> + goto out;
> + }
> +
> + if (!page0)
> + page0 = pvec.pages[0];
> +
> + for (i = 0; i < pvec.nr; i++) {
> + page = pvec.pages[i];
> +
> + if (page->index == start + len &&
> + pages_similiar(page0, page))
> + len++;
> + else {
> + show_range(m, page0, len);
> + page0 = page;
> + start = page->index;
> + len = 1;
> + if (++lines > BATCH_LINES)
> + goto out;
> + }
> + }
> + }
> +
> +out:
> + return start;
> +}
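show_file_cache() above emits one line per run of consecutive, identically-stated pages rather than one line per page, which is what keeps the output compact for well-cached files. A user-space sketch of that run-length grouping over an (index, flags) array (struct names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

struct pg    { unsigned long index; unsigned long flags; };
struct range { unsigned long start; unsigned long len; unsigned long flags; };

/* Fold consecutive pages (index advancing by one, same flags) into
 * (start, len) ranges. Returns the number of ranges written to out[],
 * which is assumed large enough (n entries worst case). */
static size_t group_ranges(const struct pg *pages, size_t n, struct range *out)
{
	size_t nranges = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (nranges &&
		    pages[i].index == out[nranges - 1].start + out[nranges - 1].len &&
		    pages[i].flags == out[nranges - 1].flags) {
			out[nranges - 1].len++;	/* extend the current run */
		} else {
			out[nranges].start = pages[i].index;
			out[nranges].len = 1;
			out[nranges].flags = pages[i].flags;
			nranges++;
		}
	}
	return nranges;
}
```

The kernel version additionally compares page refcounts (pages_similiar) and batches the radix-tree lookups PAGEVEC_SIZE at a time, but the grouping rule is the same.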
> +
> +static int pg_show(struct seq_file *m, void *v)
> +{
> + struct session *s = m->private;
> + struct file *file = s->query_file;
> + pgoff_t offset;
> +
> + if (!file)
> + return ii_show(m, v);
> +
> + offset = *(loff_t *) v;
> +
> + if (!offset) { /* print header */
> + int i;
> +
> + seq_puts(m, "# file ");
> + seq_path(m, &file->f_path, " \t\n\\");
> +
> + seq_puts(m, "\n# flags");
> + for (i = 0; i < ARRAY_SIZE(page_flag); i++)
> + seq_printf(m, " %s", page_flag[i].name);
> +
> + seq_puts(m, "\n# idx\tlen\tstate\t\trefcnt\n");
> + }
> +
> + s->start_offset = offset;
> + s->next_offset = show_file_cache(m, file->f_mapping, offset);
> +
> + return 0;
> +}
> +
> +static void *file_pos(struct file *file, loff_t *pos)
> +{
> + loff_t size = i_size_read(file->f_mapping->host);
> + pgoff_t end = DIV_ROUND_UP(size, PAGE_CACHE_SIZE);
> + pgoff_t offset = *pos;
> +
> + return offset < end ? pos : NULL;
> +}
> +
> +static void *pg_start(struct seq_file *m, loff_t *pos)
> +{
> + struct session *s = m->private;
> + struct file *file = s->query_file;
> + pgoff_t offset = *pos;
> +
> + if (!file)
> + return ii_start(m, pos);
> +
> + rcu_read_lock();
> +
> + if (offset - s->start_offset == 1)
> + *pos = s->next_offset;
> + return file_pos(file, pos);
> +}
> +
> +static void *pg_next(struct seq_file *m, void *v, loff_t *pos)
> +{
> + struct session *s = m->private;
> + struct file *file = s->query_file;
> +
> + if (!file)
> + return ii_next(m, v, pos);
> +
> + *pos = s->next_offset;
> + return file_pos(file, pos);
> +}
> +
> +static void pg_stop(struct seq_file *m, void *v)
> +{
> + struct session *s = m->private;
> + struct file *file = s->query_file;
> +
> + if (!file)
> + return ii_stop(m, v);
> +
> + rcu_read_unlock();
> +}
> +
> +struct seq_operations seq_filecache_op = {
> + .start = pg_start,
> + .next = pg_next,
> + .stop = pg_stop,
> + .show = pg_show,
> +};
> +
> +/*
> + * Implement the manual drop-all-pagecache function
> + */
> +
> +#define MAX_INODES (PAGE_SIZE / sizeof(struct inode *))
> +static int drop_pagecache(void)
> +{
> + struct hlist_head *head;
> + struct hlist_node *node;
> + struct inode *inode;
> + struct inode **inodes;
> + unsigned long i, j, k;
> + int err = 0;
> +
> + inodes = (struct inode **)__get_free_pages(GFP_KERNEL, IWIN_PAGE_ORDER);
> + if (!inodes)
> + return -ENOMEM;
> +
> + for (i = 0; (head = get_inode_hash_budget(i)); i++) {
> + if (hlist_empty(head))
> + continue;
> +
> + j = 0;
> + cond_resched();
> +
> + /*
> + * Grab some inodes.
> + */
> + spin_lock(&inode_lock);
> + hlist_for_each (node, head) {
> + inode = hlist_entry(node, struct inode, i_hash);
> + if (!atomic_read(&inode->i_count))
> + continue;
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> + continue;
> + if (!inode->i_mapping || !inode->i_mapping->nrpages)
> + continue;
> + __iget(inode);
> + inodes[j++] = inode;
> + if (j >= MAX_INODES)
> + break;
> + }
> + spin_unlock(&inode_lock);
> +
> + /*
> + * Free clean pages.
> + */
> + for (k = 0; k < j; k++) {
> + inode = inodes[k];
> + invalidate_mapping_pages(inode->i_mapping, 0, ~1);
> + iput(inode);
> + }
> +
> + /*
> + * Simply ignore the remaining inodes.
> + */
> + if (j >= MAX_INODES && !err) {
> + printk(KERN_WARNING
> + "Too many collisions in inode hash table.\n"
> + "Please boot with a larger ihash_entries=XXX.\n");
> + err = -EAGAIN;
> + }
> + }
> +
> + free_pages((unsigned long) inodes, IWIN_PAGE_ORDER);
> + return err;
> +}
> +
> +static void drop_slabcache(void)
> +{
> + int nr_objects;
> +
> + do {
> + nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
> + } while (nr_objects > 10);
> +}
> +
> +/*
> + * Proc file operations.
> + */
> +
> +static int filecache_open(struct inode *inode, struct file *proc_file)
> +{
> + struct seq_file *m;
> + struct session *s;
> + unsigned size;
> + char *buf = NULL;
> + int ret;
> +
> + if (!try_module_get(THIS_MODULE))
> + return -ENOENT;
> +
> + s = session_create();
> + if (IS_ERR(s)) {
> + ret = PTR_ERR(s);
> + s = NULL; /* avoid kfree() on an ERR_PTR below */
> + goto out;
> + }
> + set_session(proc_file, s);
> +
> + size = SBUF_SIZE;
> + buf = kmalloc(size, GFP_KERNEL);
> + if (!buf) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ret = seq_open(proc_file, &seq_filecache_op);
> + if (!ret) {
> + m = proc_file->private_data;
> + m->private = s;
> + m->buf = buf;
> + m->size = size;
> + }
> +
> +out:
> + if (ret) {
> + kfree(s);
> + kfree(buf);
> + module_put(THIS_MODULE);
> + }
> + return ret;
> +}
> +
> +static int filecache_release(struct inode *inode, struct file *proc_file)
> +{
> + struct session *s = get_session(proc_file);
> + int ret;
> +
> + session_release(s);
> + ret = seq_release(inode, proc_file);
> + module_put(THIS_MODULE);
> + return ret;
> +}
> +
> +ssize_t filecache_write(struct file *proc_file, const char __user * buffer,
> + size_t count, loff_t *ppos)
> +{
> + struct session *s;
> + char *name;
> + int err = 0;
> +
> + if (count >= PATH_MAX + 5)
> + return -ENAMETOOLONG;
> +
> + name = kmalloc(count+1, GFP_KERNEL);
> + if (!name)
> + return -ENOMEM;
> +
> + if (copy_from_user(name, buffer, count)) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + /* strip the optional newline */
> + if (count && name[count-1] == '\n')
> + name[count-1] = '\0';
> + else
> + name[count] = '\0';
> +
> + s = get_session(proc_file);
> + if (!strcmp(name, "set private")) {
> + s->private_session = 1;
> + goto out;
> + }
> +
> + if (!strncmp(name, "cat ", 4)) {
> + err = session_update_file(s, name+4);
> + goto out;
> + }
> +
> + if (!strncmp(name, "ls", 2)) {
> + err = session_update_file(s, NULL);
> + if (!err)
> + err = ls_parse_options(name+2, s);
> + if (!err && !s->private_session) {
> + global_session.ls_dev = s->ls_dev;
> + global_session.ls_options = s->ls_options;
> + }
> + goto out;
> + }
> +
> + if (!strncmp(name, "drop pagecache", 14)) {
> + err = drop_pagecache();
> + goto out;
> + }
> +
> + if (!strncmp(name, "drop slabcache", 14)) {
> + drop_slabcache();
> + goto out;
> + }
> +
> + /* err = -EINVAL; */
> + err = session_update_file(s, name);
> +
> +out:
> + kfree(name);
> +
> + return err ? err : count;
> +}
> +
> +static struct file_operations proc_filecache_fops = {
> + .owner = THIS_MODULE,
> + .open = filecache_open,
> + .release = filecache_release,
> + .write = filecache_write,
> + .read = seq_read,
> + .llseek = seq_lseek,
> +};
> +
> +
> +static __init int filecache_init(void)
> +{
> + int i;
> + struct proc_dir_entry *entry;
> +
> + entry = create_proc_entry("filecache", 0600, NULL);
> + if (entry)
> + entry->proc_fops = &proc_filecache_fops;
> +
> + for (page_mask = i = 0; i < ARRAY_SIZE(page_flag); i++)
> + if (!page_flag[i].faked)
> + page_mask |= page_flag[i].mask;
> +
> + return 0;
> +}
> +
> +static void filecache_exit(void)
> +{
> + remove_proc_entry("filecache", NULL);
> + if (global_session.query_file)
> + fput(global_session.query_file);
> +}
> +
> +MODULE_AUTHOR("Fengguang Wu <wfg@mail.ustc.edu.cn>");
> +MODULE_LICENSE("GPL");
> +
> +module_init(filecache_init);
> +module_exit(filecache_exit);
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -775,6 +775,11 @@ struct inode {
> void *i_security;
> #endif
> void *i_private; /* fs or device private pointer */
> +
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> + unsigned int i_access_count; /* opened how many times? */
> + char i_comm[16]; /* opened first by which app? */
> +#endif
> };
>
> /*
> @@ -860,6 +865,13 @@ static inline unsigned imajor(const stru
> return MAJOR(inode->i_rdev);
> }
>
> +static inline void inode_accessed(struct inode *inode)
> +{
> +#ifdef CONFIG_PROC_FILECACHE_EXTRAS
> + inode->i_access_count++;
> +#endif
> +}
> +
> extern struct block_device *I_BDEV(struct inode *inode);
>
> struct fown_struct {
> @@ -2171,6 +2183,7 @@ extern void remove_inode_hash(struct ino
> static inline void insert_inode_hash(struct inode *inode) {
> __insert_inode_hash(inode, inode->i_ino);
> }
> +struct hlist_head *get_inode_hash_budget(unsigned long index);
>
> extern struct file * get_empty_filp(void);
> extern void file_move(struct file *f, struct list_head *list);
> --- linux-2.6.orig/fs/open.c
> +++ linux-2.6/fs/open.c
> @@ -842,6 +842,7 @@ static struct file *__dentry_open(struct
> goto cleanup_all;
> }
>
> + inode_accessed(inode);
> f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
>
> file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
> --- linux-2.6.orig/fs/Kconfig
> +++ linux-2.6/fs/Kconfig
> @@ -265,4 +265,34 @@ endif
> source "fs/nls/Kconfig"
> source "fs/dlm/Kconfig"
>
> +config PROC_FILECACHE
> + tristate "/proc/filecache support"
> + default m
> + depends on PROC_FS
> + help
> + This option creates a file /proc/filecache which enables one to
> + query/drop the cached files in memory.
> +
> + A quick start guide:
> +
> + # echo 'ls' > /proc/filecache
> + # head /proc/filecache
> +
> + # echo 'cat /bin/bash' > /proc/filecache
> + # head /proc/filecache
> +
> + # echo 'drop pagecache' > /proc/filecache
> + # echo 'drop slabcache' > /proc/filecache
> +
> + For more details, please check Documentation/filesystems/proc.txt .
> +
> + It can be a handy tool for sysadmins and desktop users.
> +
> +config PROC_FILECACHE_EXTRAS
> + bool "track extra states"
> + default y
> + depends on PROC_FILECACHE
> + help
> + Track extra states that cost a little more time/space.
> +
> endmenu
> --- linux-2.6.orig/fs/proc/Makefile
> +++ linux-2.6/fs/proc/Makefile
> @@ -2,7 +2,8 @@
> # Makefile for the Linux proc filesystem routines.
> #
>
> -obj-$(CONFIG_PROC_FS) += proc.o
> +obj-$(CONFIG_PROC_FS) += proc.o
> +obj-$(CONFIG_PROC_FILECACHE) += filecache.o
>
> proc-y := nommu.o task_nommu.o
> proc-$(CONFIG_MMU) := mmu.o task_mmu.o
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
2009-05-17 13:55 ` Frederic Weisbecker
@ 2009-05-17 14:12 ` Wu Fengguang
-1 siblings, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-05-17 14:12 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Ingo Molnar, Li Zefan, Tom Zanussi, KOSAKI Motohiro,
Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
On Sun, May 17, 2009 at 09:55:12PM +0800, Frederic Weisbecker wrote:
> On Sun, May 17, 2009 at 09:36:59PM +0800, Wu Fengguang wrote:
> > On Tue, May 12, 2009 at 09:01:12PM +0800, Frederic Weisbecker wrote:
> > > On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> > > > On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> > > >
> > > > There are two possible challenges for the conversion:
> > > >
> > > > - One trick it does is to select different lists to traverse on
> > > > different filter options. Will this be possible in the object
> > > > tracing framework?
> > >
> > > Yeah, I guess.
> >
> > Great.
> >
> > >
> > > > - The file name lookup(last field) is the performance killer. Is it
> > > > possible to skip the file name lookup when the filter failed on the
> > > > leading fields?
> > >
> > > objects collection lays on trace events where filters basically ignore
> > > a whole entry in case of non-matching. Not sure if we can easily only
> > > ignore one field.
> > >
> > > But I guess we can do something about the performances...
> >
> > OK, but it's not as important as the previous requirement, so it could
> > be the last thing to work on :)
> >
> > > Could you send us the (sob'ed) patch you made which implements this.
> > > I could try to adapt it to object collection.
> >
> > Attached for your reference. Be aware that I still have plans to
> > change it in non trivial way, and there are ongoing works by Nick(on
> > inode_lock) and Jens(on s_dirty) that can create merge conflicts.
> > So basically it is not a right time to do the adaption.
>
>
> Ah ok, so I will wait a bit :-)
>
>
> > However we can still do something to polish up the page object
> > collection under /debug/tracing/objects/mm/pages/. For example,
> > the timestamps and function name could be removed from the following
> > list :)
> >
> > # tracer: nop
> > #
> > # TASK-PID CPU# TIMESTAMP FUNCTION
> > # | | | | |
> > <...>-3743 [001] 3035.649769: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0
> > <...>-3743 [001] 3044.176403: dump_pages: pfn=1 flags=400 count=1 mapcount=0 index=0
> > <...>-3743 [001] 3044.176407: dump_pages: pfn=2 flags=400 count=1 mapcount=0 index=0
> > <...>-3743 [001] 3044.176408: dump_pages: pfn=3 flags=400 count=1 mapcount=0 index=0
> > <...>-3743 [001] 3044.176409: dump_pages: pfn=4 flags=400 count=1 mapcount=0 index=0
> > <...>-3743 [001] 3044.176409: dump_pages: pfn=5 flags=400 count=1 mapcount=0 index=0
> > <...>-3743 [001] 3044.176410: dump_pages: pfn=6 flags=400 count=1 mapcount=0 index=0
> > <...>-3743 [001] 3044.176410: dump_pages: pfn=7 flags=400 count=1 mapcount=0 index=0
> > <...>-3743 [001] 3044.176411: dump_pages: pfn=8 flags=400 count=1 mapcount=0 index=0
> > <...>-3743 [001] 3044.176411: dump_pages: pfn=9 flags=400 count=1 mapcount=0 index=0
> > <...>-3743 [001] 3044.176412: dump_pages: pfn=10 flags=400 count=1 mapcount=0 index=0
>
>
> echo nocontext-info > /debug/tracing/trace_options :-)
Nice tip - I should really learn more about ftrace :-)
> But you'll have only the function and the pages specifics. It's not really the
> function but more specifically the name of the event. It's useful to distinguish
> multiple events to a trace.
>
> Hmm, may be it's not that much useful in a object dump...
Yeah - and to enable that option automatically in relevant code :)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
2009-05-17 13:36 ` Wu Fengguang
@ 2009-05-18 11:44 ` KOSAKI Motohiro
2009-05-18 11:44 ` KOSAKI Motohiro
1 sibling, 0 replies; 137+ messages in thread
From: KOSAKI Motohiro @ 2009-05-18 11:44 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro, Frederic Weisbecker, Ingo Molnar, Li Zefan,
Tom Zanussi, Pekka Enberg, Andi Kleen, Steven Rostedt,
Larry Woodman, Peter Zijlstra, Eduard - Gabriel Munteanu,
Andrew Morton, LKML, Matt Mackall, Alexey Dobriyan, linux-mm
Hi
> > Could you send us the (sob'ed) patch you made which implements this.
> > I could try to adapt it to object collection.
>
> Attached for your reference. Be aware that I still have plans to
> change it in non trivial way, and there are ongoing works by Nick(on
> inode_lock) and Jens(on s_dirty) that can create merge conflicts.
> So basically it is not a right time to do the adaption.
if you can make an object collection based filecache viewer, could you
please cc me? I guess I can review the mm part.
thanks.
^ permalink raw reply [flat|nested] 137+ messages in thread
* Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
2009-05-18 11:44 ` KOSAKI Motohiro
@ 2009-05-18 11:47 ` Wu Fengguang
1 sibling, 0 replies; 137+ messages in thread
From: Wu Fengguang @ 2009-05-18 11:47 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Frederic Weisbecker, Ingo Molnar, Li Zefan, Tom Zanussi,
Pekka Enberg, Andi Kleen, Steven Rostedt, Larry Woodman,
Peter Zijlstra, Eduard - Gabriel Munteanu, Andrew Morton, LKML,
Matt Mackall, Alexey Dobriyan, linux-mm
On Mon, May 18, 2009 at 07:44:21PM +0800, KOSAKI Motohiro wrote:
> Hi
>
>
> > > Could you send us the (sob'ed) patch you made which implements this.
> > > I could try to adapt it to object collection.
> >
> > Attached for your reference. Be aware that I still have plans to
> > change it in a non-trivial way, and there is ongoing work by Nick (on
> > inode_lock) and Jens (on s_dirty) that can create merge conflicts.
> > So basically it is not the right time to do the adaptation.
>
> if you can make an object-collection-based filecache viewer, could you
> please cc me? I guess I can review the mm part.
OK, thank you! I should be able to work on it next month.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 137+ messages in thread
end of thread, other threads:[~2009-05-18 11:47 UTC | newest]
Thread overview: 137+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-28 1:09 [PATCH 0/5] proc: export more page flags in /proc/kpageflags (take 4) Wu Fengguang
2009-04-28 1:09 ` Wu Fengguang
2009-04-28 1:09 ` [PATCH 1/5] pagemap: document clarifications Wu Fengguang
2009-04-28 1:09 ` Wu Fengguang
2009-04-28 7:11 ` Tommi Rantala
2009-04-28 7:11 ` Tommi Rantala
2009-04-28 1:09 ` [PATCH 2/5] pagemap: documentation 9 more exported page flags Wu Fengguang
2009-04-28 1:09 ` Wu Fengguang
2009-04-28 1:09 ` [PATCH 3/5] mm: introduce PageHuge() for testing huge/gigantic pages Wu Fengguang
2009-04-28 1:09 ` Wu Fengguang
2009-04-28 1:09 ` [PATCH 4/5] proc: kpagecount/kpageflags code cleanup Wu Fengguang
2009-04-28 1:09 ` Wu Fengguang
2009-04-28 1:09 ` [PATCH 5/5] proc: export more page flags in /proc/kpageflags Wu Fengguang
2009-04-28 1:09 ` Wu Fengguang
2009-04-28 6:55 ` Ingo Molnar
2009-04-28 6:55 ` Ingo Molnar
2009-04-28 7:40 ` Andi Kleen
2009-04-28 7:40 ` Andi Kleen
2009-04-28 9:04 ` Pekka Enberg
2009-04-28 9:04 ` Pekka Enberg
2009-04-28 9:10 ` Andi Kleen
2009-04-28 9:10 ` Andi Kleen
2009-04-28 9:15 ` Pekka Enberg
2009-04-28 9:15 ` Pekka Enberg
2009-04-28 9:15 ` Ingo Molnar
2009-04-28 9:15 ` Ingo Molnar
2009-04-28 9:19 ` Pekka Enberg
2009-04-28 9:19 ` Pekka Enberg
2009-04-28 9:25 ` Pekka Enberg
2009-04-28 9:25 ` Pekka Enberg
2009-04-28 9:36 ` Wu Fengguang
2009-04-28 9:36 ` Wu Fengguang
2009-04-28 9:36 ` Ingo Molnar
2009-04-28 9:36 ` Ingo Molnar
2009-04-28 9:57 ` Pekka Enberg
2009-04-28 9:57 ` Pekka Enberg
2009-04-28 10:10 ` KOSAKI Motohiro
2009-04-28 10:10 ` KOSAKI Motohiro
2009-04-28 10:21 ` Pekka Enberg
2009-04-28 10:21 ` Pekka Enberg
2009-04-28 10:56 ` Ingo Molnar
2009-04-28 10:56 ` Ingo Molnar
2009-04-28 11:09 ` KOSAKI Motohiro
2009-04-28 11:09 ` KOSAKI Motohiro
2009-04-28 12:42 ` Ingo Molnar
2009-04-28 12:42 ` Ingo Molnar
2009-04-28 11:03 ` Ingo Molnar
2009-04-28 11:03 ` Ingo Molnar
2009-04-28 17:42 ` Matt Mackall
2009-04-28 17:42 ` Matt Mackall
2009-04-28 9:29 ` Ingo Molnar
2009-04-28 9:29 ` Ingo Molnar
2009-04-28 9:34 ` KOSAKI Motohiro
2009-04-28 9:34 ` KOSAKI Motohiro
2009-04-28 9:38 ` Ingo Molnar
2009-04-28 9:38 ` Ingo Molnar
2009-04-28 9:55 ` Wu Fengguang
2009-04-28 9:55 ` Wu Fengguang
2009-04-28 10:11 ` KOSAKI Motohiro
2009-04-28 10:11 ` KOSAKI Motohiro
2009-04-28 11:05 ` Ingo Molnar
2009-04-28 11:05 ` Ingo Molnar
2009-04-28 11:36 ` Wu Fengguang
2009-04-28 11:36 ` Wu Fengguang
2009-04-28 12:17 ` [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags) Ingo Molnar
2009-04-28 12:17 ` Ingo Molnar
2009-04-28 13:31 ` Wu Fengguang
2009-04-28 13:31 ` Wu Fengguang
2009-05-12 13:01 ` Frederic Weisbecker
2009-05-12 13:01 ` Frederic Weisbecker
2009-05-17 13:36 ` Wu Fengguang
2009-05-17 13:55 ` Frederic Weisbecker
2009-05-17 13:55 ` Frederic Weisbecker
2009-05-17 14:12 ` Wu Fengguang
2009-05-17 14:12 ` Wu Fengguang
2009-05-18 11:44 ` KOSAKI Motohiro
2009-05-18 11:44 ` KOSAKI Motohiro
2009-05-18 11:47 ` Wu Fengguang
2009-05-18 11:47 ` Wu Fengguang
2009-04-28 10:18 ` [PATCH 5/5] proc: export more page flags in /proc/kpageflags Andi Kleen
2009-04-28 10:18 ` Andi Kleen
2009-04-28 8:33 ` Wu Fengguang
2009-04-28 8:33 ` Wu Fengguang
2009-04-28 9:24 ` Ingo Molnar
2009-04-28 9:24 ` Ingo Molnar
2009-04-28 18:11 ` Tony Luck
2009-04-28 18:11 ` Tony Luck
2009-04-28 18:34 ` Matt Mackall
2009-04-28 18:34 ` Matt Mackall
2009-04-28 20:47 ` Tony Luck
2009-04-28 20:47 ` Tony Luck
2009-04-28 20:54 ` Andi Kleen
2009-04-28 20:54 ` Andi Kleen
2009-04-28 20:59 ` Matt Mackall
2009-04-28 20:59 ` Matt Mackall
2009-04-28 21:17 ` Andrew Morton
2009-04-28 21:17 ` Andrew Morton
2009-04-28 21:49 ` Matt Mackall
2009-04-28 21:49 ` Matt Mackall
2009-04-29 0:02 ` Robin Holt
2009-04-29 0:02 ` Robin Holt
2009-04-28 17:49 ` Matt Mackall
2009-04-28 17:49 ` Matt Mackall
2009-04-29 8:05 ` Wu Fengguang
2009-04-29 8:05 ` Wu Fengguang
2009-04-29 19:13 ` Matt Mackall
2009-04-29 19:13 ` Matt Mackall
2009-04-30 1:00 ` Wu Fengguang
2009-04-30 1:00 ` Wu Fengguang
2009-04-28 21:32 ` Andrew Morton
2009-04-28 21:32 ` Andrew Morton
2009-04-28 22:46 ` Matt Mackall
2009-04-28 22:46 ` Matt Mackall
2009-04-28 23:02 ` Andrew Morton
2009-04-28 23:02 ` Andrew Morton
2009-04-28 23:31 ` Matt Mackall
2009-04-28 23:31 ` Matt Mackall
2009-04-28 23:42 ` Andrew Morton
2009-04-28 23:42 ` Andrew Morton
2009-04-28 23:55 ` Matt Mackall
2009-04-28 23:55 ` Matt Mackall
2009-04-29 3:33 ` Wu Fengguang
2009-04-29 3:33 ` Wu Fengguang
2009-04-29 2:38 ` Wu Fengguang
2009-04-29 2:38 ` Wu Fengguang
2009-04-29 2:55 ` Andrew Morton
2009-04-29 2:55 ` Andrew Morton
2009-04-29 3:48 ` Wu Fengguang
2009-04-29 3:48 ` Wu Fengguang
2009-04-29 5:09 ` Wu Fengguang
2009-04-29 5:09 ` Wu Fengguang
2009-04-29 4:41 ` Nathan Lynch
2009-04-29 4:41 ` Nathan Lynch
2009-04-29 4:41 ` Nathan Lynch
2009-04-29 4:50 ` Andrew Morton
2009-04-29 4:50 ` Andrew Morton
2009-04-29 4:50 ` Andrew Morton