linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] pagecache scanning with /proc/kpagecache
@ 2014-05-21  2:26 Naoya Horiguchi
  2014-05-21  2:26 ` [PATCH 1/4] radix-tree: add end_index to support ranged iteration Naoya Horiguchi
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2014-05-21  2:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov

This patchset adds a new procfs interface to extrace information about
pagecache status. In-kernel tool tools/vm/page-types.c has already some
code for pagecache scanning without kernel's help, but it's not free
from measurement-disturbance, so here I'm suggesting another approach.

Patch 1/4 changes radix tree API to support ranged iteration as a preparation
for patch 2/4 which adds /proc/kpagecache. Patch 3/4 changes page-types to
use the interface in file scanning mode. Patch 4/4 is documentation update.

This patchset were previously posted as a part of memory error reporting
patchset (http://lwn.net/Articles/590690/), so changelogs in individual
patches are changes from that version.

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (4):
      radix-tree: add end_index to support ranged iteration
      fs/proc/page.c: introduce /proc/kpagecache interface
      tools/vm/page-types.c: rework on file cache scanning mode
      Documentation: update Documentation/vm/pagemap.txt

 Documentation/vm/pagemap.txt  |  29 +++++
 drivers/gpu/drm/qxl/qxl_ttm.c |   2 +-
 fs/proc/page.c                | 105 ++++++++++++++++
 include/linux/fs.h            |   9 +-
 include/linux/radix-tree.h    |  27 +++--
 kernel/irq/irqdomain.c        |   2 +-
 lib/radix-tree.c              |   8 +-
 mm/filemap.c                  |   8 +-
 tools/vm/page-types.c         | 276 +++++++++++++++++-------------------------
 9 files changed, 284 insertions(+), 182 deletions(-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] radix-tree: add end_index to support ranged iteration
  2014-05-21  2:26 [PATCH 0/4] pagecache scanning with /proc/kpagecache Naoya Horiguchi
@ 2014-05-21  2:26 ` Naoya Horiguchi
  2014-05-21  8:21   ` Konstantin Khlebnikov
  2014-05-21  2:26 ` [PATCH 2/4] fs/proc/page.c: introduce /proc/kpagecache interface Naoya Horiguchi
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Naoya Horiguchi @ 2014-05-21  2:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov

It's useful if we can run only over a specific index range of radix trees,
which this patch does. This patch changes only radix_tree_for_each_slot()
and radix_tree_for_each_tagged(), because we need it only for them for now.

ChangeLog:
- rebased onto v3.15-rc5, which has e a few new caller of radix_tree_for_each_slot(),
  so apply this change them too.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 drivers/gpu/drm/qxl/qxl_ttm.c |  2 +-
 include/linux/radix-tree.h    | 27 ++++++++++++++++++++-------
 kernel/irq/irqdomain.c        |  2 +-
 lib/radix-tree.c              |  8 ++++----
 mm/filemap.c                  |  8 ++++----
 5 files changed, 30 insertions(+), 17 deletions(-)

diff --git v3.15-rc5.orig/drivers/gpu/drm/qxl/qxl_ttm.c v3.15-rc5/drivers/gpu/drm/qxl/qxl_ttm.c
index d52c27527b9a..d807e66fe308 100644
--- v3.15-rc5.orig/drivers/gpu/drm/qxl/qxl_ttm.c
+++ v3.15-rc5/drivers/gpu/drm/qxl/qxl_ttm.c
@@ -398,7 +398,7 @@ static int qxl_sync_obj_wait(void *sync_obj,
 		struct radix_tree_iter iter;
 		int release_id;
 
-		radix_tree_for_each_slot(slot, &qfence->tree, &iter, 0) {
+		radix_tree_for_each_slot(slot, &qfence->tree, &iter, 0, ~0UL) {
 			struct qxl_release *release;
 
 			release_id = iter.index;
diff --git v3.15-rc5.orig/include/linux/radix-tree.h v3.15-rc5/include/linux/radix-tree.h
index 33170dbd9db4..88258c34371f 100644
--- v3.15-rc5.orig/include/linux/radix-tree.h
+++ v3.15-rc5/include/linux/radix-tree.h
@@ -312,6 +312,7 @@ static inline void radix_tree_preload_end(void)
  * @index:	index of current slot
  * @next_index:	next-to-last index for this chunk
  * @tags:	bit-mask for tag-iterating
+ * @end_index:  last index to be scanned
  *
  * This radix tree iterator works in terms of "chunks" of slots.  A chunk is a
  * subinterval of slots contained within one radix tree leaf node.  It is
@@ -324,6 +325,7 @@ struct radix_tree_iter {
 	unsigned long	index;
 	unsigned long	next_index;
 	unsigned long	tags;
+	unsigned long	end_index;
 };
 
 #define RADIX_TREE_ITER_TAG_MASK	0x00FF	/* tag index in lower byte */
@@ -335,10 +337,12 @@ struct radix_tree_iter {
  *
  * @iter:	pointer to iterator state
  * @start:	iteration starting index
+ * @end:	iteration ending index
  * Returns:	NULL
  */
 static __always_inline void **
-radix_tree_iter_init(struct radix_tree_iter *iter, unsigned long start)
+radix_tree_iter_init(struct radix_tree_iter *iter, unsigned long start,
+			unsigned long end)
 {
 	/*
 	 * Leave iter->tags uninitialized. radix_tree_next_chunk() will fill it
@@ -350,6 +354,7 @@ radix_tree_iter_init(struct radix_tree_iter *iter, unsigned long start)
 	 */
 	iter->index = 0;
 	iter->next_index = start;
+	iter->end_index = end;
 	return NULL;
 }
 
@@ -399,6 +404,8 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
 		iter->tags >>= 1;
 		if (likely(iter->tags & 1ul)) {
 			iter->index++;
+			if (iter->index > iter->end_index)
+				return NULL;
 			return slot + 1;
 		}
 		if (!(flags & RADIX_TREE_ITER_CONTIG) && likely(iter->tags)) {
@@ -406,6 +413,8 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
 
 			iter->tags >>= offset;
 			iter->index += offset + 1;
+			if (iter->index > iter->end_index)
+				return NULL;
 			return slot + offset + 1;
 		}
 	} else {
@@ -414,6 +423,8 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
 		while (size--) {
 			slot++;
 			iter->index++;
+			if (iter->index > iter->end_index)
+				return NULL;
 			if (likely(*slot))
 				return slot;
 			if (flags & RADIX_TREE_ITER_CONTIG) {
@@ -438,7 +449,7 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
  * Locks can be released and reacquired between iterations.
  */
 #define radix_tree_for_each_chunk(slot, root, iter, start, flags)	\
-	for (slot = radix_tree_iter_init(iter, start) ;			\
+	for (slot = radix_tree_iter_init(iter, start, ~0UL) ;		\
 	      (slot = radix_tree_next_chunk(root, iter, flags)) ;)
 
 /**
@@ -461,11 +472,12 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
  * @root:	the struct radix_tree_root pointer
  * @iter:	the struct radix_tree_iter pointer
  * @start:	iteration starting index
+ * @end:	iteration ending index
  *
  * @slot points to radix tree slot, @iter->index contains its index.
  */
-#define radix_tree_for_each_slot(slot, root, iter, start)		\
-	for (slot = radix_tree_iter_init(iter, start) ;			\
+#define radix_tree_for_each_slot(slot, root, iter, start, end)		\
+	for (slot = radix_tree_iter_init(iter, start, end) ;		\
 	     slot || (slot = radix_tree_next_chunk(root, iter, 0)) ;	\
 	     slot = radix_tree_next_slot(slot, iter, 0))
 
@@ -480,7 +492,7 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
  * @slot points to radix tree slot, @iter->index contains its index.
  */
 #define radix_tree_for_each_contig(slot, root, iter, start)		\
-	for (slot = radix_tree_iter_init(iter, start) ;			\
+	for (slot = radix_tree_iter_init(iter, start, ~0UL) ;		\
 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
 				RADIX_TREE_ITER_CONTIG)) ;		\
 	     slot = radix_tree_next_slot(slot, iter,			\
@@ -493,12 +505,13 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
  * @root:	the struct radix_tree_root pointer
  * @iter:	the struct radix_tree_iter pointer
  * @start:	iteration starting index
+ * @end:	iteration ending index
  * @tag:	tag index
  *
  * @slot points to radix tree slot, @iter->index contains its index.
  */
-#define radix_tree_for_each_tagged(slot, root, iter, start, tag)	\
-	for (slot = radix_tree_iter_init(iter, start) ;			\
+#define radix_tree_for_each_tagged(slot, root, iter, start, end, tag)	\
+	for (slot = radix_tree_iter_init(iter, start, end) ;		\
 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
 	     slot = radix_tree_next_slot(slot, iter,			\
diff --git v3.15-rc5.orig/kernel/irq/irqdomain.c v3.15-rc5/kernel/irq/irqdomain.c
index f14033700c25..55fc49b412e1 100644
--- v3.15-rc5.orig/kernel/irq/irqdomain.c
+++ v3.15-rc5/kernel/irq/irqdomain.c
@@ -571,7 +571,7 @@ static int virq_debug_show(struct seq_file *m, void *private)
 	mutex_lock(&irq_domain_mutex);
 	list_for_each_entry(domain, &irq_domain_list, link) {
 		int count = 0;
-		radix_tree_for_each_slot(slot, &domain->revmap_tree, &iter, 0)
+		radix_tree_for_each_slot(slot, &domain->revmap_tree, &iter, 0, ~0UL)
 			count++;
 		seq_printf(m, "%c%-16s  %6u  %10u  %10u  %s\n",
 			   domain == irq_default_domain ? '*' : ' ', domain->name,
diff --git v3.15-rc5.orig/lib/radix-tree.c v3.15-rc5/lib/radix-tree.c
index 9599aa72d7a0..531fba8a81db 100644
--- v3.15-rc5.orig/lib/radix-tree.c
+++ v3.15-rc5/lib/radix-tree.c
@@ -1007,7 +1007,7 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 	if (unlikely(!max_items))
 		return 0;
 
-	radix_tree_for_each_slot(slot, root, &iter, first_index) {
+	radix_tree_for_each_slot(slot, root, &iter, first_index, ~0UL) {
 		results[ret] = indirect_to_ptr(rcu_dereference_raw(*slot));
 		if (!results[ret])
 			continue;
@@ -1049,7 +1049,7 @@ radix_tree_gang_lookup_slot(struct radix_tree_root *root,
 	if (unlikely(!max_items))
 		return 0;
 
-	radix_tree_for_each_slot(slot, root, &iter, first_index) {
+	radix_tree_for_each_slot(slot, root, &iter, first_index, ~0UL) {
 		results[ret] = slot;
 		if (indices)
 			indices[ret] = iter.index;
@@ -1086,7 +1086,7 @@ radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
 	if (unlikely(!max_items))
 		return 0;
 
-	radix_tree_for_each_tagged(slot, root, &iter, first_index, tag) {
+	radix_tree_for_each_tagged(slot, root, &iter, first_index, ~0UL, tag) {
 		results[ret] = indirect_to_ptr(rcu_dereference_raw(*slot));
 		if (!results[ret])
 			continue;
@@ -1123,7 +1123,7 @@ radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
 	if (unlikely(!max_items))
 		return 0;
 
-	radix_tree_for_each_tagged(slot, root, &iter, first_index, tag) {
+	radix_tree_for_each_tagged(slot, root, &iter, first_index, ~0UL, tag) {
 		results[ret] = slot;
 		if (++ret == max_items)
 			break;
diff --git v3.15-rc5.orig/mm/filemap.c v3.15-rc5/mm/filemap.c
index 000a220e2a41..d684e4cffe96 100644
--- v3.15-rc5.orig/mm/filemap.c
+++ v3.15-rc5/mm/filemap.c
@@ -1118,7 +1118,7 @@ unsigned find_get_entries(struct address_space *mapping,
 
 	rcu_read_lock();
 restart:
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start, ~0UL) {
 		struct page *page;
 repeat:
 		page = radix_tree_deref_slot(slot);
@@ -1180,7 +1180,7 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 
 	rcu_read_lock();
 restart:
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start, ~0UL) {
 		struct page *page;
 repeat:
 		page = radix_tree_deref_slot(slot);
@@ -1324,7 +1324,7 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 	rcu_read_lock();
 restart:
 	radix_tree_for_each_tagged(slot, &mapping->page_tree,
-				   &iter, *index, tag) {
+				   &iter, *index, ~0UL, tag) {
 		struct page *page;
 repeat:
 		page = radix_tree_deref_slot(slot);
@@ -2023,7 +2023,7 @@ void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
 	pte_t *pte;
 
 	rcu_read_lock();
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff, ~0UL) {
 		if (iter.index > vmf->max_pgoff)
 			break;
 repeat:
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/4] fs/proc/page.c: introduce /proc/kpagecache interface
  2014-05-21  2:26 [PATCH 0/4] pagecache scanning with /proc/kpagecache Naoya Horiguchi
  2014-05-21  2:26 ` [PATCH 1/4] radix-tree: add end_index to support ranged iteration Naoya Horiguchi
@ 2014-05-21  2:26 ` Naoya Horiguchi
  2014-05-21  2:26 ` [PATCH 3/4] tools/vm/page-types.c: rework on file cache scanning mode Naoya Horiguchi
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2014-05-21  2:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov

/proc/pid/pagemap is one of powerful analyzing and testing features about
page mapping. This is also useful to know about page status combined with
/proc/kpageflag or /proc/kpagecount. One missing is the similar interface to
scan over pagecache of a given file without opening it or mapping it to
virtual address, which could impact other workloads. So this patch provides it.

Usage is simple: 1) write a file path to be scanned into the interface,
and 2) read 64-bit entries, each of which is associated with the page on
each page index.

Good in-kernel tree example is tools/vm/page-types.c (some code added on
it in the later patch.)

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/proc/page.c     | 105 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |   9 +++--
 2 files changed, 111 insertions(+), 3 deletions(-)

diff --git v3.15-rc5.orig/fs/proc/page.c v3.15-rc5/fs/proc/page.c
index e647c55275d9..d6fe458016e0 100644
--- v3.15-rc5.orig/fs/proc/page.c
+++ v3.15-rc5/fs/proc/page.c
@@ -9,6 +9,8 @@
 #include <linux/seq_file.h>
 #include <linux/hugetlb.h>
 #include <linux/kernel-page-flags.h>
+#include <linux/path.h>
+#include <linux/namei.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -212,10 +214,113 @@ static const struct file_operations proc_kpageflags_operations = {
 	.read = kpageflags_read,
 };
 
+static struct path kpagecache_path;
+
+#define KPC_TAGS_BITS	__NR_PAGECACHE_TAGS
+#define KPC_TAGS_OFFSET	(64 - KPC_TAGS_BITS)
+#define KPC_TAGS_MASK	(((1LL << KPC_TAGS_BITS) - 1) << KPC_TAGS_OFFSET)
+#define KPC_TAGS(bits)	(((bits) << KPC_TAGS_OFFSET) & KPC_TAGS_MASK)
+/* a few bits remaining between two fields. */
+#define KPC_PFN_BITS	(64 - PAGE_CACHE_SHIFT)
+#define KPC_PFN_MASK	((1LL << KPC_PFN_BITS) - 1)
+#define KPC_PFN(pfn)	((pfn) & KPC_PFN_MASK)
+
+static u64 get_pagecache_tags(struct radix_tree_root *root, unsigned long index)
+{
+	int i;
+	unsigned long tags = 0;
+	for (i = 0; i < __NR_PAGECACHE_TAGS; i++)
+		if (radix_tree_tag_get(root, index, i))
+			tags |=  1 << i;
+	return KPC_TAGS(tags);
+}
+
+static ssize_t kpagecache_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	u64 __user *out = (u64 __user *)buf;
+	unsigned long src = *ppos;
+	struct address_space *mapping;
+	loff_t size;
+	pgoff_t index;
+	struct radix_tree_iter iter;
+	void **slot;
+	ssize_t ret = 0;
+
+	if (!kpagecache_path.dentry)
+		return 0;
+	if (src & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+	mapping = kpagecache_path.dentry->d_inode->i_mapping;
+	size = i_size_read(mapping->host);
+	if (!size)
+		return 0;
+	size = (size - 1) >> PAGE_CACHE_SHIFT;
+	index = src / KPMSIZE;
+	count = min_t(unsigned long, count, ((size + 1) * KPMSIZE) - src);
+
+	rcu_read_lock();
+	radix_tree_for_each_slot(slot, &mapping->page_tree,
+				 &iter, index, index + count / KPMSIZE - 1) {
+		struct page *page = radix_tree_deref_slot(slot);
+		u64 entry;
+		if (unlikely(!page))
+			continue;
+		entry = get_pagecache_tags(&mapping->page_tree, iter.index);
+		entry |= KPC_PFN(page_to_pfn(page));
+		count = (iter.index - index + 1) * KPMSIZE;
+		if (put_user(entry, out + iter.index - index))
+			break;
+	}
+	rcu_read_unlock();
+	*ppos += count;
+	if (!ret)
+		ret = count;
+	return ret;
+}
+
+static ssize_t kpagecache_write(struct file *file, const char __user *pathname,
+			       size_t count, loff_t *ppos)
+{
+	struct path path;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (!pathname) {
+		if (kpagecache_path.dentry) {
+			path_put(&kpagecache_path);
+			kpagecache_path.mnt = NULL;
+			kpagecache_path.dentry = NULL;
+		}
+		return count;
+	}
+
+	err = user_path_at(AT_FDCWD, pathname, LOOKUP_FOLLOW, &path);
+	if (err)
+		return -EINVAL;
+	if (kpagecache_path.dentry != path.dentry) {
+		path_put(&kpagecache_path);
+		kpagecache_path.mnt = path.mnt;
+		kpagecache_path.dentry = path.dentry;
+	} else
+		path_put(&path);
+	return count;
+}
+
+static const struct file_operations proc_kpagecache_operations = {
+	.llseek		= mem_lseek,
+	.read		= kpagecache_read,
+	.write		= kpagecache_write,
+};
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
 	proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+	proc_create("kpagecache", S_IRUSR|S_IWUSR, NULL,
+			&proc_kpagecache_operations);
 	return 0;
 }
 fs_initcall(proc_page_init);
diff --git v3.15-rc5.orig/include/linux/fs.h v3.15-rc5/include/linux/fs.h
index 878031227c57..5b489df9d964 100644
--- v3.15-rc5.orig/include/linux/fs.h
+++ v3.15-rc5/include/linux/fs.h
@@ -447,9 +447,12 @@ struct block_device {
  * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
  * radix trees
  */
-#define PAGECACHE_TAG_DIRTY	0
-#define PAGECACHE_TAG_WRITEBACK	1
-#define PAGECACHE_TAG_TOWRITE	2
+enum {
+	PAGECACHE_TAG_DIRTY,
+	PAGECACHE_TAG_WRITEBACK,
+	PAGECACHE_TAG_TOWRITE,
+	__NR_PAGECACHE_TAGS,
+};
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/4] tools/vm/page-types.c: rework on file cache scanning mode
  2014-05-21  2:26 [PATCH 0/4] pagecache scanning with /proc/kpagecache Naoya Horiguchi
  2014-05-21  2:26 ` [PATCH 1/4] radix-tree: add end_index to support ranged iteration Naoya Horiguchi
  2014-05-21  2:26 ` [PATCH 2/4] fs/proc/page.c: introduce /proc/kpagecache interface Naoya Horiguchi
@ 2014-05-21  2:26 ` Naoya Horiguchi
  2014-05-21  2:26 ` [PATCH 4/4] Documentation: update Documentation/vm/pagemap.txt Naoya Horiguchi
  2014-05-21 22:42 ` [PATCH 0/4] pagecache scanning with /proc/kpagecache Andrew Morton
  4 siblings, 0 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2014-05-21  2:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov

This patch reworks on the file cache scanning mode of page-types tool,
where when page-types is called with -f <filepath>, it can scan pages
in page cache tree of the specified file via /proc/kpagecache interface.

In the original implementation, it did mmap/madvise/mincore/pagemap over
page cache of the target file(s), so it gives us much measurement-disturbance.
This patch avoids this by using /proc/kpagecache.
And page-types does recursive walking when -f option specifies a directory,
which is too much, so let's keep it compact for code maintenability.
We can do the similar thing more flexibly for example by the following:

  find /tmp | \
      while read f ; do tools/vm/page-types -f $f ; done | \
      grep 0x | tr -s '\t' ' ' | awk '
    {
      label = $4;
      arr[label] = arr[label] + $2;
    }
    END {
      for ( a in arr ) {
        printf("%s %ld\n", a, arr[a]);
      }
    }
  '

This code gets page status summary of all files under /tmp, whose output
is like this:

  __RUDl________b_____________________ 2   # page count
  __RUDlA_______b_____________________ 4

ChangeLog:
- rebased onto v3.15-rc5 (resolved conflict with Konstantins patch
  commit 65a6a4105f "tools/vm/page-types.c: page-cache sniffing feature")

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 tools/vm/page-types.c | 276 +++++++++++++++++++++-----------------------------
 1 file changed, 114 insertions(+), 162 deletions(-)

diff --git v3.15-rc5.orig/tools/vm/page-types.c v3.15-rc5/tools/vm/page-types.c
index 05654f5e48d5..a0fb55489ea7 100644
--- v3.15-rc5.orig/tools/vm/page-types.c
+++ v3.15-rc5/tools/vm/page-types.c
@@ -30,14 +30,12 @@
 #include <getopt.h>
 #include <limits.h>
 #include <assert.h>
-#include <ftw.h>
-#include <time.h>
 #include <sys/types.h>
 #include <sys/errno.h>
 #include <sys/fcntl.h>
 #include <sys/mount.h>
+#include <sys/stat.h>
 #include <sys/statfs.h>
-#include <sys/mman.h>
 #include "../../include/uapi/linux/magic.h"
 #include "../../include/uapi/linux/kernel-page-flags.h"
 #include <api/fs/debugfs.h>
@@ -79,6 +77,7 @@
 
 #define KPF_BYTES		8
 #define PROC_KPAGEFLAGS		"/proc/kpageflags"
+#define PROC_KPAGECACHE		"/proc/kpagecache"
 
 /* [32-] kernel hacking assistances */
 #define KPF_RESERVED		32
@@ -162,7 +161,7 @@ static int		opt_raw;	/* for kernel developers */
 static int		opt_list;	/* list pages (in ranges) */
 static int		opt_no_summary;	/* don't show summary */
 static pid_t		opt_pid;	/* process to walk */
-const char *		opt_file;
+static char		*opt_file;	/* walk over pagecache of file */
 
 #define MAX_ADDR_RANGES	1024
 static int		nr_addr_ranges;
@@ -183,6 +182,7 @@ static int		page_size;
 
 static int		pagemap_fd;
 static int		kpageflags_fd;
+static int		kpagecache_fd;
 
 static int		opt_hwpoison;
 static int		opt_unpoison;
@@ -276,6 +276,13 @@ static unsigned long kpageflags_read(uint64_t *buf,
 	return do_u64_read(kpageflags_fd, PROC_KPAGEFLAGS, buf, index, pages);
 }
 
+static unsigned long kpagecache_read(uint64_t *buf,
+				     unsigned long index,
+				     unsigned long pages)
+{
+	return do_u64_read(kpagecache_fd, PROC_KPAGECACHE, buf, index, pages);
+}
+
 static unsigned long pagemap_read(uint64_t *buf,
 				  unsigned long index,
 				  unsigned long pages)
@@ -338,53 +345,62 @@ static char *page_flag_longname(uint64_t flags)
 	return buf;
 }
 
+#define __NR_PAGECACHE_TAGS	3
+#define KPC_TAGS_BITS	__NR_PAGECACHE_TAGS
+#define KPC_TAGS_OFFSET	(64 - KPC_TAGS_BITS)
+#define KPC_TAGS_MASK	(((1ULL << KPC_TAGS_BITS) - 1) << KPC_TAGS_OFFSET)
+#define KPC_TAGS(entry)	((entry & KPC_TAGS_MASK) >> KPC_TAGS_OFFSET)
 
 /*
  * page list and summary
  */
 
-static void show_page_range(unsigned long voffset, unsigned long offset,
-			    unsigned long size, uint64_t flags)
+static void show_page_range(unsigned long voffset,
+			unsigned long offset, uint64_t flags, uint64_t entry)
 {
 	static uint64_t      flags0;
 	static unsigned long voff;
 	static unsigned long index;
 	static unsigned long count;
+	static uint64_t	     entry0;
 
 	if (flags == flags0 && offset == index + count &&
-	    size && voffset == voff + count) {
-		count += size;
+	    (!opt_pid || voffset == voff + count) &&
+	    (!opt_file || (voffset == voff + count && entry == entry0))) {
+		count++;
 		return;
 	}
 
 	if (count) {
 		if (opt_pid)
-			printf("%lx\t", voff);
-		if (opt_file)
-			printf("%lu\t", voff);
-		printf("%lx\t%lx\t%s\n",
-				index, count, page_flag_name(flags0));
+			printf("%lx\t%lx\t%lx\t%s\n",
+			       voff, index, count, page_flag_name(flags0));
+		else if (opt_file)
+			printf("%lx\t%lx\t%lx\t%llx\t%s\n",
+			       voff, index, count, KPC_TAGS(entry0), page_flag_name(flags0));
+		else
+			printf("%lx\t%lx\t%s\n",
+			       index, count, page_flag_name(flags0));
 	}
 
 	flags0 = flags;
 	index  = offset;
 	voff   = voffset;
-	count  = size;
-}
-
-static void flush_page_range(void)
-{
-	show_page_range(0, 0, 0, 0);
+	count  = 1;
+	entry0 = entry;
 }
 
 static void show_page(unsigned long voffset,
-		      unsigned long offset, uint64_t flags)
+		      unsigned long offset, uint64_t flags, uint64_t entry)
 {
 	if (opt_pid)
-		printf("%lx\t", voffset);
-	if (opt_file)
-		printf("%lu\t", voffset);
-	printf("%lx\t%s\n", offset, page_flag_name(flags));
+		printf("%lx\t%lx\t%s\n",
+		       voffset, offset, page_flag_name(flags));
+	else if (opt_file)
+		printf("%lx\t%lx\t%llx\t%s\n",
+		       voffset, offset, KPC_TAGS(entry), page_flag_name(flags));
+	else
+		printf("%lx\t%s\n", offset, page_flag_name(flags));
 }
 
 static void show_summary(void)
@@ -574,9 +590,9 @@ static void add_page(unsigned long voffset,
 		unpoison_page(offset);
 
 	if (opt_list == 1)
-		show_page_range(voffset, offset, 1, flags);
+		show_page_range(voffset, offset, flags, pme);
 	else if (opt_list == 2)
-		show_page(voffset, offset, flags);
+		show_page(voffset, offset, flags, pme);
 
 	nr_pages[hash_slot(flags)]++;
 	total_pages++;
@@ -655,6 +671,40 @@ static void walk_task(unsigned long index, unsigned long count)
 	}
 }
 
+struct stat kpagecache_stat;
+
+#define KPAGECACHE_BATCH	(64 << 10)	/* 64k pages */
+static void walk_file(unsigned long index, unsigned long count)
+{
+	uint64_t buf[KPAGECACHE_BATCH];
+	unsigned long batch;
+	unsigned long pages;
+	unsigned long pfn;
+	unsigned long i;
+	unsigned long end_index = count;
+	unsigned long size;
+
+	stat(opt_file, &kpagecache_stat);
+	size = kpagecache_stat.st_size;
+	if (size > 0)
+		size = (size - 1) / 4096;
+	end_index = min_t(unsigned long, index + count - 1, size);
+	while (index <= end_index) {
+		batch = min_t(unsigned long, count, PAGEMAP_BATCH);
+		pages = kpagecache_read(buf, index, batch);
+		if (pages == 0)
+			break;
+		for (i = 0; i < pages; i++) {
+			pfn = buf[i] & ((1UL << 52) - 1UL);
+			if (pfn)
+				walk_pfn(index + i, pfn, 1, buf[i]);
+		}
+
+		index += pages;
+		count -= pages;
+	}
+}
+
 static void add_addr_range(unsigned long offset, unsigned long size)
 {
 	if (nr_addr_ranges >= MAX_ADDR_RANGES)
@@ -675,10 +725,12 @@ static void walk_addr_ranges(void)
 		add_addr_range(0, ULONG_MAX);
 
 	for (i = 0; i < nr_addr_ranges; i++)
-		if (!opt_pid)
-			walk_pfn(opt_offset[i], opt_offset[i], opt_size[i], 0);
-		else
+		if (opt_pid)
 			walk_task(opt_offset[i], opt_size[i]);
+		else if (opt_file)
+			walk_file(opt_offset[i], opt_size[i]);
+		else
+			walk_pfn(0, opt_offset[i], opt_size[i], 0);
 
 	close(kpageflags_fd);
 }
@@ -806,130 +858,21 @@ static void parse_pid(const char *str)
 	fclose(file);
 }
 
-static void show_file(const char *name, const struct stat *st)
-{
-	unsigned long long size = st->st_size;
-	char atime[64], mtime[64];
-	long now = time(NULL);
-
-	printf("%s\tInode: %u\tSize: %llu (%llu pages)\n",
-			name, (unsigned)st->st_ino,
-			size, (size + page_size - 1) / page_size);
-
-	strftime(atime, sizeof(atime), "%c", localtime(&st->st_atime));
-	strftime(mtime, sizeof(mtime), "%c", localtime(&st->st_mtime));
-
-	printf("Modify: %s (%ld seconds ago)\nAccess: %s (%ld seconds ago)\n",
-			mtime, now - st->st_mtime,
-			atime, now - st->st_atime);
-}
-
-static void walk_file(const char *name, const struct stat *st)
-{
-	uint8_t vec[PAGEMAP_BATCH];
-	uint64_t buf[PAGEMAP_BATCH], flags;
-	unsigned long nr_pages, pfn, i;
-	int fd;
-	off_t off;
-	ssize_t len;
-	void *ptr;
-	int first = 1;
-
-	fd = checked_open(name, O_RDONLY|O_NOATIME|O_NOFOLLOW);
-
-	for (off = 0; off < st->st_size; off += len) {
-		nr_pages = (st->st_size - off + page_size - 1) / page_size;
-		if (nr_pages > PAGEMAP_BATCH)
-			nr_pages = PAGEMAP_BATCH;
-		len = nr_pages * page_size;
-
-		ptr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
-		if (ptr == MAP_FAILED)
-			fatal("mmap failed: %s", name);
-
-		/* determine cached pages */
-		if (mincore(ptr, len, vec))
-			fatal("mincore failed: %s", name);
-
-		/* turn off readahead */
-		if (madvise(ptr, len, MADV_RANDOM))
-			fatal("madvice failed: %s", name);
-
-		/* populate ptes */
-		for (i = 0; i < nr_pages ; i++) {
-			if (vec[i] & 1)
-				(void)*(volatile int *)(ptr + i * page_size);
-		}
-
-		/* turn off harvesting reference bits */
-		if (madvise(ptr, len, MADV_SEQUENTIAL))
-			fatal("madvice failed: %s", name);
-
-		if (pagemap_read(buf, (unsigned long)ptr / page_size,
-					nr_pages) != nr_pages)
-			fatal("cannot read pagemap");
-
-		munmap(ptr, len);
-
-		for (i = 0; i < nr_pages; i++) {
-			pfn = pagemap_pfn(buf[i]);
-			if (!pfn)
-				continue;
-			if (!kpageflags_read(&flags, pfn, 1))
-				continue;
-			if (first && opt_list) {
-				first = 0;
-				flush_page_range();
-				show_file(name, st);
-			}
-			add_page(off / page_size + i, pfn, flags, buf[i]);
-		}
-	}
-
-	close(fd);
-}
-
-int walk_tree(const char *name, const struct stat *st, int type, struct FTW *f)
-{
-	(void)f;
-	switch (type) {
-	case FTW_F:
-		if (S_ISREG(st->st_mode))
-			walk_file(name, st);
-		break;
-	case FTW_DNR:
-		fprintf(stderr, "cannot read dir: %s\n", name);
-		break;
-	}
-	return 0;
-}
-
-static void walk_page_cache(void)
+static void parse_file(const char *name)
 {
-	struct stat st;
-
-	kpageflags_fd = checked_open(PROC_KPAGEFLAGS, O_RDONLY);
-	pagemap_fd = checked_open("/proc/self/pagemap", O_RDONLY);
-
-	if (stat(opt_file, &st))
-		fatal("stat failed: %s\n", opt_file);
-
-	if (S_ISREG(st.st_mode)) {
-		walk_file(opt_file, &st);
-	} else if (S_ISDIR(st.st_mode)) {
-		/* do not follow symlinks and mountpoints */
-		if (nftw(opt_file, walk_tree, 64, FTW_MOUNT | FTW_PHYS) < 0)
-			fatal("nftw failed: %s\n", opt_file);
-	} else
-		fatal("unhandled file type: %s\n", opt_file);
-
-	close(kpageflags_fd);
-	close(pagemap_fd);
+	int ret;
+	opt_file = (char *)name;
+	kpagecache_fd = checked_open(PROC_KPAGECACHE, O_RDWR);
+	ret = write(kpagecache_fd, name, strlen(name));
+	if (ret != (int)strlen(name))
+		fatal("Failed to set file on %s\n", PROC_KPAGECACHE);
 }
 
-static void parse_file(const char *name)
+static void close_kpagecache(void)
 {
-	opt_file = name;
+	/* Reset in-kernel configuration. */
+	write(kpagecache_fd, NULL, 1);
+	close(kpagecache_fd);
 }
 
 static void parse_addr_range(const char *optarg)
@@ -1118,22 +1061,31 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	if (opt_list && opt_pid)
-		printf("voffset\t");
-	if (opt_list && opt_file)
-		printf("foffset\t");
-	if (opt_list == 1)
-		printf("offset\tlen\tflags\n");
-	if (opt_list == 2)
-		printf("offset\tflags\n");
+	if (opt_pid && opt_file) {
+		fprintf(stderr,
+		"Option -p and -f are mutually exclusive. Don't set both.\n");
+		exit(1);
+	}
 
-	if (opt_file)
-		walk_page_cache();
-	else
-		walk_addr_ranges();
+	if (opt_pid) {
+		if (opt_list == 1)
+			printf("voffset\toffset\tlen\tflags\n");
+		if (opt_list == 2)
+			printf("voffset\toffset\tflags\n");
+	} else if (opt_file) {
+		if (opt_list == 1)
+			printf("voffset\toffset\tlen\ttag\tflags\n");
+		if (opt_list == 2)
+			printf("voffset\toffset\ttag\tflags\n");
+	}
+
+	walk_addr_ranges();
 
 	if (opt_list == 1)
-		flush_page_range();
+		show_page_range(0, 0, 0, 0);  /* drain the buffer */
+
+	if (opt_file)
+		close_kpagecache();
 
 	if (opt_no_summary)
 		return 0;
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/4] Documentation: update Documentation/vm/pagemap.txt
  2014-05-21  2:26 [PATCH 0/4] pagecache scanning with /proc/kpagecache Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2014-05-21  2:26 ` [PATCH 3/4] tools/vm/page-types.c: rework on file cache scanning mode Naoya Horiguchi
@ 2014-05-21  2:26 ` Naoya Horiguchi
  2014-05-21 22:42 ` [PATCH 0/4] pagecache scanning with /proc/kpagecache Andrew Morton
  4 siblings, 0 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2014-05-21  2:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov

This patch adds a chapter about kpagecache interface.

ChangeLog:
- add len column in example output

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 Documentation/vm/pagemap.txt | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git v3.15-rc5.orig/Documentation/vm/pagemap.txt v3.15-rc5/Documentation/vm/pagemap.txt
index 5948e455c4d2..12a871efd372 100644
--- v3.15-rc5.orig/Documentation/vm/pagemap.txt
+++ v3.15-rc5/Documentation/vm/pagemap.txt
@@ -150,3 +150,32 @@ once.
 Reading from any of the files will return -EINVAL if you are not starting
 the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
 into the file), or if the size of the read is not a multiple of 8 bytes.
+
+
+kpagecache, from file perspective
+---------------------------------
+
+Similarly to pagemap, we have a interface /proc/kpagecache to let userspace
+know about pagecache profile for a given file. Unlike pagemap interface,
+we don't have to mmap() and fault in the target file, so the impact on other
+workloads (maybe the target of your analysis) is minimum.
+
+To use this interface, firstly we open it and write the name of the target
+file to it for setup. And then we can read the pagecache info of the file.
+The file contains the array of 64-bit entries for each page offset. Data
+format is like below:
+
+    * Bits  0-49  page frame number (PFN) if present
+    * Bits 50-59  zero (reserved)
+    * Bits 60-63  pagecache tags
+
+Good example is tools/vm/page-types.c, where we can get the list of pages
+belonging to the file like below:
+
+  $ dd if=/dev/urandom of=file bs=4096 count=2
+  $ date >> file
+  $ tools/vm/page-types -f file -Nl
+  voffset offset  len     tag     flags
+  0       640c7   1       0       __RU_l______________________________
+  1       640d7   1       0       __RU_l______________________________
+  2       640f4   1       1       ___UDlA_____________________________
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] radix-tree: add end_index to support ranged iteration
  2014-05-21  2:26 ` [PATCH 1/4] radix-tree: add end_index to support ranged iteration Naoya Horiguchi
@ 2014-05-21  8:21   ` Konstantin Khlebnikov
  0 siblings, 0 replies; 22+ messages in thread
From: Konstantin Khlebnikov @ 2014-05-21  8:21 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Linux Kernel Mailing List, linux-mm, Andrew Morton, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov

On Wed, May 21, 2014 at 6:26 AM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> It's useful if we can run only over a specific index range of radix trees,
> which this patch does. This patch changes only radix_tree_for_each_slot()
> and radix_tree_for_each_tagged(), because we need it only for them for now.

NAK, I don't see how this is usefull. Main users don't need this.
Barely used argument don't makes complicated macro easier.
radix_tree_next_slot() is perfomance-critical. I'm not sure that
compiler can throw out
your checks if the end is -1, since it stored in structure which is
passed into function.

Just write something like this where needed.

radix_tree_for_each_slot(..) {
   if (iter.index > end)
      goto out;
   <...>
   if (iter.index == end)
      goto out;
}
out:

Probably this migh be hidden in a macro as well.
There is simple way to abort iterating: set next_index to zero and
break inner loop.

>
> ChangeLog:
> - rebased onto v3.15-rc5, which has e a few new caller of radix_tree_for_each_slot(),
>   so apply this change them too.
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  drivers/gpu/drm/qxl/qxl_ttm.c |  2 +-
>  include/linux/radix-tree.h    | 27 ++++++++++++++++++++-------
>  kernel/irq/irqdomain.c        |  2 +-
>  lib/radix-tree.c              |  8 ++++----
>  mm/filemap.c                  |  8 ++++----
>  5 files changed, 30 insertions(+), 17 deletions(-)
>
> diff --git v3.15-rc5.orig/drivers/gpu/drm/qxl/qxl_ttm.c v3.15-rc5/drivers/gpu/drm/qxl/qxl_ttm.c
> index d52c27527b9a..d807e66fe308 100644
> --- v3.15-rc5.orig/drivers/gpu/drm/qxl/qxl_ttm.c
> +++ v3.15-rc5/drivers/gpu/drm/qxl/qxl_ttm.c
> @@ -398,7 +398,7 @@ static int qxl_sync_obj_wait(void *sync_obj,
>                 struct radix_tree_iter iter;
>                 int release_id;
>
> -               radix_tree_for_each_slot(slot, &qfence->tree, &iter, 0) {
> +               radix_tree_for_each_slot(slot, &qfence->tree, &iter, 0, ~0UL) {
>                         struct qxl_release *release;
>
>                         release_id = iter.index;
> diff --git v3.15-rc5.orig/include/linux/radix-tree.h v3.15-rc5/include/linux/radix-tree.h
> index 33170dbd9db4..88258c34371f 100644
> --- v3.15-rc5.orig/include/linux/radix-tree.h
> +++ v3.15-rc5/include/linux/radix-tree.h
> @@ -312,6 +312,7 @@ static inline void radix_tree_preload_end(void)
>   * @index:     index of current slot
>   * @next_index:        next-to-last index for this chunk
>   * @tags:      bit-mask for tag-iterating
> + * @end_index:  last index to be scanned
>   *
>   * This radix tree iterator works in terms of "chunks" of slots.  A chunk is a
>   * subinterval of slots contained within one radix tree leaf node.  It is
> @@ -324,6 +325,7 @@ struct radix_tree_iter {
>         unsigned long   index;
>         unsigned long   next_index;
>         unsigned long   tags;
> +       unsigned long   end_index;
>  };
>
>  #define RADIX_TREE_ITER_TAG_MASK       0x00FF  /* tag index in lower byte */
> @@ -335,10 +337,12 @@ struct radix_tree_iter {
>   *
>   * @iter:      pointer to iterator state
>   * @start:     iteration starting index
> + * @end:       iteration ending index
>   * Returns:    NULL
>   */
>  static __always_inline void **
> -radix_tree_iter_init(struct radix_tree_iter *iter, unsigned long start)
> +radix_tree_iter_init(struct radix_tree_iter *iter, unsigned long start,
> +                       unsigned long end)
>  {
>         /*
>          * Leave iter->tags uninitialized. radix_tree_next_chunk() will fill it
> @@ -350,6 +354,7 @@ radix_tree_iter_init(struct radix_tree_iter *iter, unsigned long start)
>          */
>         iter->index = 0;
>         iter->next_index = start;
> +       iter->end_index = end;
>         return NULL;
>  }
>
> @@ -399,6 +404,8 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>                 iter->tags >>= 1;
>                 if (likely(iter->tags & 1ul)) {
>                         iter->index++;
> +                       if (iter->index > iter->end_index)
> +                               return NULL;
>                         return slot + 1;
>                 }
>                 if (!(flags & RADIX_TREE_ITER_CONTIG) && likely(iter->tags)) {
> @@ -406,6 +413,8 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>
>                         iter->tags >>= offset;
>                         iter->index += offset + 1;
> +                       if (iter->index > iter->end_index)
> +                               return NULL;
>                         return slot + offset + 1;
>                 }
>         } else {
> @@ -414,6 +423,8 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>                 while (size--) {
>                         slot++;
>                         iter->index++;
> +                       if (iter->index > iter->end_index)
> +                               return NULL;
>                         if (likely(*slot))
>                                 return slot;
>                         if (flags & RADIX_TREE_ITER_CONTIG) {
> @@ -438,7 +449,7 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>   * Locks can be released and reacquired between iterations.
>   */
>  #define radix_tree_for_each_chunk(slot, root, iter, start, flags)      \
> -       for (slot = radix_tree_iter_init(iter, start) ;                 \
> +       for (slot = radix_tree_iter_init(iter, start, ~0UL) ;           \
>               (slot = radix_tree_next_chunk(root, iter, flags)) ;)
>
>  /**
> @@ -461,11 +472,12 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>   * @root:      the struct radix_tree_root pointer
>   * @iter:      the struct radix_tree_iter pointer
>   * @start:     iteration starting index
> + * @end:       iteration ending index
>   *
>   * @slot points to radix tree slot, @iter->index contains its index.
>   */
> -#define radix_tree_for_each_slot(slot, root, iter, start)              \
> -       for (slot = radix_tree_iter_init(iter, start) ;                 \
> +#define radix_tree_for_each_slot(slot, root, iter, start, end)         \
> +       for (slot = radix_tree_iter_init(iter, start, end) ;            \
>              slot || (slot = radix_tree_next_chunk(root, iter, 0)) ;    \
>              slot = radix_tree_next_slot(slot, iter, 0))
>
> @@ -480,7 +492,7 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>   * @slot points to radix tree slot, @iter->index contains its index.
>   */
>  #define radix_tree_for_each_contig(slot, root, iter, start)            \
> -       for (slot = radix_tree_iter_init(iter, start) ;                 \
> +       for (slot = radix_tree_iter_init(iter, start, ~0UL) ;           \
>              slot || (slot = radix_tree_next_chunk(root, iter,          \
>                                 RADIX_TREE_ITER_CONTIG)) ;              \
>              slot = radix_tree_next_slot(slot, iter,                    \
> @@ -493,12 +505,13 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>   * @root:      the struct radix_tree_root pointer
>   * @iter:      the struct radix_tree_iter pointer
>   * @start:     iteration starting index
> + * @end:       iteration ending index
>   * @tag:       tag index
>   *
>   * @slot points to radix tree slot, @iter->index contains its index.
>   */
> -#define radix_tree_for_each_tagged(slot, root, iter, start, tag)       \
> -       for (slot = radix_tree_iter_init(iter, start) ;                 \
> +#define radix_tree_for_each_tagged(slot, root, iter, start, end, tag)  \
> +       for (slot = radix_tree_iter_init(iter, start, end) ;            \
>              slot || (slot = radix_tree_next_chunk(root, iter,          \
>                               RADIX_TREE_ITER_TAGGED | tag)) ;          \
>              slot = radix_tree_next_slot(slot, iter,                    \
> diff --git v3.15-rc5.orig/kernel/irq/irqdomain.c v3.15-rc5/kernel/irq/irqdomain.c
> index f14033700c25..55fc49b412e1 100644
> --- v3.15-rc5.orig/kernel/irq/irqdomain.c
> +++ v3.15-rc5/kernel/irq/irqdomain.c
> @@ -571,7 +571,7 @@ static int virq_debug_show(struct seq_file *m, void *private)
>         mutex_lock(&irq_domain_mutex);
>         list_for_each_entry(domain, &irq_domain_list, link) {
>                 int count = 0;
> -               radix_tree_for_each_slot(slot, &domain->revmap_tree, &iter, 0)
> +               radix_tree_for_each_slot(slot, &domain->revmap_tree, &iter, 0, ~0UL)
>                         count++;
>                 seq_printf(m, "%c%-16s  %6u  %10u  %10u  %s\n",
>                            domain == irq_default_domain ? '*' : ' ', domain->name,
> diff --git v3.15-rc5.orig/lib/radix-tree.c v3.15-rc5/lib/radix-tree.c
> index 9599aa72d7a0..531fba8a81db 100644
> --- v3.15-rc5.orig/lib/radix-tree.c
> +++ v3.15-rc5/lib/radix-tree.c
> @@ -1007,7 +1007,7 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
>         if (unlikely(!max_items))
>                 return 0;
>
> -       radix_tree_for_each_slot(slot, root, &iter, first_index) {
> +       radix_tree_for_each_slot(slot, root, &iter, first_index, ~0UL) {
>                 results[ret] = indirect_to_ptr(rcu_dereference_raw(*slot));
>                 if (!results[ret])
>                         continue;
> @@ -1049,7 +1049,7 @@ radix_tree_gang_lookup_slot(struct radix_tree_root *root,
>         if (unlikely(!max_items))
>                 return 0;
>
> -       radix_tree_for_each_slot(slot, root, &iter, first_index) {
> +       radix_tree_for_each_slot(slot, root, &iter, first_index, ~0UL) {
>                 results[ret] = slot;
>                 if (indices)
>                         indices[ret] = iter.index;
> @@ -1086,7 +1086,7 @@ radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
>         if (unlikely(!max_items))
>                 return 0;
>
> -       radix_tree_for_each_tagged(slot, root, &iter, first_index, tag) {
> +       radix_tree_for_each_tagged(slot, root, &iter, first_index, ~0UL, tag) {
>                 results[ret] = indirect_to_ptr(rcu_dereference_raw(*slot));
>                 if (!results[ret])
>                         continue;
> @@ -1123,7 +1123,7 @@ radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
>         if (unlikely(!max_items))
>                 return 0;
>
> -       radix_tree_for_each_tagged(slot, root, &iter, first_index, tag) {
> +       radix_tree_for_each_tagged(slot, root, &iter, first_index, ~0UL, tag) {
>                 results[ret] = slot;
>                 if (++ret == max_items)
>                         break;
> diff --git v3.15-rc5.orig/mm/filemap.c v3.15-rc5/mm/filemap.c
> index 000a220e2a41..d684e4cffe96 100644
> --- v3.15-rc5.orig/mm/filemap.c
> +++ v3.15-rc5/mm/filemap.c
> @@ -1118,7 +1118,7 @@ unsigned find_get_entries(struct address_space *mapping,
>
>         rcu_read_lock();
>  restart:
> -       radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
> +       radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start, ~0UL) {
>                 struct page *page;
>  repeat:
>                 page = radix_tree_deref_slot(slot);
> @@ -1180,7 +1180,7 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
>
>         rcu_read_lock();
>  restart:
> -       radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
> +       radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start, ~0UL) {
>                 struct page *page;
>  repeat:
>                 page = radix_tree_deref_slot(slot);
> @@ -1324,7 +1324,7 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
>         rcu_read_lock();
>  restart:
>         radix_tree_for_each_tagged(slot, &mapping->page_tree,
> -                                  &iter, *index, tag) {
> +                                  &iter, *index, ~0UL, tag) {
>                 struct page *page;
>  repeat:
>                 page = radix_tree_deref_slot(slot);
> @@ -2023,7 +2023,7 @@ void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
>         pte_t *pte;
>
>         rcu_read_lock();
> -       radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> +       radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff, ~0UL) {
>                 if (iter.index > vmf->max_pgoff)
>                         break;
>  repeat:
> --
> 1.9.0
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] pagecache scanning with /proc/kpagecache
  2014-05-21  2:26 [PATCH 0/4] pagecache scanning with /proc/kpagecache Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2014-05-21  2:26 ` [PATCH 4/4] Documentation: update Documentation/vm/pagemap.txt Naoya Horiguchi
@ 2014-05-21 22:42 ` Andrew Morton
       [not found]   ` <537d5ee4.4914e00a.5672.ffff85d5SMTPIN_ADDED_BROKEN@mx.google.com>
  4 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2014-05-21 22:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov

On Tue, 20 May 2014 22:26:30 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> This patchset adds a new procfs interface to extrace information about
> pagecache status. In-kernel tool tools/vm/page-types.c has already some
> code for pagecache scanning without kernel's help, but it's not free
> from measurement-disturbance, so here I'm suggesting another approach.

I'm not seeing much explanation of why you think the kernel needs this.
The overall justification for a change is terribly important so please
do spend some time on it.

As I don't *really* know what the patch is for, I can't comment a lot
further, but...


A much nicer interface would be for us to (finally!) implement
fincore(), perhaps with an enhanced per-present-page payload which
presents the info which you need (although we don't actually know what
that info is!).

This would require open() - it appears to be a requirement that the
caller not open the file, but no reason was given for this.

Requiring open() would address some of the obvious security concerns,
but it will still be possible for processes to poke around and get some
understanding of the behaviour of other processes.  Careful attention
should be paid to this aspect of any such patchset.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] pagecache scanning with /proc/kpagecache
       [not found]   ` <537d5ee4.4914e00a.5672.ffff85d5SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-05-22  2:33     ` Andrew Morton
  2014-05-22  9:50       ` Konstantin Khlebnikov
  2014-06-02  5:24       ` [RFC][PATCH 0/3] mm: introduce fincore() Naoya Horiguchi
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2014-05-22  2:33 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov

On Wed, 21 May 2014 22:19:55 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> > A much nicer interface would be for us to (finally!) implement
> > fincore(), perhaps with an enhanced per-present-page payload which
> > presents the info which you need (although we don't actually know what
> > that info is!).
> 
> page/pfn of each page slot and its page cache tag as shown in patch 4/4.
> 
> > This would require open() - it appears to be a requirement that the
> > caller not open the file, but no reason was given for this.
> > 
> > Requiring open() would address some of the obvious security concerns,
> > but it will still be possible for processes to poke around and get some
> > understanding of the behaviour of other processes.  Careful attention
> > should be paid to this aspect of any such patchset.
> 
> Sorry if I missed your point, but this interface defines fixed mapping
> between file position in /proc/kpagecache and in-file page offset of
> the target file. So we do not need to use seq_file mechanism, that's
> why open() is not defined and default one is used.
> The same thing is true for /proc/{kpagecount,kpageflags}, from which
> I copied/pasted some basic code.

I think you did miss my point ;) Please do a web search for fincore -
it's a syscall similar to mincore(), only it queries pagecache:
fincore(int fd, loff_t offset, ...).  In its simplest form it queries
just for present/absent, but we could increase the query payload to
incorporate additional per-page info.

It would take a lot of thought and discussion to nail down the
fincore() interface (we've already tried a couple of times).  But
unfortunately, fincore() is probably going to be implemented one day
and it will (or at least could) make /proc/kpagecache obsolete.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] pagecache scanning with /proc/kpagecache
  2014-05-22  2:33     ` Andrew Morton
@ 2014-05-22  9:50       ` Konstantin Khlebnikov
  2014-05-22 10:36         ` Kirill A. Shutemov
  2014-06-02  5:24       ` [RFC][PATCH 0/3] mm: introduce fincore() Naoya Horiguchi
  1 sibling, 1 reply; 22+ messages in thread
From: Konstantin Khlebnikov @ 2014-05-22  9:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Naoya Horiguchi, Linux Kernel Mailing List, linux-mm,
	Wu Fengguang, Arnaldo Carvalho de Melo, Borislav Petkov,
	Cyrill Gorcunov

On Thu, May 22, 2014 at 6:33 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 21 May 2014 22:19:55 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
>> > A much nicer interface would be for us to (finally!) implement
>> > fincore(), perhaps with an enhanced per-present-page payload which
>> > presents the info which you need (although we don't actually know what
>> > that info is!).
>>
>> page/pfn of each page slot and its page cache tag as shown in patch 4/4.
>>
>> > This would require open() - it appears to be a requirement that the
>> > caller not open the file, but no reason was given for this.
>> >
>> > Requiring open() would address some of the obvious security concerns,
>> > but it will still be possible for processes to poke around and get some
>> > understanding of the behaviour of other processes.  Careful attention
>> > should be paid to this aspect of any such patchset.
>>
>> Sorry if I missed your point, but this interface defines fixed mapping
>> between file position in /proc/kpagecache and in-file page offset of
>> the target file. So we do not need to use seq_file mechanism, that's
>> why open() is not defined and default one is used.
>> The same thing is true for /proc/{kpagecount,kpageflags}, from which
>> I copied/pasted some basic code.
>
> I think you did miss my point ;) Please do a web search for fincore -
> it's a syscall similar to mincore(), only it queries pagecache:
> fincore(int fd, loff_t offset, ...).  In its simplest form it queries
> just for present/absent, but we could increase the query payload to
> incorporate additional per-page info.
>
> It would take a lot of thought and discussion to nail down the
> fincore() interface (we've already tried a couple of times).  But
> unfortunately, fincore() is probably going to be implemented one day
> and it will (or at least could) make /proc/kpagecache obsolete.
>

It seems fincore() also might obsolete /proc/kpageflags and /proc/pid/pagemap.
because it might be implemented for /dev/mem and /proc/pid/mem as well
as for normal files.

Something like this:
int fincore(int fd, u64 *kpf, u64 *pfn, size_t length, off_t offset)

It reports PFN and page-flags in KPF_* notation. PFN array is optional.
KFP_NOPAGE reports hole, otherwise this is present page, but probably
not-uptodate.
KFP_SOFTDIRTY already here.


Also we need new flag KFP_SWAPENTRY for vm/shmem to report swap-entry
instead of pfn.
Probably this is redundant, we cannot report pfn and swap-entry
togerher if page present and in swap-cache.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] pagecache scanning with /proc/kpagecache
  2014-05-22  9:50       ` Konstantin Khlebnikov
@ 2014-05-22 10:36         ` Kirill A. Shutemov
  0 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2014-05-22 10:36 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Andrew Morton, Naoya Horiguchi, Linux Kernel Mailing List,
	linux-mm, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Cyrill Gorcunov

On Thu, May 22, 2014 at 01:50:22PM +0400, Konstantin Khlebnikov wrote:
> On Thu, May 22, 2014 at 6:33 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Wed, 21 May 2014 22:19:55 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> >
> >> > A much nicer interface would be for us to (finally!) implement
> >> > fincore(), perhaps with an enhanced per-present-page payload which
> >> > presents the info which you need (although we don't actually know what
> >> > that info is!).
> >>
> >> page/pfn of each page slot and its page cache tag as shown in patch 4/4.
> >>
> >> > This would require open() - it appears to be a requirement that the
> >> > caller not open the file, but no reason was given for this.
> >> >
> >> > Requiring open() would address some of the obvious security concerns,
> >> > but it will still be possible for processes to poke around and get some
> >> > understanding of the behaviour of other processes.  Careful attention
> >> > should be paid to this aspect of any such patchset.
> >>
> >> Sorry if I missed your point, but this interface defines fixed mapping
> >> between file position in /proc/kpagecache and in-file page offset of
> >> the target file. So we do not need to use seq_file mechanism, that's
> >> why open() is not defined and default one is used.
> >> The same thing is true for /proc/{kpagecount,kpageflags}, from which
> >> I copied/pasted some basic code.
> >
> > I think you did miss my point ;) Please do a web search for fincore -
> > it's a syscall similar to mincore(), only it queries pagecache:
> > fincore(int fd, loff_t offset, ...).  In its simplest form it queries
> > just for present/absent, but we could increase the query payload to
> > incorporate additional per-page info.
> >
> > It would take a lot of thought and discussion to nail down the
> > fincore() interface (we've already tried a couple of times).  But
> > unfortunately, fincore() is probably going to be implemented one day
> > and it will (or at least could) make /proc/kpagecache obsolete.
> >
> 
> It seems fincore() also might obsolete /proc/kpageflags and /proc/pid/pagemap.
> because it might be implemented for /dev/mem and /proc/pid/mem as well
> as for normal files.
> 
> Something like this:
> int fincore(int fd, u64 *kpf, u64 *pfn, size_t length, off_t offset)

As always with new syscalls flags are missing ;)

u64 for kpf doesn't sound future proof enough. What about this:

int fincore(int fd, size_t length, off_t offset,
	unsigned long flags, void *records);

Format of records is defined by what user asks in flags. Like:

 - FINCORE_PFN: records are 64-bit each with pfn;
 - FINCORE_PAGE_FLAGS: records are 64-bit each with flags;
 - FINCORE_PFN | FINCORE_PAGE_FLAGS: records are 128-bit each with pfns
   followed by flags (or vice versa);

New flags can extend the format if we would want to expose more info.

Comments?

BTW, does everybody happy with mincore() interface? We report 1 there if
pte is present, but it doesn't really say much about the page for cases
like zero page...

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH 0/3] mm: introduce fincore()
  2014-05-22  2:33     ` Andrew Morton
  2014-05-22  9:50       ` Konstantin Khlebnikov
@ 2014-06-02  5:24       ` Naoya Horiguchi
  2014-06-02  5:24         ` [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration Naoya Horiguchi
                           ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2014-06-02  5:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm

Due to the previous discussion[1], I learned that you people have discussed
this system call a few times (but not conclusion) and it can solve my problem
about pagecache scanning (see[2] for my motivation.) So I try it now.

The main patch of this patchset is patch 2, and this is based on Johannes's
previous version[3], so I CCed people who joined that discussion. While there
might be controversies about the format of data copied from kernel to userspace,
I take the Kirill's suggestion[4] which uses a flag to choose the data format,
which is extensible and flexible (you can cut off some info if you don't need it.)

And I added simple tests at patch 3, and patch 2 passes all the tests.

Any comments are welcomed.

[1] http://marc.info/?l=linux-kernel&m=140072606903894&w=2
[2] http://marc.info/?l=linux-mm&m=140072522603776&w=2
[3] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919
[4] http://marc.info/?l=linux-kernel&m=140075509611532&w=2

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (3):
      replace PAGECACHE_TAG_* definition with enumeration
      mm: introduce fincore()
      selftest: add test code for fincore()

 arch/x86/syscalls/syscall_64.tbl                   |   1 +
 include/linux/fs.h                                 |   9 +-
 include/linux/syscalls.h                           |   2 +
 mm/Makefile                                        |   2 +-
 mm/fincore.c                                       | 362 +++++++++++++++++++++
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/fincore/Makefile           |  31 ++
 .../selftests/fincore/create_hugetlbfs_file.c      |  49 +++
 tools/testing/selftests/fincore/fincore.c          | 153 +++++++++
 tools/testing/selftests/fincore/run_fincoretests   | 355 ++++++++++++++++++++
 10 files changed, 961 insertions(+), 4 deletions(-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration
  2014-06-02  5:24       ` [RFC][PATCH 0/3] mm: introduce fincore() Naoya Horiguchi
@ 2014-06-02  5:24         ` Naoya Horiguchi
  2014-06-02 16:12           ` Dave Hansen
  2014-06-02  5:24         ` [PATCH 2/3] mm: introduce fincore() Naoya Horiguchi
  2014-06-02  5:24         ` [PATCH 3/3] selftest: add test code for fincore() Naoya Horiguchi
  2 siblings, 1 reply; 22+ messages in thread
From: Naoya Horiguchi @ 2014-06-02  5:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm

We need number of pagecache tags in later patches, this patch prepares it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/fs.h | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git v3.15-rc7.orig/include/linux/fs.h v3.15-rc7/include/linux/fs.h
index 878031227c57..5b489df9d964 100644
--- v3.15-rc7.orig/include/linux/fs.h
+++ v3.15-rc7/include/linux/fs.h
@@ -447,9 +447,12 @@ struct block_device {
  * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
  * radix trees
  */
-#define PAGECACHE_TAG_DIRTY	0
-#define PAGECACHE_TAG_WRITEBACK	1
-#define PAGECACHE_TAG_TOWRITE	2
+enum {
+	PAGECACHE_TAG_DIRTY,
+	PAGECACHE_TAG_WRITEBACK,
+	PAGECACHE_TAG_TOWRITE,
+	__NR_PAGECACHE_TAGS,
+};
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/3] mm: introduce fincore()
  2014-06-02  5:24       ` [RFC][PATCH 0/3] mm: introduce fincore() Naoya Horiguchi
  2014-06-02  5:24         ` [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration Naoya Horiguchi
@ 2014-06-02  5:24         ` Naoya Horiguchi
  2014-06-02  6:42           ` Christoph Hellwig
                             ` (3 more replies)
  2014-06-02  5:24         ` [PATCH 3/3] selftest: add test code for fincore() Naoya Horiguchi
  2 siblings, 4 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2014-06-02  5:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm

This patch provides a new system call fincore(2), which provides mincore()-
like information, i.e. page residency of a given file. But unlike mincore(),
fincore() can have a mode flag and it enables us to extract more detailed
information about page cache like pfn and page flag. This kind of information
is very helpful for example when applications want to know the file cache
status to control IO on their own way.

Detail about the data format being passed to userspace are explained in
inline comment, but generally in long entry format, we can choose which
information is extraced flexibly, so you don't have to waste memory by
extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
we can skip hole pages (not on memory,) which makes us avoid a flood of
meaningless zero entries when calling on extremely large (but only few
pages of it are loaded on memory) file.

Basic testset is added in a next patch on tools/testing/selftests/fincore/.

[1] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/syscalls.h         |   2 +
 mm/Makefile                      |   2 +-
 mm/fincore.c                     | 362 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 366 insertions(+), 1 deletion(-)
 create mode 100644 mm/fincore.c

diff --git v3.15-rc7.orig/arch/x86/syscalls/syscall_64.tbl v3.15-rc7/arch/x86/syscalls/syscall_64.tbl
index 04376ac3d9ef..0a6b6dd77708 100644
--- v3.15-rc7.orig/arch/x86/syscalls/syscall_64.tbl
+++ v3.15-rc7/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	fincore			sys_fincore
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git v3.15-rc7.orig/include/linux/syscalls.h v3.15-rc7/include/linux/syscalls.h
index a4a0588c5397..d625ec9cb73d 100644
--- v3.15-rc7.orig/include/linux/syscalls.h
+++ v3.15-rc7/include/linux/syscalls.h
@@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
 asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
 			 unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_fincore(int fd, loff_t start, long nr_pages,
+			int mode, unsigned char __user *vec);
 #endif
diff --git v3.15-rc7.orig/mm/Makefile v3.15-rc7/mm/Makefile
index b484452dac57..55e1d13ddb76 100644
--- v3.15-rc7.orig/mm/Makefile
+++ v3.15-rc7/mm/Makefile
@@ -18,7 +18,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   iov_iter.o $(mmu-y)
+			   iov_iter.o fincore.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git v3.15-rc7.orig/mm/fincore.c v3.15-rc7/mm/fincore.c
new file mode 100644
index 000000000000..3fc3ef465471
--- /dev/null
+++ v3.15-rc7/mm/fincore.c
@@ -0,0 +1,362 @@
+/*
+ * fincore(2) system call
+ *
+ * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
+ */
+
+#include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+
+/*
+ * You can control how the buffer in userspace is filled with this mode
+ * parameters:
+ *
+ * - FINCORE_BMAP:
+ *     The page status is returned in a vector of bytes.
+ *     The least significant bit of each byte is 1 if the referenced page
+ *     is in memory, otherwise it is zero.
+ *
+ * - FINCORE_PFN:
+ *     stores pfn, using 8 bytes.
+ *
+ * - FINCORE_PAGEFLAGS:
+ *     stores page flags, using 8 bytes. See definition of KPF_* for details.
+ *
+ * - FINCORE_PAGECACHE_TAGS:
+ *     stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
+ *     for details.
+ *
+ * - FINCORE_SKIP_HOLE: if this flag is set, fincore() doesn't store any
+ *     information about hole. Instead each records per page has the entry
+ *     of page offset (using 8 bytes.) This mode is useful if we handle
+ *     large file and only few pages are on memory for the file.
+ *
+ * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
+ * data in this mode is like this:
+ *
+ *   page offset  0   1   2   3   4
+ *              +---+---+---+---+---+
+ *              | 1 | 0 | 0 | 1 | 1 | ...
+ *              +---+---+---+---+---+
+ *               <->
+ *              1 byte
+ *
+ * For FINCORE_PFN, page data is formatted like this:
+ *
+ *   page offset    0       1       2       3       4
+ *              +-------+-------+-------+-------+-------+
+ *              |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
+ *              +-------+-------+-------+-------+-------+
+ *               <----->
+ *               8 byte
+ *
+ * We can use multiple flags among FINCORE_(PFN|PAGEFLAGS|PAGECACHE_TAGS).
+ * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
+ * information is stored like this:
+ *
+ *    page offset 0    page offset 1   page offset 2
+ *   +-------+-------+-------+-------+-------+-------+
+ *   |  pfn  | flags |  pfn  | flags |  pfn  | flags | ...
+ *   +-------+-------+-------+-------+-------+-------+
+ *    <-------------> <-------------> <------------->
+ *       16 bytes        16 bytes        16 bytes
+ *
+ * When FINCORE_SKIP_HOLE is set, we ignore holes and add page offset entry
+ * (8 bytes) instead. For example, the data format of mode
+ * FINCORE_PFN|FINCORE_SKIP_HOLE is like follows:
+ *
+ *   +-------+-------+-------+-------+-------+-------+
+ *   | pgoff |  pfn  | pgoff |  pfn  | pgoff |  pfn  | ...
+ *   +-------+-------+-------+-------+-------+-------+
+ *    <-------------> <-------------> <------------->
+ *       16 bytes        16 bytes        16 bytes
+ */
+#define FINCORE_BMAP		0x01	/* bytemap mode */
+#define FINCORE_PFN		0x02
+#define FINCORE_PAGE_FLAGS	0x04
+#define FINCORE_PAGECACHE_TAGS	0x08
+#define FINCORE_SKIP_HOLE	0x10
+
+#define FINCORE_MODE_MASK	0x1f
+#define FINCORE_LONGENTRY_MASK	(FINCORE_PFN | FINCORE_PAGE_FLAGS | \
+				 FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
+
+struct fincore_control {
+	int mode;
+	int width;		/* width of each entry (in bytes) */
+	unsigned char *buffer;
+	long buffer_size;
+	void *cursor;		/* current position on the buffer */
+	pgoff_t pgstart;	/* start point of page cache scan in each run
+				 * of the while loop */
+	long nr_pages;		/* number of pages to be copied to userspace
+				 * (decreasing while scan proceeds) */
+	long scanned_offset;	/* page offset of the lastest scanned page */
+	struct address_space *mapping;
+};
+
+/*
+ * TODO: doing radix_tree_tag_get() for each tag is not optimal, but no easy
+ * way without degrading finely tuned radix tree routines.
+ */
+static unsigned long get_pagecache_tags(struct radix_tree_root *root,
+					unsigned long index)
+{
+	int i;
+	unsigned long tags = 0;
+	for (i = 0; i < __NR_PAGECACHE_TAGS; i++)
+		if (radix_tree_tag_get(root, index, i))
+			tags |=  1 << i;
+	return tags;
+}
+
+#define store_entry(fc, type, data) ({		\
+	*(type *)fc->cursor = (type)data;	\
+	fc->cursor += sizeof(type);		\
+})
+
+/*
+ * Store page cache data to temporal buffer in the specified format depending
+ * on fincore mode.
+ */
+static void __do_fincore(struct fincore_control *fc, struct page *page,
+			 unsigned long index)
+{
+	VM_BUG_ON(!page);
+	VM_BUG_ON((unsigned long)fc->cursor - (unsigned long)fc->buffer
+		  >= fc->buffer_size);
+	if (fc->mode & FINCORE_BMAP)
+		store_entry(fc, unsigned char, PageUptodate(page));
+	else if (fc->mode & (FINCORE_LONGENTRY_MASK)) {
+		if (fc->mode & FINCORE_SKIP_HOLE)
+			store_entry(fc, unsigned long, index);
+		if (fc->mode & FINCORE_PFN)
+			store_entry(fc, unsigned long, page_to_pfn(page));
+		if (fc->mode & FINCORE_PAGE_FLAGS)
+			store_entry(fc, unsigned long, stable_page_flags(page));
+		if (fc->mode & FINCORE_PAGECACHE_TAGS)
+			store_entry(fc, unsigned long,
+				    get_pagecache_tags(&fc->mapping->page_tree,
+						       index));
+	}
+}
+
+/*
+ * Traverse page cache tree. It's assumed that temporal buffer are zeroed
+ * in advance. Due to this, we don't have to store zero entry explicitly
+ * one-by-one and we just set fc->cursor to the position of the next
+ * on-memory page.
+ *
+ * Return value is the number of pages whose data is stored in fc->buffer.
+ */
+static long do_fincore(struct fincore_control *fc, int nr_pages)
+{
+	pgoff_t pgend = fc->pgstart + nr_pages;
+	struct radix_tree_iter iter;
+	void **slot;
+	long nr = 0;
+
+	fc->cursor = fc->buffer;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &fc->mapping->page_tree, &iter,
+				 fc->pgstart) {
+		long jump;
+		struct page *page;
+
+		fc->scanned_offset = iter.index;
+		/* Handle holes */
+		jump = iter.index - fc->pgstart - nr;
+		if (jump) {
+			if (!(fc->mode & FINCORE_SKIP_HOLE)) {
+				if (iter.index < pgend) {
+					fc->cursor += jump * fc->width;
+					nr = iter.index - fc->pgstart;
+				} else {
+					/*
+					 * Fill remaining buffer as hole. Next
+					 * call should start at offset pgend.
+					 */
+					nr = nr_pages;
+					fc->scanned_offset = pgend - 1;
+					break;
+				}
+			}
+		}
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			/*
+			 * No need to increment nr and fc->cursor, because next
+			 * iteration should detect hole and update them there.
+			 */
+			continue;
+		else if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				WARN_ON(iter.index);
+				goto restart;
+			}
+			__do_fincore(fc, page, iter.index);
+		} else {
+			if (!page_cache_get_speculative(page))
+				goto repeat;
+
+			/* Has the page moved? */
+			if (unlikely(page != *slot)) {
+				page_cache_release(page);
+				goto repeat;
+			}
+
+			__do_fincore(fc, page, iter.index);
+			page_cache_release(page);
+		}
+
+		if (++nr == nr_pages)
+			break;
+	}
+	rcu_read_unlock();
+
+	return nr;
+}
+
+static inline bool fincore_validate_mode(int mode)
+{
+	if (mode & ~FINCORE_MODE_MASK)
+		return false;
+	if (!(!!(mode & FINCORE_BMAP) ^ !!(mode & FINCORE_LONGENTRY_MASK)))
+		return false;
+	if ((mode & FINCORE_LONGENTRY_MASK) == FINCORE_SKIP_HOLE)
+		return false;
+	return true;
+}
+
+#define FINCORE_LOOP_STEP	256L
+
+/*
+ * The fincore(2) system call
+ *
+ *  @fd:        file descriptor of the target file
+ *  @start:     starting address offset of the target file (in byte).
+ *              This should be aligned to page cache size.
+ *  @nr_pages:  the number of pages from which the page data is passed to
+ *              userspace. If you don't set FINCORE_SKIP_HOLE, it's identical
+ *		to the pages to be scanned.
+ *  @mode       fincore mode flags to determine the entry's format
+ *  @vec        pointer of the userspace buffer. The size must be equal to or
+ *              larger than (@nr_pages * width), where width is the size of
+ *              each entry.
+ *
+ * fincore() returns the memory residency status and additional info (like
+ * pfn and page flags) of the given file's pages.
+ *
+ * Depending on the fincore mode, caller can receive the different formatted
+ * information. See the comment on definition of FINCORE_* above.
+ *
+ * Because the status of a page can change after fincore() checks it
+ * but before it returns to the application, the returned vector may
+ * contain stale information.
+ *
+ * return values:
+ *  -EBADF:     @fd isn't a valid open file descriptor
+ *  -EFAULT:    @vec points to an illegal address
+ *  -EINVAL:    @start is not aligned to page cache size, or @mode is invalid
+ *  non-negative value: fincore() is successfully done and the value means
+ *              the number of valid entries stored on userspace buffer
+ */
+SYSCALL_DEFINE5(fincore, int, fd, loff_t, start, long, nr_pages,
+		int, mode, unsigned char __user *, vec)
+{
+	long ret;
+	long step;
+	long nr = 0;
+	int pc_shift = PAGE_CACHE_SHIFT;
+	struct fd f;
+
+	struct fincore_control fc = {
+		.mode	= mode,
+		.width	= sizeof(unsigned char),
+	};
+
+	if (!fincore_validate_mode(mode))
+		return -EINVAL;
+
+	f = fdget(fd);
+
+	if (is_file_hugepages(f.file))
+		pc_shift = huge_page_shift(hstate_file(f.file));
+
+	if (!IS_ALIGNED(start, 1 << pc_shift)) {
+		ret = -EINVAL;
+		goto fput;
+	}
+
+	/*
+	 * TODO: support /dev/mem, /proc/pid/mem for system/process wide
+	 * page survey, which would obsolete /proc/kpageflags, and
+	 * /proc/pid/pagemap.
+	 */
+	if (!S_ISREG(file_inode(f.file)->i_mode)) {
+		ret = -EBADF;
+		goto fput;
+	}
+
+	fc.pgstart = start >> pc_shift;
+	fc.nr_pages = nr_pages;
+	fc.mapping = f.file->f_mapping;
+	if (mode & FINCORE_LONGENTRY_MASK)
+		fc.width = ((mode & FINCORE_PFN ? 1 : 0) +
+			    (mode & FINCORE_PAGE_FLAGS ? 1 : 0) +
+			    (mode & FINCORE_PAGECACHE_TAGS ? 1 : 0) +
+			    (mode & FINCORE_SKIP_HOLE ? 1 : 0)
+			) * sizeof(unsigned long);
+	step = min(fc.nr_pages, FINCORE_LOOP_STEP);
+
+	fc.buffer_size = step * fc.width;
+	fc.buffer = kzalloc(fc.buffer_size, GFP_TEMPORARY);
+	if (!fc.buffer) {
+		ret = -ENOMEM;
+		goto fput;
+	}
+
+	while (true) {
+		ret = do_fincore(&fc, min(step, fc.nr_pages));
+		/* Reached the end of the file */
+		if (ret == 0) {
+			ret = nr;
+			break;
+		}
+		if (ret < 0)
+			break;
+		if (copy_to_user(vec + nr * fc.width,
+				 fc.buffer, ret * fc.width)) {
+			ret = -EFAULT;
+			break;
+		}
+		fc.nr_pages -= ret;
+		fc.pgstart = fc.scanned_offset + 1;
+		nr += ret;
+		/* Completed scanning the requested numbers of pages */
+		if (fc.nr_pages == 0) {
+			ret = nr;
+			break;
+		}
+		/* Clear buffer for next do_fincore() */
+		memset(fc.buffer, 0, step * fc.width);
+	}
+
+	kfree(fc.buffer);
+fput:
+	fdput(f);
+	return ret;
+}
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/3] selftest: add test code for fincore()
  2014-06-02  5:24       ` [RFC][PATCH 0/3] mm: introduce fincore() Naoya Horiguchi
  2014-06-02  5:24         ` [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration Naoya Horiguchi
  2014-06-02  5:24         ` [PATCH 2/3] mm: introduce fincore() Naoya Horiguchi
@ 2014-06-02  5:24         ` Naoya Horiguchi
  2 siblings, 0 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2014-06-02  5:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm

This patch adds simple test programs for fincore(), which contains the
following testcase:
- test_unaligned_start_address
- test_unaligned_start_address_hugetlb
- test_invalid_mode
- test_smallfile_bytemap
- test_smallfile_pfn
- test_smallfile_multientry
- test_largefile_pfn
- test_largefile_pfn_offset
- test_largefile_pfn_overrun
- test_tmpfs_pfn
- test_hugetlb_pfn
- test_largefile_pfn_skiphole
- test_smallfile_pfn_skiphole
-
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/fincore/Makefile           |  31 ++
 .../selftests/fincore/create_hugetlbfs_file.c      |  49 +++
 tools/testing/selftests/fincore/fincore.c          | 153 +++++++++
 tools/testing/selftests/fincore/run_fincoretests   | 355 +++++++++++++++++++++
 5 files changed, 589 insertions(+)
 create mode 100644 tools/testing/selftests/fincore/Makefile
 create mode 100644 tools/testing/selftests/fincore/create_hugetlbfs_file.c
 create mode 100644 tools/testing/selftests/fincore/fincore.c
 create mode 100644 tools/testing/selftests/fincore/run_fincoretests

diff --git v3.15-rc7.orig/tools/testing/selftests/Makefile v3.15-rc7/tools/testing/selftests/Makefile
index 32487ed18354..820813f571fb 100644
--- v3.15-rc7.orig/tools/testing/selftests/Makefile
+++ v3.15-rc7/tools/testing/selftests/Makefile
@@ -10,6 +10,7 @@ TARGETS += timers
 TARGETS += vm
 TARGETS += powerpc
 TARGETS += user
+TARGETS += fincore
 
 all:
 	for TARGET in $(TARGETS); do \
diff --git v3.15-rc7.orig/tools/testing/selftests/fincore/Makefile v3.15-rc7/tools/testing/selftests/fincore/Makefile
new file mode 100644
index 000000000000..ab4361c70da5
--- /dev/null
+++ v3.15-rc7/tools/testing/selftests/fincore/Makefile
@@ -0,0 +1,31 @@
+# Makefile for vm selftests
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+        ARCH := X86
+        CFLAGS := -DCONFIG_X86_32 -D__i386__
+endif
+ifeq ($(ARCH),x86_64)
+        ARCH := X86
+        CFLAGS := -DCONFIG_X86_64 -D__x86_64__
+endif
+
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+CFLAGS += -I../../../../usr/include/
+CFLAGS += -I../../../../arch/x86/include/
+
+BINARIES = fincore create_hugetlbfs_file
+
+all: $(BINARIES)
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+	@/bin/sh ./run_fincoretests || (echo "fincoretests: [FAIL]"; exit 1)
+
+clean:
+	$(RM) $(BINARIES)
diff --git v3.15-rc7.orig/tools/testing/selftests/fincore/create_hugetlbfs_file.c v3.15-rc7/tools/testing/selftests/fincore/create_hugetlbfs_file.c
new file mode 100644
index 000000000000..a46ccf0af5f2
--- /dev/null
+++ v3.15-rc7/tools/testing/selftests/fincore/create_hugetlbfs_file.c
@@ -0,0 +1,49 @@
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdlib.h>
+
+#define err(x) (perror(x), exit(1))
+
+unsigned long default_hugepage_size(void)
+{
+	unsigned long hps = 0;
+	char *line = NULL;
+	size_t linelen = 0;
+	FILE *f = fopen("/proc/meminfo", "r");
+	if (!f)
+		err("open /proc/meminfo");
+	while (getline(&line, &linelen, f) > 0) {
+		if (sscanf(line, "Hugepagesize:	%lu kB", &hps) == 1) {
+			hps <<= 10;
+			break;
+		}
+	}
+	free(line);
+	return hps;
+}
+
+int main(int argc, char **argv)
+{
+	int ret;
+	int fd;
+	char *p;
+	unsigned long hpsize = default_hugepage_size();
+	fd = open(argv[1], O_RDWR|O_CREAT);
+	if (fd == -1)
+		err("open");
+	p = mmap(NULL, 10 * hpsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+	if (p == (void *)-1)
+		err("mmap");
+	memset(p, 'a', 3 * hpsize);
+	memset(p + 7 * hpsize, 'a', 3 * hpsize - 1);
+	ret = close(fd);
+	if (ret == -1)
+		err("close");
+	return 0;
+}
diff --git v3.15-rc7.orig/tools/testing/selftests/fincore/fincore.c v3.15-rc7/tools/testing/selftests/fincore/fincore.c
new file mode 100644
index 000000000000..b089fe5f6bdf
--- /dev/null
+++ v3.15-rc7/tools/testing/selftests/fincore/fincore.c
@@ -0,0 +1,153 @@
+/*
+ * fincore(2) test program
+ */
+
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <getopt.h>
+#include <assert.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+
+#define err(x) (perror(x), exit(1))
+
+#define FINCORE_BMAP		0x01
+#define FINCORE_PFN		0x02
+#define FINCORE_PAGE_FLAGS	0x04
+#define FINCORE_PAGECACHE_TAGS	0x08
+#define FINCORE_SKIP_HOLE	0x10
+
+#define FINCORE_MODE_MASK	0x1f
+#define FINCORE_LONGENTRY_MASK	(FINCORE_PFN | FINCORE_PAGE_FLAGS | \
+				 FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
+
+static int sys_fincore(int fd, loff_t start, size_t len, int mode,
+		       unsigned char *vec)
+{
+	return syscall(__NR_fincore, fd, start, len, mode, vec);
+}
+
+static int call_fincore(int fd, loff_t start, size_t len, int mode,
+			unsigned char *vec)
+{
+	return sys_fincore(fd, start, len, mode, vec);
+}
+
+void usage(char *str)
+{
+	printf(
+		"Usage: %s [-s start] [-l len] [-m mode] [-p pagesize] file\n"
+		"  -s: start offset (in bytes)\n"
+		"  -l: length to scan (in bytes)\n"
+		"  -m: fincore mode\n"
+		"  -p: set page size (for hugepage)\n"
+		"  -h: show this message\n"
+		, str);
+	exit(EXIT_SUCCESS);
+}
+
+static void show_fincore_buffer(long start, long nr_pages, int records_per_page,
+				int mode, unsigned char *buf)
+{
+	int i, j;
+	unsigned char *curuc = (unsigned char *)buf;
+	unsigned long *curul = (unsigned long *)buf;
+
+	for (i = 0; i < nr_pages; i++) {
+		j = 0;
+		if (mode & FINCORE_BMAP)
+			printf("0x%lx\t%d", start + i, curuc[i + j]);
+		else if (mode & (FINCORE_LONGENTRY_MASK)) {
+			if (mode & FINCORE_SKIP_HOLE)
+				printf("0x%lx", curul[i * records_per_page + (j++)]);
+			else
+				printf("0x%lx", start + i);
+			if (mode & FINCORE_PFN)
+				printf("\t0x%lx", curul[i * records_per_page + (j++)]);
+			if (mode & FINCORE_PAGE_FLAGS)
+				printf("\t0x%lx", curul[i * records_per_page + (j++)]);
+			if (mode & FINCORE_PAGECACHE_TAGS)
+				printf("\t0x%lx", curul[i * records_per_page + (j++)]);
+		}
+		printf("\n");
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	char c;
+	int fd;
+	int ret;
+	int mode = FINCORE_PFN;
+	int width = sizeof(unsigned char);
+	int records_per_page = 1;
+	long pagesize = sysconf(_SC_PAGESIZE);
+	long nr_pages;
+	unsigned long start = 0;
+	unsigned long len = 0;
+	unsigned char *buf;
+	struct stat stat;
+
+	while ((c = getopt(argc, argv, "s:l:m:p:h")) != -1) {
+		switch (c) {
+		case 's':
+			start = strtoul(optarg, NULL, 0);
+			break;
+		case 'l':
+			len = strtoul(optarg, NULL, 0);
+			break;
+		case 'm':
+			mode = strtoul(optarg, NULL, 0);
+			break;
+		case 'p':
+			pagesize = strtoul(optarg, NULL, 0);
+			break;
+		case 'h':
+		default:
+			usage(argv[0]);
+		}
+	}
+
+	fd = open(argv[optind], O_RDWR);
+	if (fd == -1)
+		err("open failed.");
+
+	/* scan to the end of file by default */
+	if (!len) {
+		ret = fstat(fd, &stat);
+		if (ret == -1)
+			err("fstat failed.");
+		len = stat.st_size - start;
+	}
+
+	if (mode & FINCORE_LONGENTRY_MASK) {
+		records_per_page = ((mode & FINCORE_PFN ? 1 : 0) +
+				    (mode & FINCORE_PAGE_FLAGS ? 1 : 0) +
+				    (mode & FINCORE_PAGECACHE_TAGS ? 1 : 0) +
+				    (mode & FINCORE_SKIP_HOLE ? 1 : 0)
+			);
+		width = records_per_page * sizeof(unsigned long);
+	}
+
+	nr_pages = ((len + pagesize - 1) & (~(pagesize - 1))) / pagesize;
+	buf = malloc(nr_pages * width);
+	if (!buf)
+		err("malloc");
+	ret = call_fincore(fd, start, nr_pages, mode, buf);
+	if (ret < 0)
+		err("fincore");
+	/*
+	 * print buffer to stdout, and parse it later for validation check.
+	 * fincore() returns the number of entries written to the buffer.
+	 */
+	show_fincore_buffer(start / pagesize, ret, records_per_page, mode, buf);
+	ret = close(fd);
+	if (ret < 0)
+		err("close");
+	return 0;
+}
diff --git v3.15-rc7.orig/tools/testing/selftests/fincore/run_fincoretests v3.15-rc7/tools/testing/selftests/fincore/run_fincoretests
new file mode 100644
index 000000000000..3775ac339860
--- /dev/null
+++ v3.15-rc7/tools/testing/selftests/fincore/run_fincoretests
@@ -0,0 +1,355 @@
+#!/bin/bash
+
+WDIR=./fincore_work
+mkdir $WDIR 2> /dev/null
+TMPF=`mktemp --tmpdir=$WDIR -d`
+
+#
+# common routines
+#
+abort() {
+    echo "Test abort"
+    exit 1
+}
+
+create_small_file() {
+    dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 > /dev/null 2>&1
+    dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 seek=8> /dev/null 2>&1
+    date >> $WDIR/smallfile
+    sync
+}
+
+create_large_file() {
+    dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 > /dev/null 2>&1
+    dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 seek=640> /dev/null 2>&1
+    sync
+}
+
+create_tmpfs_file() {
+    dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 > /dev/null 2>&1
+    dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 seek=8> /dev/null 2>&1
+    date >> /tmp/tmpfile
+    sync
+}
+
+create_hugetlb_file() {
+    if mount | grep $WDIR/hugepages > /dev/null ; then
+        echo "$WDIR/hugepages already mounted"
+    else
+        mkdir -p $WDIR/hugepages 2> /dev/null
+        mount -t hugetlbfs none $WDIR/hugepages
+        if [ $? -ne 0 ] ; then
+            echo "Failed to mount hugetlbfs" >&2
+            return 1
+        fi
+    fi
+    local hptotal=$(grep HugePages_Total: /proc/meminfo | tr -s ' ' | cut -f2 -d' ')
+    if [ "$hptotal" -lt 10 ] ; then
+        echo "Hugepage pool size need to be >= 10" >&2
+        return 1
+    fi
+    ./create_hugetlbfs_file $WDIR/hugepages/file
+    if [ $? -ne 0 ] ; then
+        echo "Failed to create hugetlb file" >&2
+        return 1
+    fi
+    return 0;
+}
+
+#
+# Testcases
+#
+test_smallfile_bytemap() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x1 $WDIR/smallfile > $TMPF/$FUNCNAME
+    exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 1 | wc -l)
+    nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0 | wc -l)
+    if [ "$exist" -ne 9 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 9, but got $exist"
+        return 1
+    fi
+    if [ "$nonexist" -ne 4 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of hole pages should be 4, but got $nonexist"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+    return 0
+}
+
+test_smallfile_pfn() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x2 $WDIR/smallfile > $TMPF/$FUNCNAME
+    exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    if [ "$exist" -ne 9 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 9, but got $exist"
+        return 1
+    fi
+    if [ "$nonexist" -ne 4 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of hole pages should be 4, but got $nonexist"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+    return 0
+}
+
+test_smallfile_multientry() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0xe $WDIR/smallfile > $TMPF/$FUNCNAME
+    exist=$(cat $TMPF/$FUNCNAME | cut -f 2,3,4 | grep -vP "0x0\t0x0\t0x0" | wc -l)
+    nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2,3,4 | grep -P "0x0\t0x0\t0x0" | wc -l)
+    if [ "$exist" -ne 9 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 9, but got $exist"
+        return 1
+    fi
+    if [ "$nonexist" -ne 4 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of hole pages should be 4, but got $nonexist"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+    return 0
+}
+
+test_smallfile_pfn_skiphole() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x12 $WDIR/smallfile > $TMPF/$FUNCNAME
+    exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    if [ "$exist" -ne 9 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 9, but got $exist"
+        return 1
+    fi
+    if [ "$nonexist" -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of hole pages should be 0, but got $nonexist"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+    return 0
+}
+
+# in-kernel function sys_fincore() repeat copy_to_user() per 256 entries,
+# so testing for large file is meaningful testcase.
+test_largefile_pfn() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x2 $WDIR/largefile > $TMPF/$FUNCNAME
+    exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    if [ "$exist" -ne 768 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 768, but got $exist"
+        return 1
+    fi
+    if [ "$nonexist" -ne 256 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of hole pages should be 256, but got $nonexist"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+    return 0
+}
+
+test_largefile_pfn_offset() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x2 -s 0x80000 $WDIR/largefile > $TMPF/$FUNCNAME
+    exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    if [ "$exist" -ne 640 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 640, but got $exist"
+        return 1
+    fi
+    if [ "$nonexist" -ne 256 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of hole pages should be 256, but got $nonexist"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+    return 0
+}
+
+test_largefile_pfn_overrun() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x2 -s 0x80000 -l 0x400000 $WDIR/largefile > $TMPF/$FUNCNAME
+    exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    if [ "$exist" -ne 640 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 640, but got $exist"
+        return 1
+    fi
+    if [ "$nonexist" -ne 256 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of hole pages should be 256, but got $nonexist"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+    return 0
+}
+
+test_largefile_pfn_skiphole() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x12 -s 0x100000 -l 0x102000 $WDIR/largefile > $TMPF/$FUNCNAME
+    exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    if [ "$exist" -ne 258 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 258, but got $exist"
+        return 1
+    fi
+    if [ "$nonexist" -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of hole pages should be 0, but got $nonexist"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+    return 0
+}
+
+test_tmpfs_pfn() {
+    local exist
+    local nonexist
+    create_tmpfs_file
+
+    ./fincore -m 0x2 /tmp/tmpfile > $TMPF/$FUNCNAME
+    exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    if [ "$exist" -ne 9 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 9, but got $exist"
+        return 1
+    fi
+    if [ "$nonexist" -ne 4 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of hole pages should be 4, but got $nonexist"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+    return 0
+}
+
+test_hugetlb_pfn() {
+    local exist
+    local nonexist
+    local exitcode=0
+    create_hugetlb_file
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fail to create a file on hugetlbfs"
+        return 1
+    fi
+    local hugepagesize=$[$(cat /proc/meminfo  | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+    ./fincore -p $hugepagesize -m 0x2 $WDIR/hugepages/file > $TMPF/$FUNCNAME
+    exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    rm -rf $WDIR/hugepages/file
+    umount $WDIR/hugepages
+    if [ "$exist" -ne 6 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 6, but got $exist"
+        return 1
+     fi
+    if [ "$nonexist" -ne 4 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of hole pages should be 4, but got $nonexist"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_unaligned_start_address() {
+    create_small_file
+    ./fincore -m 0x2 -s 0x30 $WDIR/smallfile > $TMPF/$FUNCNAME
+    if [ $? -eq 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fincore should fail for unaligned start address"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_unaligned_start_address_hugetlb() {
+    local exist
+    local nonexist
+    local exitcode=0
+    create_hugetlb_file
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fail to create a file on hugetlbfs"
+        return 1
+    fi
+    local hugepagesize=$[$(cat /proc/meminfo  | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+    ./fincore -p $hugepagesize -m 0x2 -s 0x1000 $WDIR/hugepages/file > $TMPF/$FUNCNAME
+    if [ $? -eq 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fincore should fail for page-unaligned start address"
+        return 1
+    fi
+    ./fincore -p $hugepagesize -m 0x2 -s $hugepagesize $WDIR/hugepages/file > $TMPF/$FUNCNAME
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fincore should pass for hugepage-aligned start address"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_mode() {
+    create_small_file
+    ./fincore -m 0x0 $WDIR/smallfile > $TMPF/$FUNCNAME
+    if [ $? -eq 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == NULL is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x3 $WDIR/smallfile > $TMPF/$FUNCNAME
+    if [ $? -eq 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_PFN) is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x11 $WDIR/smallfile > $TMPF/$FUNCNAME
+    if [ $? -eq 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_SKIP_HOLE) is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x12 $WDIR/smallfile > $TMPF/$FUNCNAME
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_PFN|FINCORE_SKIP_HOLE) is valid mode"
+        return 1
+    fi
+    ./fincore -m 0x10 $WDIR/smallfile > $TMPF/$FUNCNAME
+    if [ $? -eq 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_SKIP_HOLE) is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x1002 $WDIR/smallfile > $TMPF/$FUNCNAME
+    if [ $? -eq 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (Unknown|FINCORE_PFN) is invalid mode"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_unaligned_start_address           || abort
+test_unaligned_start_address_hugetlb   || abort
+test_invalid_mode                      || abort
+test_smallfile_bytemap                 || abort
+test_smallfile_pfn                     || abort
+test_smallfile_multientry              || abort
+test_largefile_pfn                     || abort
+test_largefile_pfn_offset              || abort
+test_largefile_pfn_overrun             || abort
+test_tmpfs_pfn                         || abort
+test_hugetlb_pfn                       || abort
+test_largefile_pfn_skiphole            || abort
+test_smallfile_pfn_skiphole            || abort
+
+# cleanup
+rm -rf $WDIR
+
+exit 0
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/3] mm: introduce fincore()
  2014-06-02  5:24         ` [PATCH 2/3] mm: introduce fincore() Naoya Horiguchi
@ 2014-06-02  6:42           ` Christoph Hellwig
  2014-06-02  7:06           ` Michael Kerrisk
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2014-06-02  6:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, linux-man

Please also provide a man page for the system call.

I'm also very unhappy about the crazy different interpretation of the
return value depending on flags, which probably becomes more obvious if
you try to document it.

That being said I think fincore is useful, but why not stick to the
same simple interface as mincore?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/3] mm: introduce fincore()
  2014-06-02  5:24         ` [PATCH 2/3] mm: introduce fincore() Naoya Horiguchi
  2014-06-02  6:42           ` Christoph Hellwig
@ 2014-06-02  7:06           ` Michael Kerrisk
  2014-06-02 12:23           ` Kirill A. Shutemov
  2014-06-02 16:11           ` Dave Hansen
  3 siblings, 0 replies; 22+ messages in thread
From: Michael Kerrisk @ 2014-06-02  7:06 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	Linux Kernel, linux-mm, Christoph Hellwig,
	Michael Kerrisk-manpages, Linux API

Hello Naoya,

As Christoph noted, it would be best to provide some good user
documentation for this proposed system call, to aid design review.

Also, as per Documentation/SubmitChecklist, patches that change the
kernel-userspace API/ABI should CC
linux-api@vger.kernel.org (see
https://www.kernel.org/doc/man-pages/linux-api-ml.html).

Thanks,

Michael


On Mon, Jun 2, 2014 at 7:24 AM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> This patch provides a new system call fincore(2), which provides mincore()-
> like information, i.e. page residency of a given file. But unlike mincore(),
> fincore() can have a mode flag and it enables us to extract more detailed
> information about page cache like pfn and page flag. This kind of information
> is very helpful for example when applications want to know the file cache
> status to control IO on their own way.
>
> Detail about the data format being passed to userspace are explained in
> inline comment, but generally in long entry format, we can choose which
> information is extraced flexibly, so you don't have to waste memory by
> extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
> we can skip hole pages (not on memory,) which makes us avoid a flood of
> meaningless zero entries when calling on extremely large (but only few
> pages of it are loaded on memory) file.
>
> Basic testset is added in a next patch on tools/testing/selftests/fincore/.
>
> [1] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  include/linux/syscalls.h         |   2 +
>  mm/Makefile                      |   2 +-
>  mm/fincore.c                     | 362 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 366 insertions(+), 1 deletion(-)
>  create mode 100644 mm/fincore.c
>
> diff --git v3.15-rc7.orig/arch/x86/syscalls/syscall_64.tbl v3.15-rc7/arch/x86/syscalls/syscall_64.tbl
> index 04376ac3d9ef..0a6b6dd77708 100644
> --- v3.15-rc7.orig/arch/x86/syscalls/syscall_64.tbl
> +++ v3.15-rc7/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
>  314    common  sched_setattr           sys_sched_setattr
>  315    common  sched_getattr           sys_sched_getattr
>  316    common  renameat2               sys_renameat2
> +317    common  fincore                 sys_fincore
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git v3.15-rc7.orig/include/linux/syscalls.h v3.15-rc7/include/linux/syscalls.h
> index a4a0588c5397..d625ec9cb73d 100644
> --- v3.15-rc7.orig/include/linux/syscalls.h
> +++ v3.15-rc7/include/linux/syscalls.h
> @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>                          unsigned long idx1, unsigned long idx2);
>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
> +asmlinkage long sys_fincore(int fd, loff_t start, long nr_pages,
> +                       int mode, unsigned char __user *vec);
>  #endif
> diff --git v3.15-rc7.orig/mm/Makefile v3.15-rc7/mm/Makefile
> index b484452dac57..55e1d13ddb76 100644
> --- v3.15-rc7.orig/mm/Makefile
> +++ v3.15-rc7/mm/Makefile
> @@ -18,7 +18,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
>                            mm_init.o mmu_context.o percpu.o slab_common.o \
>                            compaction.o balloon_compaction.o vmacache.o \
>                            interval_tree.o list_lru.o workingset.o \
> -                          iov_iter.o $(mmu-y)
> +                          iov_iter.o fincore.o $(mmu-y)
>
>  obj-y += init-mm.o
>
> diff --git v3.15-rc7.orig/mm/fincore.c v3.15-rc7/mm/fincore.c
> new file mode 100644
> index 000000000000..3fc3ef465471
> --- /dev/null
> +++ v3.15-rc7/mm/fincore.c
> @@ -0,0 +1,362 @@
> +/*
> + * fincore(2) system call
> + *
> + * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:
> + *
> + * - FINCORE_BMAP:
> + *     The page status is returned in a vector of bytes.
> + *     The least significant bit of each byte is 1 if the referenced page
> + *     is in memory, otherwise it is zero.
> + *
> + * - FINCORE_PFN:
> + *     stores pfn, using 8 bytes.
> + *
> + * - FINCORE_PAGEFLAGS:
> + *     stores page flags, using 8 bytes. See definition of KPF_* for details.
> + *
> + * - FINCORE_PAGECACHE_TAGS:
> + *     stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
> + *     for details.
> + *
> + * - FINCORE_SKIP_HOLE: if this flag is set, fincore() doesn't store any
> + *     information about hole. Instead each records per page has the entry
> + *     of page offset (using 8 bytes.) This mode is useful if we handle
> + *     large file and only few pages are on memory for the file.
> + *
> + * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
> + * data in this mode is like this:
> + *
> + *   page offset  0   1   2   3   4
> + *              +---+---+---+---+---+
> + *              | 1 | 0 | 0 | 1 | 1 | ...
> + *              +---+---+---+---+---+
> + *               <->
> + *              1 byte
> + *
> + * For FINCORE_PFN, page data is formatted like this:
> + *
> + *   page offset    0       1       2       3       4
> + *              +-------+-------+-------+-------+-------+
> + *              |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
> + *              +-------+-------+-------+-------+-------+
> + *               <----->
> + *               8 byte
> + *
> + * We can use multiple flags among FINCORE_(PFN|PAGEFLAGS|PAGECACHE_TAGS).
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:
> + *
> + *    page offset 0    page offset 1   page offset 2
> + *   +-------+-------+-------+-------+-------+-------+
> + *   |  pfn  | flags |  pfn  | flags |  pfn  | flags | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + *
> + * When FINCORE_SKIP_HOLE is set, we ignore holes and add page offset entry
> + * (8 bytes) instead. For example, the data format of mode
> + * FINCORE_PFN|FINCORE_SKIP_HOLE is like follows:
> + *
> + *   +-------+-------+-------+-------+-------+-------+
> + *   | pgoff |  pfn  | pgoff |  pfn  | pgoff |  pfn  | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + */
> +#define FINCORE_BMAP           0x01    /* bytemap mode */
> +#define FINCORE_PFN            0x02
> +#define FINCORE_PAGE_FLAGS     0x04
> +#define FINCORE_PAGECACHE_TAGS 0x08
> +#define FINCORE_SKIP_HOLE      0x10
> +
> +#define FINCORE_MODE_MASK      0x1f
> +#define FINCORE_LONGENTRY_MASK (FINCORE_PFN | FINCORE_PAGE_FLAGS | \
> +                                FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
> +
> +struct fincore_control {
> +       int mode;
> +       int width;              /* width of each entry (in bytes) */
> +       unsigned char *buffer;
> +       long buffer_size;
> +       void *cursor;           /* current position on the buffer */
> +       pgoff_t pgstart;        /* start point of page cache scan in each run
> +                                * of the while loop */
> +       long nr_pages;          /* number of pages to be copied to userspace
> +                                * (decreasing while scan proceeds) */
> +       long scanned_offset;    /* page offset of the lastest scanned page */
> +       struct address_space *mapping;
> +};
> +
> +/*
> + * TODO: doing radix_tree_tag_get() for each tag is not optimal, but no easy
> + * way without degrading finely tuned radix tree routines.
> + */
> +static unsigned long get_pagecache_tags(struct radix_tree_root *root,
> +                                       unsigned long index)
> +{
> +       int i;
> +       unsigned long tags = 0;
> +       for (i = 0; i < __NR_PAGECACHE_TAGS; i++)
> +               if (radix_tree_tag_get(root, index, i))
> +                       tags |=  1 << i;
> +       return tags;
> +}
> +
> +#define store_entry(fc, type, data) ({         \
> +       *(type *)fc->cursor = (type)data;       \
> +       fc->cursor += sizeof(type);             \
> +})
> +
> +/*
> + * Store page cache data to temporal buffer in the specified format depending
> + * on fincore mode.
> + */
> +static void __do_fincore(struct fincore_control *fc, struct page *page,
> +                        unsigned long index)
> +{
> +       VM_BUG_ON(!page);
> +       VM_BUG_ON((unsigned long)fc->cursor - (unsigned long)fc->buffer
> +                 >= fc->buffer_size);
> +       if (fc->mode & FINCORE_BMAP)
> +               store_entry(fc, unsigned char, PageUptodate(page));
> +       else if (fc->mode & (FINCORE_LONGENTRY_MASK)) {
> +               if (fc->mode & FINCORE_SKIP_HOLE)
> +                       store_entry(fc, unsigned long, index);
> +               if (fc->mode & FINCORE_PFN)
> +                       store_entry(fc, unsigned long, page_to_pfn(page));
> +               if (fc->mode & FINCORE_PAGE_FLAGS)
> +                       store_entry(fc, unsigned long, stable_page_flags(page));
> +               if (fc->mode & FINCORE_PAGECACHE_TAGS)
> +                       store_entry(fc, unsigned long,
> +                                   get_pagecache_tags(&fc->mapping->page_tree,
> +                                                      index));
> +       }
> +}
> +
> +/*
> + * Traverse page cache tree. It's assumed that temporal buffer are zeroed
> + * in advance. Due to this, we don't have to store zero entry explicitly
> + * one-by-one and we just set fc->cursor to the position of the next
> + * on-memory page.
> + *
> + * Return value is the number of pages whose data is stored in fc->buffer.
> + */
> +static long do_fincore(struct fincore_control *fc, int nr_pages)
> +{
> +       pgoff_t pgend = fc->pgstart + nr_pages;
> +       struct radix_tree_iter iter;
> +       void **slot;
> +       long nr = 0;
> +
> +       fc->cursor = fc->buffer;
> +
> +       rcu_read_lock();
> +restart:
> +       radix_tree_for_each_slot(slot, &fc->mapping->page_tree, &iter,
> +                                fc->pgstart) {
> +               long jump;
> +               struct page *page;
> +
> +               fc->scanned_offset = iter.index;
> +               /* Handle holes */
> +               jump = iter.index - fc->pgstart - nr;
> +               if (jump) {
> +                       if (!(fc->mode & FINCORE_SKIP_HOLE)) {
> +                               if (iter.index < pgend) {
> +                                       fc->cursor += jump * fc->width;
> +                                       nr = iter.index - fc->pgstart;
> +                               } else {
> +                                       /*
> +                                        * Fill remaining buffer as hole. Next
> +                                        * call should start at offset pgend.
> +                                        */
> +                                       nr = nr_pages;
> +                                       fc->scanned_offset = pgend - 1;
> +                                       break;
> +                               }
> +                       }
> +               }
> +repeat:
> +               page = radix_tree_deref_slot(slot);
> +               if (unlikely(!page))
> +                       /*
> +                        * No need to increment nr and fc->cursor, because next
> +                        * iteration should detect hole and update them there.
> +                        */
> +                       continue;
> +               else if (radix_tree_exception(page)) {
> +                       if (radix_tree_deref_retry(page)) {
> +                               /*
> +                                * Transient condition which can only trigger
> +                                * when entry at index 0 moves out of or back
> +                                * to root: none yet gotten, safe to restart.
> +                                */
> +                               WARN_ON(iter.index);
> +                               goto restart;
> +                       }
> +                       __do_fincore(fc, page, iter.index);
> +               } else {
> +                       if (!page_cache_get_speculative(page))
> +                               goto repeat;
> +
> +                       /* Has the page moved? */
> +                       if (unlikely(page != *slot)) {
> +                               page_cache_release(page);
> +                               goto repeat;
> +                       }
> +
> +                       __do_fincore(fc, page, iter.index);
> +                       page_cache_release(page);
> +               }
> +
> +               if (++nr == nr_pages)
> +                       break;
> +       }
> +       rcu_read_unlock();
> +
> +       return nr;
> +}
> +
> +static inline bool fincore_validate_mode(int mode)
> +{
> +       if (mode & ~FINCORE_MODE_MASK)
> +               return false;
> +       if (!(!!(mode & FINCORE_BMAP) ^ !!(mode & FINCORE_LONGENTRY_MASK)))
> +               return false;
> +       if ((mode & FINCORE_LONGENTRY_MASK) == FINCORE_SKIP_HOLE)
> +               return false;
> +       return true;
> +}
> +
> +#define FINCORE_LOOP_STEP      256L
> +
> +/*
> + * The fincore(2) system call
> + *
> + *  @fd:        file descriptor of the target file
> + *  @start:     starting address offset of the target file (in byte).
> + *              This should be aligned to page cache size.
> + *  @nr_pages:  the number of pages from which the page data is passed to
> + *              userspace. If you don't set FINCORE_SKIP_HOLE, it's identical
> + *             to the pages to be scanned.
> + *  @mode       fincore mode flags to determine the entry's format
> + *  @vec        pointer of the userspace buffer. The size must be equal to or
> + *              larger than (@nr_pages * width), where width is the size of
> + *              each entry.
> + *
> + * fincore() returns the memory residency status and additional info (like
> + * pfn and page flags) of the given file's pages.
> + *
> + * Depending on the fincore mode, caller can receive the different formatted
> + * information. See the comment on definition of FINCORE_* above.
> + *
> + * Because the status of a page can change after fincore() checks it
> + * but before it returns to the application, the returned vector may
> + * contain stale information.
> + *
> + * return values:
> + *  -EBADF:     @fd isn't a valid open file descriptor
> + *  -EFAULT:    @vec points to an illegal address
> + *  -EINVAL:    @start is not aligned to page cache size, or @mode is invalid
> + *  non-negative value: fincore() is successfully done and the value means
> + *              the number of valid entries stored on userspace buffer
> + */
> +SYSCALL_DEFINE5(fincore, int, fd, loff_t, start, long, nr_pages,
> +               int, mode, unsigned char __user *, vec)
> +{
> +       long ret;
> +       long step;
> +       long nr = 0;
> +       int pc_shift = PAGE_CACHE_SHIFT;
> +       struct fd f;
> +
> +       struct fincore_control fc = {
> +               .mode   = mode,
> +               .width  = sizeof(unsigned char),
> +       };
> +
> +       if (!fincore_validate_mode(mode))
> +               return -EINVAL;
> +
> +       f = fdget(fd);
> +
> +       if (is_file_hugepages(f.file))
> +               pc_shift = huge_page_shift(hstate_file(f.file));
> +
> +       if (!IS_ALIGNED(start, 1 << pc_shift)) {
> +               ret = -EINVAL;
> +               goto fput;
> +       }
> +
> +       /*
> +        * TODO: support /dev/mem, /proc/pid/mem for system/process wide
> +        * page survey, which would obsolete /proc/kpageflags, and
> +        * /proc/pid/pagemap.
> +        */
> +       if (!S_ISREG(file_inode(f.file)->i_mode)) {
> +               ret = -EBADF;
> +               goto fput;
> +       }
> +
> +       fc.pgstart = start >> pc_shift;
> +       fc.nr_pages = nr_pages;
> +       fc.mapping = f.file->f_mapping;
> +       if (mode & FINCORE_LONGENTRY_MASK)
> +               fc.width = ((mode & FINCORE_PFN ? 1 : 0) +
> +                           (mode & FINCORE_PAGE_FLAGS ? 1 : 0) +
> +                           (mode & FINCORE_PAGECACHE_TAGS ? 1 : 0) +
> +                           (mode & FINCORE_SKIP_HOLE ? 1 : 0)
> +                       ) * sizeof(unsigned long);
> +       step = min(fc.nr_pages, FINCORE_LOOP_STEP);
> +
> +       fc.buffer_size = step * fc.width;
> +       fc.buffer = kzalloc(fc.buffer_size, GFP_TEMPORARY);
> +       if (!fc.buffer) {
> +               ret = -ENOMEM;
> +               goto fput;
> +       }
> +
> +       while (true) {
> +               ret = do_fincore(&fc, min(step, fc.nr_pages));
> +               /* Reached the end of the file */
> +               if (ret == 0) {
> +                       ret = nr;
> +                       break;
> +               }
> +               if (ret < 0)
> +                       break;
> +               if (copy_to_user(vec + nr * fc.width,
> +                                fc.buffer, ret * fc.width)) {
> +                       ret = -EFAULT;
> +                       break;
> +               }
> +               fc.nr_pages -= ret;
> +               fc.pgstart = fc.scanned_offset + 1;
> +               nr += ret;
> +               /* Completed scanning the requested numbers of pages */
> +               if (fc.nr_pages == 0) {
> +                       ret = nr;
> +                       break;
> +               }
> +               /* Clear buffer for next do_fincore() */
> +               memset(fc.buffer, 0, step * fc.width);
> +       }
> +
> +       kfree(fc.buffer);
> +fput:
> +       fdput(f);
> +       return ret;
> +}
> --
> 1.9.3
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/3] mm: introduce fincore()
  2014-06-02  5:24         ` [PATCH 2/3] mm: introduce fincore() Naoya Horiguchi
  2014-06-02  6:42           ` Christoph Hellwig
  2014-06-02  7:06           ` Michael Kerrisk
@ 2014-06-02 12:23           ` Kirill A. Shutemov
  2014-06-02 16:11           ` Dave Hansen
  3 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2014-06-02 12:23 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm

On Mon, Jun 02, 2014 at 01:24:58AM -0400, Naoya Horiguchi wrote:
> This patch provides a new system call fincore(2), which provides mincore()-
> like information, i.e. page residency of a given file. But unlike mincore(),
> fincore() can have a mode flag and it enables us to extract more detailed
> information about page cache like pfn and page flag. This kind of information
> is very helpful for example when applications want to know the file cache
> status to control IO on their own way.
> 
> Detail about the data format being passed to userspace are explained in
> inline comment, but generally in long entry format, we can choose which
> information is extraced flexibly, so you don't have to waste memory by
> extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
> we can skip hole pages (not on memory,) which makes us avoid a flood of
> meaningless zero entries when calling on extremely large (but only few
> pages of it are loaded on memory) file.
> 
> Basic testset is added in a next patch on tools/testing/selftests/fincore/.
> 
> [1] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

...

> diff --git v3.15-rc7.orig/mm/fincore.c v3.15-rc7/mm/fincore.c
> new file mode 100644
> index 000000000000..3fc3ef465471
> --- /dev/null
> +++ v3.15-rc7/mm/fincore.c
> @@ -0,0 +1,362 @@
> +/*
> + * fincore(2) system call
> + *
> + * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:
> + *
> + * - FINCORE_BMAP:
> + *     The page status is returned in a vector of bytes.
> + *     The least significant bit of each byte is 1 if the referenced page
> + *     is in memory, otherwise it is zero.

I'm okay with bytemap. Just wounder why not bitmap?

> + *
> + * - FINCORE_PFN:
> + *     stores pfn, using 8 bytes.
> + *
> + * - FINCORE_PAGEFLAGS:
> + *     stores page flags, using 8 bytes. See definition of KPF_* for details.
> + *
> + * - FINCORE_PAGECACHE_TAGS:
> + *     stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
> + *     for details.

Is it safe to expose this info to unprivilaged process (consider all three
flags above)?

> + * - FINCORE_SKIP_HOLE: if this flag is set, fincore() doesn't store any
> + *     information about hole. Instead each records per page has the entry
> + *     of page offset (using 8 bytes.) This mode is useful if we handle
> + *     large file and only few pages are on memory for the file.

Hm.. It's probably overkill, but instead of filling userspace buffer we
could return file descriptor and define lseek(SEEK_HOLE). Just thinking.

> + *
> + * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
> + * data in this mode is like this:
> + *
> + *   page offset  0   1   2   3   4
> + *              +---+---+---+---+---+
> + *              | 1 | 0 | 0 | 1 | 1 | ...
> + *              +---+---+---+---+---+
> + *               <->
> + *              1 byte
> + *
> + * For FINCORE_PFN, page data is formatted like this:
> + *
> + *   page offset    0       1       2       3       4
> + *              +-------+-------+-------+-------+-------+
> + *              |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
> + *              +-------+-------+-------+-------+-------+
> + *               <----->
> + *               8 byte
> + *
> + * We can use multiple flags among FINCORE_(PFN|PAGEFLAGS|PAGECACHE_TAGS).
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:
> + *
> + *    page offset 0    page offset 1   page offset 2
> + *   +-------+-------+-------+-------+-------+-------+
> + *   |  pfn  | flags |  pfn  | flags |  pfn  | flags | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + *
> + * When FINCORE_SKIP_HOLE is set, we ignore holes and add page offset entry
> + * (8 bytes) instead. For example, the data format of mode
> + * FINCORE_PFN|FINCORE_SKIP_HOLE is like follows:
> + *
> + *   +-------+-------+-------+-------+-------+-------+
> + *   | pgoff |  pfn  | pgoff |  pfn  | pgoff |  pfn  | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + */
> +#define FINCORE_BMAP		0x01	/* bytemap mode */
> +#define FINCORE_PFN		0x02
> +#define FINCORE_PAGE_FLAGS	0x04
> +#define FINCORE_PAGECACHE_TAGS	0x08
> +#define FINCORE_SKIP_HOLE	0x10

FINCORE_SKIP_HOLE is greater then FINCORE_PFN but pgoff precedes pfn in
records. It's confusing. We need clear definition of record format.

What about rename FINCORE_SKIP_HOLE -> FINCORE_PGOFF, move it before
FINCORE_PFN. So FINCORE_PGOFF is less than FINCORE_PFN, which is less than
FINCORE_PAGE_FLAGS, which is less than FINCORE_PAGECACHE_TAGS. It matches
order in records:

FINCORE_PGOFF|FINCORE_PFN|FINCORE_PAGEFLAGS|FINCORE_PAGECACHE_TAGS

 +-------+-------+-------+-------+-------+-------+-------+-------+
 | pgoff |  pfn  | flags |  tags | pgoff |  pfn  | flags |  tags | ...
 +-------+-------+-------+-------+-------+-------+-------+-------+
  <-----------------------------> <------------------------------>
             32 bytes                        32 bytes

> +
> +#define FINCORE_MODE_MASK	0x1f
> +#define FINCORE_LONGENTRY_MASK	(FINCORE_PFN | FINCORE_PAGE_FLAGS | \
> +				 FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
> +
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/3] mm: introduce fincore()
  2014-06-02  5:24         ` [PATCH 2/3] mm: introduce fincore() Naoya Horiguchi
                             ` (2 preceding siblings ...)
  2014-06-02 12:23           ` Kirill A. Shutemov
@ 2014-06-02 16:11           ` Dave Hansen
  3 siblings, 0 replies; 22+ messages in thread
From: Dave Hansen @ 2014-06-02 16:11 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Kirill A. Shutemov

On 06/01/2014 10:24 PM, Naoya Horiguchi wrote:
> Detail about the data format being passed to userspace are explained in
> inline comment, but generally in long entry format, we can choose which
> information is extraced flexibly, so you don't have to waste memory by
> extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
> we can skip hole pages (not on memory,) which makes us avoid a flood of
> meaningless zero entries when calling on extremely large (but only few
> pages of it are loaded on memory) file.

Something similar could be useful for hugetlbfs too.  For a 1GB page,
it's pretty silly to do 2^18 entries which essentially repeat the same
data in an interface like this.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration
  2014-06-02  5:24         ` [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration Naoya Horiguchi
@ 2014-06-02 16:12           ` Dave Hansen
       [not found]             ` <1401727052-f7v7kykv@n-horiguchi@ah.jp.nec.com>
  2014-06-02 21:16             ` Andrew Morton
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Hansen @ 2014-06-02 16:12 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm

On 06/01/2014 10:24 PM, Naoya Horiguchi wrote:
> -#define PAGECACHE_TAG_DIRTY	0
> -#define PAGECACHE_TAG_WRITEBACK	1
> -#define PAGECACHE_TAG_TOWRITE	2
> +enum {
> +	PAGECACHE_TAG_DIRTY,
> +	PAGECACHE_TAG_WRITEBACK,
> +	PAGECACHE_TAG_TOWRITE,
> +	__NR_PAGECACHE_TAGS,
> +};

Doesn't this end up exposing kernel-internal values out to a userspace
interface?  Wouldn't that lock these values in to the ABI?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration
       [not found]             ` <1401727052-f7v7kykv@n-horiguchi@ah.jp.nec.com>
@ 2014-06-02 16:45               ` Dave Hansen
       [not found]                 ` <538cb12a.8518c20a.1a51.ffff9761SMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2014-06-02 16:45 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: akpm, koct9i, fengguang.wu, acme, bp, kirill, hannes, rusty,
	davem, andres, linux-kernel, linux-mm

On 06/02/2014 09:37 AM, Naoya Horiguchi wrote:
> On Mon, Jun 02, 2014 at 09:12:25AM -0700, Dave Hansen wrote:
>> > On 06/01/2014 10:24 PM, Naoya Horiguchi wrote:
>>> > > -#define PAGECACHE_TAG_DIRTY	0
>>> > > -#define PAGECACHE_TAG_WRITEBACK	1
>>> > > -#define PAGECACHE_TAG_TOWRITE	2
>>> > > +enum {
>>> > > +	PAGECACHE_TAG_DIRTY,
>>> > > +	PAGECACHE_TAG_WRITEBACK,
>>> > > +	PAGECACHE_TAG_TOWRITE,
>>> > > +	__NR_PAGECACHE_TAGS,
>>> > > +};
>> > 
>> > Doesn't this end up exposing kernel-internal values out to a userspace
>> > interface?  Wouldn't that lock these values in to the ABI?
> Yes, that would. I hope these PAGECACHE_TAG_* stuff is very basic
> things and will never change drastically in the future (only added),
> so it's unlikely to bother people about ABI breakage things.

OK, so if I'm writing a userspace program, which header do I include
pull these values in to my program?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration
       [not found]                 ` <538cb12a.8518c20a.1a51.ffff9761SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-06-02 18:19                   ` Dave Hansen
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Hansen @ 2014-06-02 18:19 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm

On 06/02/2014 10:14 AM, Naoya Horiguchi wrote:
> Yes, that's necessary to consider (but I haven't done, sorry),
> so I'm thinking of moving this definition to the new file
> include/uapi/linux/pagecache.h and let it be imported from the
> userspace programs. Is it fine?

Yep, although I'd probably also explicitly separate the definitions of
the user-exposed ones from the kernel-internal ones.  We want to make
this hard to screw up.

I can see why we might want to expose dirty and writeback out to
userspace, especially since we already expose the aggregate, system-wide
view in /proc/meminfo.  But, what about PAGECACHE_TAG_TOWRITE?  I really
can't think of a good reason why userspace would ever care about it or
consider it different from PAGECACHE_TAG_DIRTY.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration
  2014-06-02 16:12           ` Dave Hansen
       [not found]             ` <1401727052-f7v7kykv@n-horiguchi@ah.jp.nec.com>
@ 2014-06-02 21:16             ` Andrew Morton
  1 sibling, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2014-06-02 21:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Naoya Horiguchi, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm

On Mon, 02 Jun 2014 09:12:25 -0700 Dave Hansen <dave.hansen@intel.com> wrote:

> On 06/01/2014 10:24 PM, Naoya Horiguchi wrote:
> > -#define PAGECACHE_TAG_DIRTY	0
> > -#define PAGECACHE_TAG_WRITEBACK	1
> > -#define PAGECACHE_TAG_TOWRITE	2
> > +enum {
> > +	PAGECACHE_TAG_DIRTY,
> > +	PAGECACHE_TAG_WRITEBACK,
> > +	PAGECACHE_TAG_TOWRITE,
> > +	__NR_PAGECACHE_TAGS,
> > +};
> 
> Doesn't this end up exposing kernel-internal values out to a userspace
> interface?  Wouldn't that lock these values in to the ABI?

Yes, we should be careful here.  We should not do anything which
constrains future kernel code or which causes any form of
compatibility/migration issues.

I wonder if we can do something smart with the interface.  For example
when userspace calls sys_fincore() it must explicitly ask for
PAGECACHE_TAG_DIRTY and if some future kernel doesn't implement
PAGECACHE_TAG_DIRTY, it can return -EINVAL.

Or maybe it can succeed, but tells userspace "you didn't get
PAGECACHE_TAG_DIRTY".

<thinking out loud>

So userspace sends a mask of bits which select what fields it wants. 
The kernel returns a mask of bits which tell userspace what it actually
received.

Or something like that - you get the idea ;)


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-06-02 21:17 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-21  2:26 [PATCH 0/4] pagecache scanning with /proc/kpagecache Naoya Horiguchi
2014-05-21  2:26 ` [PATCH 1/4] radix-tree: add end_index to support ranged iteration Naoya Horiguchi
2014-05-21  8:21   ` Konstantin Khlebnikov
2014-05-21  2:26 ` [PATCH 2/4] fs/proc/page.c: introduce /proc/kpagecache interface Naoya Horiguchi
2014-05-21  2:26 ` [PATCH 3/4] tools/vm/page-types.c: rework on file cache scanning mode Naoya Horiguchi
2014-05-21  2:26 ` [PATCH 4/4] Documentation: update Documentation/vm/pagemap.txt Naoya Horiguchi
2014-05-21 22:42 ` [PATCH 0/4] pagecache scanning with /proc/kpagecache Andrew Morton
     [not found]   ` <537d5ee4.4914e00a.5672.ffff85d5SMTPIN_ADDED_BROKEN@mx.google.com>
2014-05-22  2:33     ` Andrew Morton
2014-05-22  9:50       ` Konstantin Khlebnikov
2014-05-22 10:36         ` Kirill A. Shutemov
2014-06-02  5:24       ` [RFC][PATCH 0/3] mm: introduce fincore() Naoya Horiguchi
2014-06-02  5:24         ` [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration Naoya Horiguchi
2014-06-02 16:12           ` Dave Hansen
     [not found]             ` <1401727052-f7v7kykv@n-horiguchi@ah.jp.nec.com>
2014-06-02 16:45               ` Dave Hansen
     [not found]                 ` <538cb12a.8518c20a.1a51.ffff9761SMTPIN_ADDED_BROKEN@mx.google.com>
2014-06-02 18:19                   ` Dave Hansen
2014-06-02 21:16             ` Andrew Morton
2014-06-02  5:24         ` [PATCH 2/3] mm: introduce fincore() Naoya Horiguchi
2014-06-02  6:42           ` Christoph Hellwig
2014-06-02  7:06           ` Michael Kerrisk
2014-06-02 12:23           ` Kirill A. Shutemov
2014-06-02 16:11           ` Dave Hansen
2014-06-02  5:24         ` [PATCH 3/3] selftest: add test code for fincore() Naoya Horiguchi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).