* [PATCH 0/4] mm: introduce fincore() v2
From: Naoya Horiguchi @ 2014-07-03 21:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Michael Kerrisk,
	Linux API, Naoya Horiguchi

This is the second version of the fincore patchset.

In the previous discussion[1], I received a lot of feedback on the following
points:
- robust ABI handling is needed (especially for PAGECACHE_TAG_*)
- a man page is necessary
- the parameters and return value of sys_fincore() need improvement
- the order of the FINCORE_* bits and the order of the 8-byte entries in
  the buffer should be identical
so I have addressed these in this version.

Any comments and reviews are welcome.

[1] http://lwn.net/Articles/601020/

Thanks,
Naoya Horiguchi
---
Tree: git@github.com:Naoya-Horiguchi/linux.git
Branch: v3.16-rc3/fincore.ver2
---
Summary:

Naoya Horiguchi (4):
      define PAGECACHE_TAG_* as enumeration under include/uapi
      mm: introduce fincore()
      selftests/fincore: add test code for fincore()
      man2/fincore.2: document general description about fincore(2)

 arch/x86/syscalls/syscall_64.tbl                   |   1 +
 include/linux/fs.h                                 |   9 +-
 include/linux/syscalls.h                           |   4 +
 include/uapi/linux/pagecache.h                     | 111 ++++++
 man2/fincore.2                                     | 383 ++++++++++++++++++++
 mm/Makefile                                        |   2 +-
 mm/fincore.c                                       | 322 +++++++++++++++++
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/fincore/Makefile           |  31 ++
 .../selftests/fincore/create_hugetlbfs_file.c      |  49 +++
 tools/testing/selftests/fincore/fincore.c          | 166 +++++++++
 tools/testing/selftests/fincore/run_fincoretests   | 401 +++++++++++++++++++++
 12 files changed, 1471 insertions(+), 9 deletions(-)

* [PATCH v2 1/4] define PAGECACHE_TAG_* as enumeration under include/uapi
From: Naoya Horiguchi @ 2014-07-03 21:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Michael Kerrisk,
	Linux API, Naoya Horiguchi

We need the pagecache tags to be exported to userspace later in this
series for fincore(2), so this patch moves their definitions to a new
include file in preparation. We also need the number of pagecache tags,
so this patch adds that as well.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/fs.h             |  9 +--------
 include/uapi/linux/pagecache.h | 17 +++++++++++++++++
 2 files changed, 18 insertions(+), 8 deletions(-)
 create mode 100644 include/uapi/linux/pagecache.h

diff --git v3.16-rc3.orig/include/linux/fs.h v3.16-rc3/include/linux/fs.h
index e11d60cc867b..ae4a953bd5f3 100644
--- v3.16-rc3.orig/include/linux/fs.h
+++ v3.16-rc3/include/linux/fs.h
@@ -32,6 +32,7 @@
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
+#include <uapi/linux/pagecache.h>
 
 struct export_operations;
 struct hd_geometry;
@@ -446,14 +447,6 @@ struct block_device {
 	struct mutex		bd_fsfreeze_mutex;
 };
 
-/*
- * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
- * radix trees
- */
-#define PAGECACHE_TAG_DIRTY	0
-#define PAGECACHE_TAG_WRITEBACK	1
-#define PAGECACHE_TAG_TOWRITE	2
-
 int mapping_tagged(struct address_space *mapping, int tag);
 
 /*
diff --git v3.16-rc3.orig/include/uapi/linux/pagecache.h v3.16-rc3/include/uapi/linux/pagecache.h
new file mode 100644
index 000000000000..15e879f7395f
--- /dev/null
+++ v3.16-rc3/include/uapi/linux/pagecache.h
@@ -0,0 +1,17 @@
+#ifndef _UAPI_LINUX_PAGECACHE_H
+#define _UAPI_LINUX_PAGECACHE_H
+
+/*
+ * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
+ * radix trees.
+ */
+enum {
+	PAGECACHE_TAG_DIRTY,
+	PAGECACHE_TAG_WRITEBACK,
+	PAGECACHE_TAG_TOWRITE,
+	__NR_PAGECACHE_TAGS,
+};
+
+#define PAGECACHE_TAG_MASK	((1UL << __NR_PAGECACHE_TAGS) - 1)
+
+#endif /* _UAPI_LINUX_PAGECACHE_H */
-- 
1.9.3


* [PATCH v2 2/4] mm: introduce fincore()
From: Naoya Horiguchi @ 2014-07-03 21:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Michael Kerrisk,
	Linux API, Naoya Horiguchi

This patch provides a new system call, fincore(2), which provides
mincore()-like information, i.e. the page residency of a given file. Unlike
mincore(), fincore() has a mode flag which allows us to extract detailed
information about the page cache, such as pfns and page flags. This kind of
information is helpful, for example, when applications want to know the file
cache status in order to control I/O in their own way.

The details of the data format passed to userspace are explained in an
inline comment. In general, in the long entry format we can choose flexibly
which information is extracted, so you don't have to waste memory on
information you don't need. With the FINCORE_PGOFF flag, we can skip hole
pages (pages not in memory), which avoids a flood of meaningless zero
entries when calling fincore() on an extremely large file of which only a
few pages are loaded in memory.

A basic test set is added in the next patch under tools/testing/selftests/fincore/.

ChangeLog v2:
- move definition of FINCORE_* to include/uapi/linux/pagecache.h
- add another parameter fincore_extra to sys_fincore()
- rename FINCORE_SKIP_HOLE to FINCORE_PGOFF and change bit order.
- add valid argument check (start should be inside file address range,
  nr_pages should be positive)
- add end-of-file check (scan to the end of file even if the last page
  is a hole)
- add access_ok(VERIFY_WRITE) (copied from mincore())
- update inline comments

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/syscalls.h         |   4 +
 include/uapi/linux/pagecache.h   |  94 ++++++++++++
 mm/Makefile                      |   2 +-
 mm/fincore.c                     | 322 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 422 insertions(+), 1 deletion(-)
 create mode 100644 mm/fincore.c

diff --git v3.16-rc3.orig/arch/x86/syscalls/syscall_64.tbl v3.16-rc3/arch/x86/syscalls/syscall_64.tbl
index ec255a1646d2..9d291b7081ca 100644
--- v3.16-rc3.orig/arch/x86/syscalls/syscall_64.tbl
+++ v3.16-rc3/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	fincore			sys_fincore
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git v3.16-rc3.orig/include/linux/syscalls.h v3.16-rc3/include/linux/syscalls.h
index b0881a0ed322..60795ee8f9ee 100644
--- v3.16-rc3.orig/include/linux/syscalls.h
+++ v3.16-rc3/include/linux/syscalls.h
@@ -65,6 +65,7 @@ struct old_linux_dirent;
 struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
+struct fincore_extra;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -866,4 +867,7 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
 asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
 			 unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_fincore(int fd, loff_t start, long nr_pages,
+			int mode, unsigned char __user *vec,
+			struct fincore_extra __user *extra);
 #endif
diff --git v3.16-rc3.orig/include/uapi/linux/pagecache.h v3.16-rc3/include/uapi/linux/pagecache.h
index 15e879f7395f..cd53c69d6f56 100644
--- v3.16-rc3.orig/include/uapi/linux/pagecache.h
+++ v3.16-rc3/include/uapi/linux/pagecache.h
@@ -14,4 +14,98 @@ enum {
 
 #define PAGECACHE_TAG_MASK	((1UL << __NR_PAGECACHE_TAGS) - 1)
 
+/*
+ * You can control how the buffer in userspace is filled with these mode
+ * flags:
+ *
+ * - FINCORE_BMAP:
+ *     the page status is returned in a vector of bytes.
+ *     The least significant bit of each byte is 1 if the referenced page
+ *     is in memory, otherwise it is zero.
+ *
+ * - FINCORE_PGOFF:
+ *     if this flag is set, fincore() doesn't store any information about
+ *     holes. Instead, each per-page record includes a page offset entry,
+ *     using 8 bytes. This mode is useful when handling a large file of
+ *     which only a few pages are in memory.
+ *
+ * - FINCORE_PFN:
+ *     stores pfn, using 8 bytes.
+ *
+ * - FINCORE_PAGE_FLAGS:
+ *     stores page flags, using 8 bytes. See definition of KPF_* for
+ *     details of each bit.
+ *
+ * - FINCORE_PAGECACHE_TAGS:
+ *     stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
+ *     for details of each bit.
+ *
+ * FINCORE_BMAP must not be combined with any other flag. The data returned
+ * in this mode looks like this:
+ *
+ *   page offset  0   1   2   3   4
+ *              +---+---+---+---+---+
+ *              | 1 | 0 | 0 | 1 | 1 | ...
+ *              +---+---+---+---+---+
+ *               <->
+ *              1 byte
+ *
+ * For FINCORE_PFN, page data is formatted like this:
+ *
+ *   page offset    0       1       2       3       4
+ *              +-------+-------+-------+-------+-------+
+ *              |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
+ *              +-------+-------+-------+-------+-------+
+ *               <----->
+ *               8 bytes
+ *
+ * Multiple flags from FINCORE_LONGENTRY_MASK can be combined.
+ * For example, when the mode is FINCORE_PFN|FINCORE_PAGE_FLAGS, the per-page
+ * information is stored like this:
+ *
+ *    page offset 0    page offset 1   page offset 2   page offset 3
+ *                                        (hole)
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *   |  pfn  | flags |  pfn  | flags |   0   |   0   |  pfn  | flags | ...
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *    <-------------> <-------------> <-------------> <------------->
+ *       16 bytes        16 bytes        16 bytes        16 bytes
+ *
+ * When FINCORE_PGOFF is set, we store a page offset entry and skip holes.
+ * For example, the data format of mode FINCORE_PGOFF|FINCORE_PFN|
+ * FINCORE_PAGE_FLAGS|FINCORE_PAGECACHE_TAGS is as follows:
+ *
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *   | pgoff |  pfn  | flags |  tags | pgoff |  pfn  | flags |  tags | ...
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *    <-----------------------------> <----------------------------->
+ *               32 bytes                        32 bytes
+ */
+#define FINCORE_BMAP		0x01	/* bytemap mode */
+#define FINCORE_PGOFF		0x02
+#define FINCORE_PFN		0x04
+#define FINCORE_PAGE_FLAGS	0x08
+#define FINCORE_PAGECACHE_TAGS	0x10
+
+#define FINCORE_MODE_MASK	0x1f
+#define FINCORE_LONGENTRY_MASK	(FINCORE_PGOFF | FINCORE_PFN | \
+				 FINCORE_PAGE_FLAGS | FINCORE_PAGECACHE_TAGS)
+
+struct fincore_extra {
+	/*
+	 * (output) the number of entries with valid data, this is useful
+	 * if you set FINCORE_PGOFF and want to know the end of filled data.
+	 */
+	unsigned long nr_entries;
+
+	/*
+	 * (input) A mask of pagecache tags which selects what fields the
+	 * user wants.
+	 * (output) A mask of pagecache tags returned from the kernel
+	 * which tells userspace which data it actually filled.
+	 * This variable is used only when FINCORE_PAGECACHE_TAGS is set.
+	 */
+	unsigned long tags;
+};
+
 #endif /* _UAPI_LINUX_PAGECACHE_H */
diff --git v3.16-rc3.orig/mm/Makefile v3.16-rc3/mm/Makefile
index 4064f3ec145e..cc9420221afd 100644
--- v3.16-rc3.orig/mm/Makefile
+++ v3.16-rc3/mm/Makefile
@@ -18,7 +18,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   iov_iter.o $(mmu-y)
+			   iov_iter.o fincore.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git v3.16-rc3.orig/mm/fincore.c v3.16-rc3/mm/fincore.c
new file mode 100644
index 000000000000..df7226658c0d
--- /dev/null
+++ v3.16-rc3/mm/fincore.c
@@ -0,0 +1,322 @@
+/*
+ * fincore(2) system call
+ *
+ * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
+ */
+
+#include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <uapi/linux/pagecache.h>
+
+
+struct fincore_control {
+	int mode;
+	int width;		/* width of each entry (in bytes) */
+	unsigned char *buffer;
+	long buffer_size;
+	void *cursor;		/* current position on the buffer */
+	pgoff_t pgstart;	/* start point of page cache scan in each run
+				 * of the while loop */
+	long nr_pages;		/* number of pages to be copied to userspace
+				 * (decreasing while scan proceeds) */
+	long scanned_offset;	/* page offset of the latest scanned page */
+	unsigned long tags;	/* pagecache tag mask */
+	struct address_space *mapping;
+};
+
+/*
+ * TODO: doing radix_tree_tag_get() for each tag is not optimal, but no easy
+ * way without degrading finely tuned radix tree routines.
+ */
+static unsigned long get_pagecache_tags(struct fincore_control *fc,
+					unsigned long index)
+{
+	int i;
+	unsigned long tags = 0;
+	struct radix_tree_root *root = &fc->mapping->page_tree;
+
+	for (i = 0; i < __NR_PAGECACHE_TAGS; i++) {
+		if (fc->tags & (1UL << i))
+			if (radix_tree_tag_get(root, index, i))
+				tags |= 1UL << i;
+	}
+	return tags;
+}
+
+#define store_entry(fc, type, data) ({		\
+	*(type *)fc->cursor = (type)data;	\
+	fc->cursor += sizeof(type);		\
+})
+
+/*
+ * Store page cache data to the temporary buffer in the format specified by
+ * the fincore mode.
+ */
+static void __do_fincore(struct fincore_control *fc, struct page *page,
+			 unsigned long index)
+{
+	VM_BUG_ON(!page);
+	VM_BUG_ON((unsigned long)fc->cursor - (unsigned long)fc->buffer
+		  >= fc->buffer_size);
+	if (fc->mode & FINCORE_BMAP)
+		store_entry(fc, unsigned char, PageUptodate(page));
+	else if (fc->mode & (FINCORE_LONGENTRY_MASK)) {
+		if (fc->mode & FINCORE_PGOFF)
+			store_entry(fc, unsigned long, index);
+		if (fc->mode & FINCORE_PFN)
+			store_entry(fc, unsigned long, page_to_pfn(page));
+		if (fc->mode & FINCORE_PAGE_FLAGS)
+			store_entry(fc, unsigned long, stable_page_flags(page));
+		if (fc->mode & FINCORE_PAGECACHE_TAGS)
+			store_entry(fc, unsigned long,
+				    get_pagecache_tags(fc, index));
+	}
+}
+
+/*
+ * Traverse the page cache tree. The temporary buffer is assumed to be zeroed
+ * in advance. Because of this, we don't have to store zero entries
+ * explicitly one by one; we just set fc->cursor to the position of the next
+ * in-memory page.
+ *
+ * Return value is the number of pages whose data is stored in fc->buffer.
+ */
+static long do_fincore(struct fincore_control *fc, int nr_pages)
+{
+	pgoff_t pgend = fc->pgstart + nr_pages;
+	struct radix_tree_iter iter;
+	void **slot;
+	long nr = 0;
+
+	fc->cursor = fc->buffer;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &fc->mapping->page_tree, &iter,
+				 fc->pgstart) {
+		long jump;
+		struct page *page;
+
+		fc->scanned_offset = iter.index;
+		/* Handle holes */
+		jump = iter.index - fc->pgstart - nr;
+		if (jump) {
+			if (!(fc->mode & FINCORE_PGOFF)) {
+				if (iter.index < pgend) {
+					fc->cursor += jump * fc->width;
+					nr = iter.index - fc->pgstart;
+				} else {
+					/*
+					 * Fill remaining buffer as hole. Next
+					 * call should start at offset pgend.
+					 */
+					nr = nr_pages;
+					fc->scanned_offset = pgend - 1;
+					break;
+				}
+			}
+		}
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			/*
+			 * No need to increment nr and fc->cursor, because next
+			 * iteration should detect hole and update them there.
+			 */
+			continue;
+		else if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				WARN_ON(iter.index);
+				goto restart;
+			}
+			__do_fincore(fc, page, iter.index);
+		} else {
+			if (!page_cache_get_speculative(page))
+				goto repeat;
+
+			/* Has the page moved? */
+			if (unlikely(page != *slot)) {
+				page_cache_release(page);
+				goto repeat;
+			}
+
+			__do_fincore(fc, page, iter.index);
+			page_cache_release(page);
+		}
+
+		if (++nr == nr_pages)
+			break;
+	}
+
+	if (!(fc->mode & FINCORE_PGOFF)) {
+		nr = nr_pages;
+		fc->scanned_offset = pgend - 1;
+	}
+
+	rcu_read_unlock();
+
+	return nr;
+}
+
+static inline bool fincore_validate_mode(int mode)
+{
+	if (mode & ~FINCORE_MODE_MASK)
+		return false;
+	if (!(!!(mode & FINCORE_BMAP) ^ !!(mode & FINCORE_LONGENTRY_MASK)))
+		return false;
+	return true;
+}
+
+#define FINCORE_LOOP_STEP	256L
+
+/*
+ * The fincore(2) system call
+ *
+ *  @fd:        file descriptor of the target file
+ *  @start:     starting offset in the target file (in bytes).
+ *              This must be aligned to the page cache size.
+ *  @nr_pages:  the number of pages whose data is passed to userspace.
+ *  @mode:      fincore mode flags that determine the entry format
+ *  @vec:       pointer to the userspace buffer. The size must be equal to or
+ *              larger than (@nr_pages * width), where width is the size of
+ *              each entry.
+ *  @extra:     used to input/output additional information from/to userspace
+ *
+ * fincore() returns the memory residency status and additional info (like
+ * pfn and page flags) of the given file's pages.
+ *
+ * Depending on the fincore mode, the caller receives differently formatted
+ * information. See the comment on the definition of FINCORE_*.
+ *
+ * Because the status of a page can change after fincore() checks it once,
+ * the returned vector may contain stale information.
+ *
+ * return values:
+ *  -EBADF:   @fd isn't a valid open file descriptor
+ *  -EFAULT:  @vec points to an illegal address
+ *  -EINVAL:  @start is not aligned to the page cache size or is out of file
+ *            range. Or @nr_pages is non-positive. Or @mode is invalid.
+ *            Or fincore_extra is not given in FINCORE_PAGECACHE_TAGS mode.
+ *  0:        fincore() completed successfully
+ */
+SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
+		int, mode, unsigned char __user *, vec,
+		struct fincore_extra __user *, extra)
+{
+	long ret = 0;
+	long step;
+	long nr = 0;
+	long pages_to_eof;
+	int pc_shift = PAGE_CACHE_SHIFT;
+	struct fd f;
+
+	struct fincore_control fc = {
+		.mode	= mode,
+		.width	= sizeof(unsigned char),
+	};
+
+	if (start < 0 || nr_pages <= 0)
+		return -EINVAL;
+
+	if (!fincore_validate_mode(mode))
+		return -EINVAL;
+
+	f = fdget(fd);
+
+	if (is_file_hugepages(f.file))
+		pc_shift = huge_page_shift(hstate_file(f.file));
+
+	if (!IS_ALIGNED(start, 1 << pc_shift)) {
+		ret = -EINVAL;
+		goto fput;
+	}
+
+	/*
+	 * TODO: support /dev/mem, /proc/pid/mem for system/process wide
+	 * page survey, which would obsolete /proc/kpageflags, and
+	 * /proc/pid/pagemap.
+	 */
+	if (!S_ISREG(file_inode(f.file)->i_mode)) {
+		ret = -EBADF;
+		goto fput;
+	}
+
+	fc.pgstart = start >> pc_shift;
+	pages_to_eof = DIV_ROUND_UP(i_size_read(file_inode(f.file)),
+				    1UL << pc_shift) - fc.pgstart;
+	/* start is too large */
+	if (pages_to_eof <= 0) {
+		ret = -EINVAL;
+		goto fput;
+	}
+	/* Never go beyond the end of file */
+	fc.nr_pages = min(pages_to_eof, nr_pages);
+	fc.mapping = f.file->f_mapping;
+	if (mode & FINCORE_LONGENTRY_MASK)
+		fc.width = ((mode & FINCORE_PGOFF ? 1 : 0) +
+			    (mode & FINCORE_PFN ? 1 : 0) +
+			    (mode & FINCORE_PAGE_FLAGS ? 1 : 0) +
+			    (mode & FINCORE_PAGECACHE_TAGS ? 1 : 0)
+			) * sizeof(unsigned long);
+
+	if (mode & FINCORE_PAGECACHE_TAGS) {
+		if (!extra) {
+			ret = -EINVAL;
+			goto fput;
+		} else {
+			fc.tags = extra->tags & PAGECACHE_TAG_MASK;
+			__put_user(fc.tags, &extra->tags);
+		}
+	}
+
+	if (!access_ok(VERIFY_WRITE, vec, nr_pages * fc.width)) {
+		ret = -EFAULT;
+		goto fput;
+	}
+
+	step = min(fc.nr_pages, FINCORE_LOOP_STEP);
+
+	fc.buffer_size = step * fc.width;
+	fc.buffer = kmalloc(fc.buffer_size, GFP_TEMPORARY);
+	if (!fc.buffer) {
+		ret = -ENOMEM;
+		goto fput;
+	}
+
+	while (fc.nr_pages > 0) {
+		memset(fc.buffer, 0, fc.buffer_size);
+		ret = do_fincore(&fc, min(step, fc.nr_pages));
+		/* Reached the end of the file */
+		if (ret == 0)
+			break;
+		if (ret < 0)
+			break;
+		if (copy_to_user(vec + nr * fc.width,
+				 fc.buffer, ret * fc.width)) {
+			ret = -EFAULT;
+			break;
+		}
+		fc.nr_pages -= ret;
+		fc.pgstart = fc.scanned_offset + 1;
+		nr += ret;
+	}
+
+	kfree(fc.buffer);
+
+	if (extra)
+		__put_user(nr, &extra->nr_entries);
+
+fput:
+	fdput(f);
+	return ret;
+}
-- 
1.9.3



* [PATCH v2 2/4] mm: introduce fincore()
From: Naoya Horiguchi @ 2014-07-03 21:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Michael Kerrisk,
	Linux API, Naoya Horiguchi

This patch provides a new system call, fincore(2), which provides
mincore()-like information, i.e. the page residency of a given file. Unlike
mincore(), fincore() has a mode flag which allows us to extract detailed
information about the page cache, such as pfns and page flags. This kind of
information is helpful, for example, when applications want to know the file
cache status in order to control I/O in their own way.

The details of the data format passed to userspace are explained in an
inline comment. In general, in the long entry format we can choose flexibly
which information is extracted, so you don't have to waste memory on
information you don't need. With the FINCORE_PGOFF flag, we can skip hole
pages (pages not in memory), which avoids a flood of meaningless zero
entries when calling fincore() on an extremely large file of which only a
few pages are loaded in memory.

A basic test set is added in the next patch under tools/testing/selftests/fincore/.

ChangeLog v2:
- move definition of FINCORE_* to include/uapi/linux/pagecache.h
- add another parameter fincore_extra to sys_fincore()
- rename FINCORE_SKIP_HOLE to FINCORE_PGOFF and change bit order.
- add valid argument check (start should be inside file address range,
  nr_pages should be positive)
- add end-of-file check (scan to the end of file even if the last page
  is a hole)
- add access_ok(VERIFY_WRITE) (copied from mincore())
- update inline comments

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/syscalls.h         |   4 +
 include/uapi/linux/pagecache.h   |  94 ++++++++++++
 mm/Makefile                      |   2 +-
 mm/fincore.c                     | 322 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 422 insertions(+), 1 deletion(-)
 create mode 100644 mm/fincore.c

diff --git v3.16-rc3.orig/arch/x86/syscalls/syscall_64.tbl v3.16-rc3/arch/x86/syscalls/syscall_64.tbl
index ec255a1646d2..9d291b7081ca 100644
--- v3.16-rc3.orig/arch/x86/syscalls/syscall_64.tbl
+++ v3.16-rc3/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	fincore			sys_fincore
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git v3.16-rc3.orig/include/linux/syscalls.h v3.16-rc3/include/linux/syscalls.h
index b0881a0ed322..60795ee8f9ee 100644
--- v3.16-rc3.orig/include/linux/syscalls.h
+++ v3.16-rc3/include/linux/syscalls.h
@@ -65,6 +65,7 @@ struct old_linux_dirent;
 struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
+struct fincore_extra;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -866,4 +867,7 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
 asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
 			 unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_fincore(int fd, loff_t start, long nr_pages,
+			int mode, unsigned char __user *vec,
+			struct fincore_extra __user *extra);
 #endif
diff --git v3.16-rc3.orig/include/uapi/linux/pagecache.h v3.16-rc3/include/uapi/linux/pagecache.h
index 15e879f7395f..cd53c69d6f56 100644
--- v3.16-rc3.orig/include/uapi/linux/pagecache.h
+++ v3.16-rc3/include/uapi/linux/pagecache.h
@@ -14,4 +14,98 @@ enum {
 
 #define PAGECACHE_TAG_MASK	((1UL << __NR_PAGECACHE_TAGS) - 1)
 
+/*
+ * You can control how the buffer in userspace is filled with these mode
+ * flags:
+ *
+ * - FINCORE_BMAP:
+ *     the page status is returned in a vector of bytes.
+ *     The least significant bit of each byte is 1 if the referenced page
+ *     is in memory, otherwise it is zero.
+ *
+ * - FINCORE_PGOFF:
+ *     if this flag is set, fincore() doesn't store any information about
+ *     holes. Instead, each per-page record includes a page offset entry,
+ *     using 8 bytes. This mode is useful when handling a large file of
+ *     which only a few pages are in memory.
+ *
+ * - FINCORE_PFN:
+ *     stores pfn, using 8 bytes.
+ *
+ * - FINCORE_PAGE_FLAGS:
+ *     stores page flags, using 8 bytes. See definition of KPF_* for
+ *     details of each bit.
+ *
+ * - FINCORE_PAGECACHE_TAGS:
+ *     stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
+ *     for details of each bit.
+ *
+ * FINCORE_BMAP must not be combined with any other flag. The data returned
+ * in this mode looks like this:
+ *
+ *   page offset  0   1   2   3   4
+ *              +---+---+---+---+---+
+ *              | 1 | 0 | 0 | 1 | 1 | ...
+ *              +---+---+---+---+---+
+ *               <->
+ *              1 byte
+ *
+ * For FINCORE_PFN, page data is formatted like this:
+ *
+ *   page offset    0       1       2       3       4
+ *              +-------+-------+-------+-------+-------+
+ *              |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
+ *              +-------+-------+-------+-------+-------+
+ *               <----->
+ *               8 bytes
+ *
+ * Multiple flags from FINCORE_LONGENTRY_MASK can be combined.
+ * For example, when the mode is FINCORE_PFN|FINCORE_PAGE_FLAGS, the per-page
+ * information is stored like this:
+ *
+ *    page offset 0    page offset 1   page offset 2   page offset 3
+ *                                        (hole)
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *   |  pfn  | flags |  pfn  | flags |   0   |   0   |  pfn  | flags | ...
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *    <-------------> <-------------> <-------------> <------------->
+ *       16 bytes        16 bytes        16 bytes        16 bytes
+ *
+ * When FINCORE_PGOFF is set, each record starts with the page offset and
+ * holes are skipped. For example, the data format of mode
+ * FINCORE_PGOFF|FINCORE_PFN|FINCORE_PAGE_FLAGS|FINCORE_PAGECACHE_TAGS is:
+ *
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *   | pgoff |  pfn  | flags |  tags | pgoff |  pfn  | flags |  tags | ...
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *    <-----------------------------> <----------------------------->
+ *               32 bytes                        32 bytes
+ */
+#define FINCORE_BMAP		0x01	/* bytemap mode */
+#define FINCORE_PGOFF		0x02
+#define FINCORE_PFN		0x04
+#define FINCORE_PAGE_FLAGS	0x08
+#define FINCORE_PAGECACHE_TAGS	0x10
+
+#define FINCORE_MODE_MASK	0x1f
+#define FINCORE_LONGENTRY_MASK	(FINCORE_PGOFF | FINCORE_PFN | \
+				 FINCORE_PAGE_FLAGS | FINCORE_PAGECACHE_TAGS)
+
+struct fincore_extra {
+	/*
+	 * (output) the number of entries with valid data. This is useful
+	 * if you set FINCORE_PGOFF and want to know the end of filled data.
+	 */
+	unsigned long nr_entries;
+
+	/*
+	 * (input) A mask of pagecache tags which selects what fields the
+	 * user wants.
+	 * (output) A mask of pagecache tags returned from the kernel
+	 * which tells userspace which data it actually filled.
+	 * This variable is used only when FINCORE_PAGECACHE_TAGS is set.
+	 */
+	unsigned long tags;
+};
+
 #endif /* _UAPI_LINUX_PAGECACHE_H */
diff --git v3.16-rc3.orig/mm/Makefile v3.16-rc3/mm/Makefile
index 4064f3ec145e..cc9420221afd 100644
--- v3.16-rc3.orig/mm/Makefile
+++ v3.16-rc3/mm/Makefile
@@ -18,7 +18,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   iov_iter.o $(mmu-y)
+			   iov_iter.o fincore.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git v3.16-rc3.orig/mm/fincore.c v3.16-rc3/mm/fincore.c
new file mode 100644
index 000000000000..df7226658c0d
--- /dev/null
+++ v3.16-rc3/mm/fincore.c
@@ -0,0 +1,322 @@
+/*
+ * fincore(2) system call
+ *
+ * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
+ */
+
+#include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <uapi/linux/pagecache.h>
+
+
+struct fincore_control {
+	int mode;
+	int width;		/* width of each entry (in bytes) */
+	unsigned char *buffer;
+	long buffer_size;
+	void *cursor;		/* current position on the buffer */
+	pgoff_t pgstart;	/* start point of page cache scan in each run
+				 * of the while loop */
+	long nr_pages;		/* number of pages to be copied to userspace
+				 * (decreasing while scan proceeds) */
+	long scanned_offset;	/* page offset of the latest scanned page */
+	unsigned long tags;	/* pagecache tag mask */
+	struct address_space *mapping;
+};
+
+/*
+ * TODO: doing radix_tree_tag_get() for each tag is not optimal, but no easy
+ * way without degrading finely tuned radix tree routines.
+ */
+static unsigned long get_pagecache_tags(struct fincore_control *fc,
+					unsigned long index)
+{
+	int i;
+	unsigned long tags = 0;
+	struct radix_tree_root *root = &fc->mapping->page_tree;
+
+	for (i = 0; i < __NR_PAGECACHE_TAGS; i++) {
+		if (fc->tags & (1UL << i))
+			if (radix_tree_tag_get(root, index, i))
+				tags |= 1UL << i;
+	}
+	return tags;
+}
+
+#define store_entry(fc, type, data) ({		\
+	*(type *)fc->cursor = (type)data;	\
+	fc->cursor += sizeof(type);		\
+})
+
+/*
+ * Store page cache data in the temporary buffer, in the format selected by
+ * the fincore mode.
+ */
+static void __do_fincore(struct fincore_control *fc, struct page *page,
+			 unsigned long index)
+{
+	VM_BUG_ON(!page);
+	VM_BUG_ON((unsigned long)fc->cursor - (unsigned long)fc->buffer
+		  >= fc->buffer_size);
+	if (fc->mode & FINCORE_BMAP)
+		store_entry(fc, unsigned char, PageUptodate(page));
+	else if (fc->mode & (FINCORE_LONGENTRY_MASK)) {
+		if (fc->mode & FINCORE_PGOFF)
+			store_entry(fc, unsigned long, index);
+		if (fc->mode & FINCORE_PFN)
+			store_entry(fc, unsigned long, page_to_pfn(page));
+		if (fc->mode & FINCORE_PAGE_FLAGS)
+			store_entry(fc, unsigned long, stable_page_flags(page));
+		if (fc->mode & FINCORE_PAGECACHE_TAGS)
+			store_entry(fc, unsigned long,
+				    get_pagecache_tags(fc, index));
+	}
+}
+
+/*
+ * Traverse the page cache tree. The temporary buffer is assumed to be zeroed
+ * in advance, so zero entries for holes need not be stored one by one;
+ * we just advance fc->cursor to the position of the next in-memory
+ * page.
+ *
+ * Return value is the number of pages whose data is stored in fc->buffer.
+ */
+static long do_fincore(struct fincore_control *fc, int nr_pages)
+{
+	pgoff_t pgend = fc->pgstart + nr_pages;
+	struct radix_tree_iter iter;
+	void **slot;
+	long nr = 0;
+
+	fc->cursor = fc->buffer;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &fc->mapping->page_tree, &iter,
+				 fc->pgstart) {
+		long jump;
+		struct page *page;
+
+		fc->scanned_offset = iter.index;
+		/* Handle holes */
+		jump = iter.index - fc->pgstart - nr;
+		if (jump) {
+			if (!(fc->mode & FINCORE_PGOFF)) {
+				if (iter.index < pgend) {
+					fc->cursor += jump * fc->width;
+					nr = iter.index - fc->pgstart;
+				} else {
+					/*
+					 * Fill remaining buffer as hole. Next
+					 * call should start at offset pgend.
+					 */
+					nr = nr_pages;
+					fc->scanned_offset = pgend - 1;
+					break;
+				}
+			}
+		}
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			/*
+			 * No need to increment nr and fc->cursor, because next
+			 * iteration should detect hole and update them there.
+			 */
+			continue;
+		else if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				WARN_ON(iter.index);
+				goto restart;
+			}
+			__do_fincore(fc, page, iter.index);
+		} else {
+			if (!page_cache_get_speculative(page))
+				goto repeat;
+
+			/* Has the page moved? */
+			if (unlikely(page != *slot)) {
+				page_cache_release(page);
+				goto repeat;
+			}
+
+			__do_fincore(fc, page, iter.index);
+			page_cache_release(page);
+		}
+
+		if (++nr == nr_pages)
+			break;
+	}
+
+	if (!(fc->mode & FINCORE_PGOFF)) {
+		nr = nr_pages;
+		fc->scanned_offset = pgend - 1;
+	}
+
+	rcu_read_unlock();
+
+	return nr;
+}
+
+static inline bool fincore_validate_mode(int mode)
+{
+	if (mode & ~FINCORE_MODE_MASK)
+		return false;
+	if (!(!!(mode & FINCORE_BMAP) ^ !!(mode & FINCORE_LONGENTRY_MASK)))
+		return false;
+	return true;
+}
+
+#define FINCORE_LOOP_STEP	256L
+
+/*
+ * The fincore(2) system call
+ *
+ *  @fd:        file descriptor of the target file
+ *  @start:     starting offset in the target file (in bytes).
+ *              This must be aligned to the page cache size.
+ *  @nr_pages:  the number of pages whose data is passed to userspace.
+ *  @mode:      fincore mode flags that determine the entry format
+ *  @vec:       pointer to the userspace buffer. Its size must be equal to or
+ *              larger than (@nr_pages * width), where width is the size of
+ *              each entry.
+ *  @extra:     used to pass additional information from/to userspace
+ *
+ * fincore() returns the memory residency status and additional info (like
+ * pfn and page flags) of the given file's pages.
+ *
+ * Depending on the fincore mode, the caller receives differently formatted
+ * information. See the comment above the definitions of FINCORE_*.
+ *
+ * Because the status of a page can change after fincore() checks it once,
+ * the returned vector may contain stale information.
+ *
+ * return values:
+ *  -EBADF:   @fd isn't a valid open file descriptor
+ *  -EFAULT:  @vec points to an illegal address
+ *  -EINVAL:  @start is unaligned to page cache size or is out of file range.
+ *            Or @nr_pages is non-positive. Or @mode is invalid.
+ *            Or fincore_extra is not given in FINCORE_PAGECACHE_TAGS mode.
+ *  0:        fincore() is successfully done
+ */
+SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
+		int, mode, unsigned char __user *, vec,
+		struct fincore_extra __user *, extra)
+{
+	long ret = 0;
+	long step;
+	long nr = 0;
+	long pages_to_eof;
+	int pc_shift = PAGE_CACHE_SHIFT;
+	struct fd f;
+
+	struct fincore_control fc = {
+		.mode	= mode,
+		.width	= sizeof(unsigned char),
+	};
+
+	if (start < 0 || nr_pages <= 0)
+		return -EINVAL;
+
+	if (!fincore_validate_mode(mode))
+		return -EINVAL;
+
+	f = fdget(fd);
+
+	if (is_file_hugepages(f.file))
+		pc_shift = huge_page_shift(hstate_file(f.file));
+
+	if (!IS_ALIGNED(start, 1 << pc_shift)) {
+		ret = -EINVAL;
+		goto fput;
+	}
+
+	/*
+	 * TODO: support /dev/mem, /proc/pid/mem for system/process wide
+	 * page survey, which would obsolete /proc/kpageflags, and
+	 * /proc/pid/pagemap.
+	 */
+	if (!S_ISREG(file_inode(f.file)->i_mode)) {
+		ret = -EBADF;
+		goto fput;
+	}
+
+	fc.pgstart = start >> pc_shift;
+	pages_to_eof = DIV_ROUND_UP(i_size_read(file_inode(f.file)),
+				    1UL << pc_shift) - fc.pgstart;
+	/* start is too large */
+	if (pages_to_eof <= 0) {
+		ret = -EINVAL;
+		goto fput;
+	}
+	/* Never go beyond the end of file */
+	fc.nr_pages = min(pages_to_eof, nr_pages);
+	fc.mapping = f.file->f_mapping;
+	if (mode & FINCORE_LONGENTRY_MASK)
+		fc.width = ((mode & FINCORE_PGOFF ? 1 : 0) +
+			    (mode & FINCORE_PFN ? 1 : 0) +
+			    (mode & FINCORE_PAGE_FLAGS ? 1 : 0) +
+			    (mode & FINCORE_PAGECACHE_TAGS ? 1 : 0)
+			) * sizeof(unsigned long);
+
+	if (mode & FINCORE_PAGECACHE_TAGS) {
+		if (!extra) {
+			ret = -EINVAL;
+			goto fput;
+		} else {
+			fc.tags = extra->tags & PAGECACHE_TAG_MASK;
+			__put_user(fc.tags, &extra->tags);
+		}
+	}
+
+	if (!access_ok(VERIFY_WRITE, vec, nr_pages * fc.width)) {
+		ret = -EFAULT;
+		goto fput;
+	}
+
+	step = min(fc.nr_pages, FINCORE_LOOP_STEP);
+
+	fc.buffer_size = step * fc.width;
+	fc.buffer = kmalloc(fc.buffer_size, GFP_TEMPORARY);
+	if (!fc.buffer) {
+		ret = -ENOMEM;
+		goto fput;
+	}
+
+	while (fc.nr_pages > 0) {
+		memset(fc.buffer, 0, fc.buffer_size);
+		ret = do_fincore(&fc, min(step, fc.nr_pages));
+		/* Reached the end of the file */
+		if (ret == 0)
+			break;
+		if (ret < 0)
+			break;
+		if (copy_to_user(vec + nr * fc.width,
+				 fc.buffer, ret * fc.width)) {
+			ret = -EFAULT;
+			break;
+		}
+		fc.nr_pages -= ret;
+		fc.pgstart = fc.scanned_offset + 1;
+		nr += ret;
+	}
+
+	kfree(fc.buffer);
+
+	if (extra)
+		__put_user(nr, &extra->nr_entries);
+
+fput:
+	fdput(f);
+	return ret;
+}
-- 
1.9.3
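
For readers following the buffer layout described in the pagecache.h
comment above, here is a minimal userspace sketch of how a caller might
compute the entry width and decode one long-entry record. It does not
invoke the syscall. The flag values are copied from the patch; the names
fincore_entry_width(), fincore_decode() and struct fincore_record are
hypothetical helpers introduced only for illustration, and uint64_t
stands in for the kernel's 8-byte unsigned long (a 64-bit assumption).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values copied from the patch's include/uapi/linux/pagecache.h. */
#define FINCORE_BMAP           0x01
#define FINCORE_PGOFF          0x02
#define FINCORE_PFN            0x04
#define FINCORE_PAGE_FLAGS     0x08
#define FINCORE_PAGECACHE_TAGS 0x10
#define FINCORE_LONGENTRY_MASK (FINCORE_PGOFF | FINCORE_PFN | \
				FINCORE_PAGE_FLAGS | FINCORE_PAGECACHE_TAGS)

/* Hypothetical helper type: one decoded per-page record. */
struct fincore_record {
	uint64_t pgoff, pfn, flags, tags;
};

/* Bytes per entry: 1 in bytemap mode, 8 per selected long field. */
static size_t fincore_entry_width(int mode)
{
	size_t fields = 0;

	if (mode & FINCORE_BMAP)
		return 1;
	if (mode & FINCORE_PGOFF)
		fields++;
	if (mode & FINCORE_PFN)
		fields++;
	if (mode & FINCORE_PAGE_FLAGS)
		fields++;
	if (mode & FINCORE_PAGECACHE_TAGS)
		fields++;
	return fields * sizeof(uint64_t);
}

/*
 * Decode record number idx from buf (long-entry modes only).
 * Fields appear in flag-bit order. Without FINCORE_PGOFF the page
 * offset is implied by position: start_pgoff + idx.
 */
static struct fincore_record fincore_decode(const uint64_t *buf, int mode,
					    size_t idx, uint64_t start_pgoff)
{
	struct fincore_record r = { .pgoff = start_pgoff + idx };
	const uint64_t *p;

	assert(mode & FINCORE_LONGENTRY_MASK);
	p = buf + idx * (fincore_entry_width(mode) / sizeof(uint64_t));

	if (mode & FINCORE_PGOFF)
		r.pgoff = *p++;
	if (mode & FINCORE_PFN)
		r.pfn = *p++;
	if (mode & FINCORE_PAGE_FLAGS)
		r.flags = *p++;
	if (mode & FINCORE_PAGECACHE_TAGS)
		r.tags = *p++;
	return r;
}
```

With FINCORE_PGOFF set, a caller would decode only the first
extra->nr_entries records, since holes are not stored in that mode.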


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 3/4] selftests/fincore: add test code for fincore()
  2014-07-03 21:52 ` Naoya Horiguchi
@ 2014-07-03 21:52   ` Naoya Horiguchi
  -1 siblings, 0 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2014-07-03 21:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Michael Kerrisk,
	Linux API, Naoya Horiguchi

This patch adds simple test programs for fincore(), which contain the
following testcases:
  - test_smallfile_bytemap
  - test_smallfile_pfn
  - test_smallfile_multientry
  - test_smallfile_pfn_skiphole
  - test_smallfile_pagecache_tag
  - test_largefile_pfn
  - test_largefile_pfn_offset
  - test_largefile_pfn_overrun
  - test_largefile_pfn_skiphole
  - test_tmpfs_pfn
  - test_hugetlb_pfn
  - test_invalid_start_address
  - test_invalid_len
  - test_invalid_mode
  - test_unaligned_start_address_hugetlb
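
As a rough illustration of what test_invalid_mode exercises, the rule
enforced by the kernel-side fincore_validate_mode() can be mirrored in
userspace as below. This is only a sketch: mode_is_valid() and
check_selftest_modes() are hypothetical names, and the mask values are
copied from the patch's pagecache.h.

```c
#include <assert.h>
#include <stdbool.h>

#define FINCORE_BMAP           0x01
#define FINCORE_MODE_MASK      0x1f
/* PGOFF | PFN | PAGE_FLAGS | PAGECACHE_TAGS */
#define FINCORE_LONGENTRY_MASK 0x1e

/* Userspace mirror of the patch's fincore_validate_mode(). */
static bool mode_is_valid(int mode)
{
	bool bmap = mode & FINCORE_BMAP;
	bool longentry = mode & FINCORE_LONGENTRY_MASK;

	if (mode & ~FINCORE_MODE_MASK)
		return false;		/* unknown bits set */
	return bmap != longentry;	/* exactly one of the two families */
}

/* The six mode values passed to ./fincore -m by test_invalid_mode. */
static void check_selftest_modes(void)
{
	assert(!mode_is_valid(0x0));	/* no flag at all */
	assert(!mode_is_valid(0x5));	/* FINCORE_BMAP|FINCORE_PFN */
	assert(!mode_is_valid(0x3));	/* FINCORE_BMAP|FINCORE_PGOFF */
	assert(mode_is_valid(0x6));	/* FINCORE_PGOFF|FINCORE_PFN */
	assert(mode_is_valid(0x2));	/* FINCORE_PGOFF alone */
	assert(!mode_is_valid(0x1004));	/* unknown bit plus FINCORE_PFN */
}
```

The first three and the last value are expected to fail with EINVAL in
the test script; 0x6 and 0x2 are expected to succeed.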

ChangeLog v2:
- include uapi/linux/pagecache.h
- add testcase test_invalid_start_address and test_invalid_len
- other small changes to adjust for the kernel's changes

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/fincore/Makefile           |  31 ++
 .../selftests/fincore/create_hugetlbfs_file.c      |  49 +++
 tools/testing/selftests/fincore/fincore.c          | 166 +++++++++
 tools/testing/selftests/fincore/run_fincoretests   | 401 +++++++++++++++++++++
 5 files changed, 648 insertions(+)
 create mode 100644 tools/testing/selftests/fincore/Makefile
 create mode 100644 tools/testing/selftests/fincore/create_hugetlbfs_file.c
 create mode 100644 tools/testing/selftests/fincore/fincore.c
 create mode 100644 tools/testing/selftests/fincore/run_fincoretests

diff --git v3.16-rc3.orig/tools/testing/selftests/Makefile v3.16-rc3/tools/testing/selftests/Makefile
index e66e710cc595..91e817b87a9e 100644
--- v3.16-rc3.orig/tools/testing/selftests/Makefile
+++ v3.16-rc3/tools/testing/selftests/Makefile
@@ -11,6 +11,7 @@ TARGETS += vm
 TARGETS += powerpc
 TARGETS += user
 TARGETS += sysctl
+TARGETS += fincore
 
 all:
 	for TARGET in $(TARGETS); do \
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/Makefile v3.16-rc3/tools/testing/selftests/fincore/Makefile
new file mode 100644
index 000000000000..ab4361c70da5
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/Makefile
@@ -0,0 +1,31 @@
+# Makefile for fincore selftests
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+        ARCH := X86
+        CFLAGS := -DCONFIG_X86_32 -D__i386__
+endif
+ifeq ($(ARCH),x86_64)
+        ARCH := X86
+        CFLAGS := -DCONFIG_X86_64 -D__x86_64__
+endif
+
+CC = $(CROSS_COMPILE)gcc
+CFLAGS += -Wall
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+CFLAGS += -I../../../../usr/include/
+CFLAGS += -I../../../../arch/x86/include/
+
+BINARIES = fincore create_hugetlbfs_file
+
+all: $(BINARIES)
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+	@/bin/bash ./run_fincoretests || (echo "fincoretests: [FAIL]"; exit 1)
+
+clean:
+	$(RM) $(BINARIES)
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/create_hugetlbfs_file.c v3.16-rc3/tools/testing/selftests/fincore/create_hugetlbfs_file.c
new file mode 100644
index 000000000000..a46ccf0af5f2
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/create_hugetlbfs_file.c
@@ -0,0 +1,49 @@
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdlib.h>
+
+#define err(x) (perror(x), exit(1))
+
+unsigned long default_hugepage_size(void)
+{
+	unsigned long hps = 0;
+	char *line = NULL;
+	size_t linelen = 0;
+	FILE *f = fopen("/proc/meminfo", "r");
+	if (!f)
+		err("open /proc/meminfo");
+	while (getline(&line, &linelen, f) > 0) {
+		if (sscanf(line, "Hugepagesize:	%lu kB", &hps) == 1) {
+			hps <<= 10;
+			break;
+		}
+	}
+	free(line);
+	return hps;
+}
+
+int main(int argc, char **argv)
+{
+	int ret;
+	int fd;
+	char *p;
+	unsigned long hpsize = default_hugepage_size();
+	fd = open(argv[1], O_RDWR|O_CREAT, 0644);
+	if (fd == -1)
+		err("open");
+	p = mmap(NULL, 10 * hpsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+	if (p == (void *)-1)
+		err("mmap");
+	memset(p, 'a', 3 * hpsize);
+	memset(p + 7 * hpsize, 'a', 3 * hpsize - 1);
+	ret = close(fd);
+	if (ret == -1)
+		err("close");
+	return 0;
+}
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/fincore.c v3.16-rc3/tools/testing/selftests/fincore/fincore.c
new file mode 100644
index 000000000000..5722622a3b75
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/fincore.c
@@ -0,0 +1,166 @@
+/*
+ * fincore(2) test program
+ */
+
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <getopt.h>
+#include <assert.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <uapi/linux/pagecache.h>
+
+#define err(x) (perror(x), exit(1))
+
+void usage(char *str)
+{
+	printf(
+		"Usage: %s [-s start] [-l len] [-m mode] [-p pagesize] file\n"
+		"  -s: start offset (in bytes)\n"
+		"  -l: length to scan (in bytes)\n"
+		"  -m: fincore mode\n"
+		"  -p: set page size (for hugepage)\n"
+		"  -h: show this message\n"
+		, str);
+	exit(EXIT_SUCCESS);
+}
+
+static void show_fincore_buffer(long start, long nr_pages, int records_per_page,
+				int mode, unsigned char *buf)
+{
+	int i, j;
+	unsigned char *curuc = (unsigned char *)buf;
+	unsigned long *curul = (unsigned long *)buf;
+
+	for (i = 0; i < nr_pages; i++) {
+		j = 0;
+		if (mode & FINCORE_BMAP)
+			printf("buffer: 0x%lx\t%d", start + i, curuc[i + j]);
+		else if (mode & (FINCORE_LONGENTRY_MASK)) {
+			if (mode & FINCORE_PGOFF)
+				printf("buffer: 0x%lx",
+				       curul[i * records_per_page + (j++)]);
+			else
+				printf("buffer: 0x%lx", start + i);
+			if (mode & FINCORE_PFN)
+				printf("\t0x%lx",
+				       curul[i * records_per_page + (j++)]);
+			if (mode & FINCORE_PAGE_FLAGS)
+				printf("\t0x%lx",
+				       curul[i * records_per_page + (j++)]);
+			if (mode & FINCORE_PAGECACHE_TAGS)
+				printf("\t0x%lx",
+				       curul[i * records_per_page + (j++)]);
+		}
+		printf("\n");
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	int c;
+	int fd;
+	int ret;
+	int mode = FINCORE_PFN;
+	int width = sizeof(unsigned char);
+	int records_per_page = 1;
+	long pagesize = sysconf(_SC_PAGESIZE);
+	long nr_pages;
+	unsigned long start = 0;
+	int len_not_given = 1;
+	long len = 0;
+	long buffer_size = 0;
+	unsigned char *buf;
+	struct stat stat;
+	int extra = 0;
+	struct fincore_extra fe = {
+		.tags = PAGECACHE_TAG_DIRTY|PAGECACHE_TAG_WRITEBACK|
+			PAGECACHE_TAG_TOWRITE,
+	};
+
+	while ((c = getopt(argc, argv, "s:l:m:p:et:b:h")) != -1) {
+		switch (c) {
+		case 's':
+			start = strtoul(optarg, NULL, 0);
+			break;
+		case 'l':
+			len_not_given = 0;
+			len = strtol(optarg, NULL, 0);
+			break;
+		case 'm':
+			mode = strtoul(optarg, NULL, 0);
+			break;
+		case 'p':
+			pagesize = strtoul(optarg, NULL, 0);
+			break;
+		case 'e':
+			extra = 1;
+			break;
+		case 't':
+			fe.tags = strtoul(optarg, NULL, 0);
+			break;
+		case 'b':
+			buffer_size = strtoul(optarg, NULL, 0);
+			break;
+		case 'h':
+		default:
+			usage(argv[0]);
+		}
+	}
+
+	fd = open(argv[optind], O_RDWR);
+	if (fd == -1)
+		err("open failed.");
+
+	/* scan to the end of file by default */
+	if (len_not_given) {
+		ret = fstat(fd, &stat);
+		if (ret == -1)
+			err("fstat failed.");
+		len = stat.st_size - start;
+	}
+
+	if (mode & FINCORE_LONGENTRY_MASK) {
+		records_per_page = ((mode & FINCORE_PGOFF ? 1 : 0) +
+				    (mode & FINCORE_PFN ? 1 : 0) +
+				    (mode & FINCORE_PAGE_FLAGS ? 1 : 0) +
+				    (mode & FINCORE_PAGECACHE_TAGS ? 1 : 0)
+			);
+		width = records_per_page * sizeof(unsigned long);
+	}
+
+	nr_pages = ((len + pagesize - 1) & (~(pagesize - 1))) / pagesize;
+	printf("start:0x%lx, len:%ld, mode:%d, pagesize:0x%lx, "
+	       "tags:0x%lx,\n buffer_size:0x%lx, nr_pages:0x%lx, width:%d\n",
+	       start, len, mode, pagesize, fe.tags,
+	       buffer_size, nr_pages, width);
+	buf = malloc(buffer_size > 0 ? buffer_size : nr_pages * width);
+	if (!buf)
+		err("malloc");
+
+	ret = syscall(__NR_fincore, fd, start, nr_pages, mode, buf,
+		      extra ? &fe : NULL);
+	if (ret < 0)
+		err("fincore");
+	/*
+	 * print buffer to stdout, and parse it later for validation check.
+	 * fincore() returns the number of entries written to the buffer.
+	 */
+	show_fincore_buffer(start / pagesize, nr_pages, records_per_page,
+			    mode, buf);
+
+	if (extra) {
+		printf("fincore_extra->nr_entries: %ld\n", fe.nr_entries);
+		printf("fincore_extra->tags: 0x%lx\n", fe.tags);
+	}
+
+	ret = close(fd);
+	if (ret < 0)
+		err("close");
+	return 0;
+}
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/run_fincoretests v3.16-rc3/tools/testing/selftests/fincore/run_fincoretests
new file mode 100644
index 000000000000..99c89f915b30
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/run_fincoretests
@@ -0,0 +1,401 @@
+#!/bin/bash
+
+WDIR=./fincore_work
+mkdir $WDIR 2> /dev/null
+TMPF=`mktemp --tmpdir=$WDIR -d`
+export LANG=C
+
+sysctl -q vm.nr_hugepages=50
+
+#
+# common routines
+#
+abort() {
+    echo "Test abort"
+    exit 1
+}
+
+create_small_file() {
+    dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 > /dev/null 2>&1
+    dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 seek=8 > /dev/null 2>&1
+    date >> $WDIR/smallfile
+    sync
+}
+
+create_large_file() {
+    dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 > /dev/null 2>&1
+    dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 seek=640 > /dev/null 2>&1
+    sync
+}
+
+create_tmpfs_file() {
+    dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 > /dev/null 2>&1
+    dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 seek=8 > /dev/null 2>&1
+    date >> /tmp/tmpfile
+    sync
+}
+
+create_hugetlb_file() {
+    if mount | grep $WDIR/hugepages > /dev/null ; then
+        echo "$WDIR/hugepages already mounted"
+    else
+        mkdir -p $WDIR/hugepages 2> /dev/null
+        mount -t hugetlbfs none $WDIR/hugepages 2> /dev/null
+        if [ $? -ne 0 ] ; then
+            echo "Failed to mount hugetlbfs" >&2
+            return 1
+        fi
+    fi
+    local hptotal=$(grep HugePages_Total: /proc/meminfo | tr -s ' ' | cut -f2 -d' ')
+    if [ "$hptotal" -lt 10 ] ; then
+        echo "Hugepage pool size needs to be >= 10" >&2
+        return 1
+    fi
+    ./create_hugetlbfs_file $WDIR/hugepages/file
+    if [ $? -ne 0 ] ; then
+        echo "Failed to create hugetlb file" >&2
+        return 1
+    fi
+    return 0;
+}
+
+get_buffer() {
+    cat "$1" | grep '^buffer:' | cut -f 2- -d ' '
+}
+
+get_fincore_extra_nr_entries() {
+    cat "$1" | grep '^fincore_extra->nr_entries' | cut -f 2 -d ' '
+}
+
+get_fincore_extra_tags() {
+    cat "$1" | grep '^fincore_extra->tags' | cut -f 2 -d ' '
+}
+
+nr_of_exist_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of on-memory pages should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+nr_of_nonexist_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of hole entries should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+nr_of_valid_entries_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of valid entries should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+check_einval() {
+    grep "fincore: Invalid argument" "$1" > /dev/null
+}
+
+#
+# Testcases
+#
+test_smallfile_bytemap() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x1 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 1 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_pfn() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x4 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_multientry() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x1c -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2,3,4 | grep -vP "0x0\t0x0\t0x0" | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2,3,4 | grep -P "0x0\t0x0\t0x0" | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_pfn_skiphole() {
+    local exist
+    local nonexist
+    local nr_entries
+    create_small_file
+
+    ./fincore -m 0x6 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 9 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_pagecache_tag() {
+    local nr_dirty
+    local fincore_extra_tags
+    create_small_file
+
+    # dirty one page
+    date >> $WDIR/smallfile
+
+    ./fincore -m 0x10 -e -t 0xff $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    nr_dirty=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x1 | wc -l)
+    fincore_extra_tags=$(get_fincore_extra_tags $TMPF/$FUNCNAME)
+    if [ "$nr_dirty" -ne 1 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of dirty pages should be 1, but got $nr_dirty"
+        return 1
+    fi
+    if [ "$fincore_extra_tags" != 0x7 ] ; then
+        echo "[FAIL] $FUNCNAME: unsupported PAGECACHE_TAG_* should be ignored."
+        return 1
+    fi
+
+    # ignore only PAGECACHE_TAG_DIRTY
+    ./fincore -m 0x10 -e -t 0x6 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    nr_dirty=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x1 | wc -l)
+    fincore_extra_tags=$(get_fincore_extra_tags $TMPF/$FUNCNAME)
+    if [ "$nr_dirty" -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of dirty pages should be 0, but got $nr_dirty"
+        return 1
+    fi
+    if [ "$fincore_extra_tags" != 0x6 ] ; then
+        echo "[FAIL] $FUNCNAME: unsupported PAGECACHE_TAG_* should be ignored."
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+# The in-kernel sys_fincore() calls copy_to_user() once per 256 entries,
+# so testing with a large file exercises the multi-chunk path.
+test_largefile_pfn() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x4 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 768 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 256 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_offset() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x4 -s 0x80000 $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 640 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 256 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_overrun() {
+    local exist
+    local nonexist
+    local nr_entries
+    create_large_file
+
+    ./fincore -m 0x4 -s 0x80000 -l 0x400000 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 640 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 384 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 896 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_skiphole() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x6 -s 0x100000 -l 0x102000 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 258 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 0 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 258 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_tmpfs_pfn() {
+    local exist
+    local nonexist
+    create_tmpfs_file
+
+    ./fincore -m 0x4 /tmp/tmpfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_hugetlb_pfn() {
+    local exist
+    local nonexist
+    local exitcode=0
+    create_hugetlb_file
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: failed to create a file on hugetlbfs"
+        return 1
+    fi
+    local hugepagesize=$[$(cat /proc/meminfo  | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+    ./fincore -p $hugepagesize -m 0x4 -e $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 6 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 10 "$nr_entries" "$FUNCNAME" || return 1
+    rm -rf $WDIR/hugepages/file
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_start_address() {
+    create_small_file
+    ./fincore -m 0x4 -s -0x4000 -l 1 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: negative start is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -s 0x100000 -l 1 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: too large start is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -s 0x30 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: fincore should fail for unaligned start address"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_len() {
+    create_small_file
+    ./fincore -m 0x4 -l 0 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: zero len is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -l -10 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: negative len is invalid"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_mode() {
+    create_small_file
+    ./fincore -m 0x0 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == 0 is an invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x5 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_PFN) is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x3 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_PGOFF) is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x6 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_PGOFF|FINCORE_PFN) is valid mode"
+        return 1
+    fi
+    ./fincore -m 0x2 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_PGOFF) is valid mode"
+        return 1
+    fi
+    ./fincore -m 0x1004 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (Unknown|FINCORE_PFN) is invalid mode"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_unaligned_start_address_hugetlb() {
+    local exist
+    local nonexist
+    local exitcode=0
+    create_hugetlb_file
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fail to create a file on hugetlbfs"
+        return 1
+    fi
+    local hugepagesize=$[$(cat /proc/meminfo  | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+    ./fincore -p $hugepagesize -m 0x4 -s 0x1000 $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: fincore should fail for page-unaligned start address"
+        return 1
+    fi
+    ./fincore -p $hugepagesize -m 0x4 -s $hugepagesize $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fincore should pass for hugepage-aligned start address"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_bytemap                 || abort
+test_smallfile_pfn                     || abort
+test_smallfile_multientry              || abort
+test_smallfile_pfn_skiphole            || abort
+test_smallfile_pagecache_tag           || abort
+test_largefile_pfn                     || abort
+test_largefile_pfn_offset              || abort
+test_largefile_pfn_overrun             || abort
+test_largefile_pfn_skiphole            || abort
+test_tmpfs_pfn                         || abort
+test_hugetlb_pfn                       || abort
+test_invalid_start_address             || abort
+test_invalid_len                       || abort
+test_invalid_mode                      || abort
+test_unaligned_start_address_hugetlb   || abort
+
+# cleanup
+rm -rf $WDIR/hugepages/file
+umount $WDIR/hugepages > /dev/null 2>&1
+
+exit 0
-- 
1.9.3
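
A note on the expected counts in test_hugetlb_pfn (6 resident, 4 holes):
they follow from the two memset() calls in create_hugetlbfs_file.c, which
touch the first three and the last three hugepages of a ten-hugepage
mapping. A minimal sketch of that arithmetic is below. The helper names
pages_touched() and expected_resident() are hypothetical and not part of
the patch.

```c
#include <stddef.h>

/*
 * Number of pages of size page_size touched by a write of len bytes
 * starting at byte offset off. (Hypothetical helper for illustration.)
 */
size_t pages_touched(size_t off, size_t len, size_t page_size)
{
	if (len == 0)
		return 0;
	return (off + len - 1) / page_size - off / page_size + 1;
}

/*
 * Hugepages expected to be resident after the two memset() calls in
 * create_hugetlbfs_file.c: 3 * hps bytes at offset 0, and
 * 3 * hps - 1 bytes at offset 7 * hps.
 */
size_t expected_resident(size_t hps)
{
	return pages_touched(0, 3 * hps, hps) +
	       pages_touched(7 * hps, 3 * hps - 1, hps);
}
```

For a 2MB hugepage size this gives 3 + 3 = 6 resident hugepages, leaving
10 - 6 = 4 holes. The "- 1" in the second memset() length does not change
the count, because the last byte touched still falls in the tenth hugepage.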


* [PATCH v2 3/4] selftests/fincore: add test code for fincore()
@ 2014-07-03 21:52   ` Naoya Horiguchi
  0 siblings, 0 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2014-07-03 21:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Michael Kerrisk,
	Linux API, Naoya Horiguchi

This patch adds simple test programs for fincore(), which contain the
following test cases:
  - test_smallfile_bytemap
  - test_smallfile_pfn
  - test_smallfile_multientry
  - test_smallfile_pfn_skiphole
  - test_smallfile_pagecache_tag
  - test_largefile_pfn
  - test_largefile_pfn_offset
  - test_largefile_pfn_overrun
  - test_largefile_pfn_skiphole
  - test_tmpfs_pfn
  - test_hugetlb_pfn
  - test_invalid_start_address
  - test_invalid_len
  - test_invalid_mode
  - test_unaligned_start_address_hugetlb

ChangeLog v2:
- include uapi/linux/pagecache.h
- add testcase test_invalid_start_address and test_invalid_len
- other small changes to adjust for the kernel's changes

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/fincore/Makefile           |  31 ++
 .../selftests/fincore/create_hugetlbfs_file.c      |  49 +++
 tools/testing/selftests/fincore/fincore.c          | 166 +++++++++
 tools/testing/selftests/fincore/run_fincoretests   | 401 +++++++++++++++++++++
 5 files changed, 648 insertions(+)
 create mode 100644 tools/testing/selftests/fincore/Makefile
 create mode 100644 tools/testing/selftests/fincore/create_hugetlbfs_file.c
 create mode 100644 tools/testing/selftests/fincore/fincore.c
 create mode 100644 tools/testing/selftests/fincore/run_fincoretests
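
For readers of the test list, the mode combinations that test_invalid_mode
accepts and rejects can be summarised in a small predicate. This is only a
sketch inferred from the test expectations in run_fincoretests; it is not
the kernel-side check from mm/fincore.c. The flag values are assumed from
the bit numbers (0..4) given in the man page patch, and mode_is_valid() and
SKETCH_MODE_MASK are hypothetical names.

```c
#include <stdbool.h>

/* Assumed values, inferred from the bit numbers in the man page. */
#define FINCORE_BMAP		(1 << 0)
#define FINCORE_PGOFF		(1 << 1)
#define FINCORE_PFN		(1 << 2)
#define FINCORE_PAGE_FLAGS	(1 << 3)
#define FINCORE_PAGECACHE_TAGS	(1 << 4)
#define FINCORE_LONGENTRY_MASK	(FINCORE_PGOFF | FINCORE_PFN | \
				 FINCORE_PAGE_FLAGS | FINCORE_PAGECACHE_TAGS)
/* Hypothetical name: every mode bit known to this sketch. */
#define SKETCH_MODE_MASK	(FINCORE_BMAP | FINCORE_LONGENTRY_MASK)

/* Returns true for the modes that test_invalid_mode expects to succeed. */
bool mode_is_valid(int mode)
{
	if (mode == 0)
		return false;	/* no format selected */
	if (mode & ~SKETCH_MODE_MASK)
		return false;	/* unknown bit, e.g. 0x1004 */
	if ((mode & FINCORE_BMAP) && (mode & FINCORE_LONGENTRY_MASK))
		return false;	/* bytemap cannot be mixed with 8-byte fields */
	return true;
}
```

The sketch does not model the extra rule from the man page that
FINCORE_PAGECACHE_TAGS also requires a non-NULL extra argument.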

diff --git v3.16-rc3.orig/tools/testing/selftests/Makefile v3.16-rc3/tools/testing/selftests/Makefile
index e66e710cc595..91e817b87a9e 100644
--- v3.16-rc3.orig/tools/testing/selftests/Makefile
+++ v3.16-rc3/tools/testing/selftests/Makefile
@@ -11,6 +11,7 @@ TARGETS += vm
 TARGETS += powerpc
 TARGETS += user
 TARGETS += sysctl
+TARGETS += fincore
 
 all:
 	for TARGET in $(TARGETS); do \
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/Makefile v3.16-rc3/tools/testing/selftests/fincore/Makefile
new file mode 100644
index 000000000000..ab4361c70da5
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/Makefile
@@ -0,0 +1,31 @@
+# Makefile for fincore selftests
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+        ARCH := X86
+        CFLAGS := -DCONFIG_X86_32 -D__i386__
+endif
+ifeq ($(ARCH),x86_64)
+        ARCH := X86
+        CFLAGS := -DCONFIG_X86_64 -D__x86_64__
+endif
+
+CC = $(CROSS_COMPILE)gcc
+CFLAGS += -Wall
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+CFLAGS += -I../../../../usr/include/
+CFLAGS += -I../../../../arch/x86/include/
+
+BINARIES = fincore create_hugetlbfs_file
+
+all: $(BINARIES)
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+	@/bin/bash ./run_fincoretests || (echo "fincoretests: [FAIL]"; exit 1)
+
+clean:
+	$(RM) $(BINARIES)
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/create_hugetlbfs_file.c v3.16-rc3/tools/testing/selftests/fincore/create_hugetlbfs_file.c
new file mode 100644
index 000000000000..a46ccf0af5f2
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/create_hugetlbfs_file.c
@@ -0,0 +1,49 @@
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdlib.h>
+
+#define err(x) (perror(x), exit(1))
+
+unsigned long default_hugepage_size(void)
+{
+	unsigned long hps = 0;
+	char *line = NULL;
+	size_t linelen = 0;
+	FILE *f = fopen("/proc/meminfo", "r");
+	if (!f)
+		err("open /proc/meminfo");
+	while (getline(&line, &linelen, f) > 0) {
+		if (sscanf(line, "Hugepagesize:	%lu kB", &hps) == 1) {
+			hps <<= 10;
+			break;
+		}
+	}
+	free(line);
+	return hps;
+}
+
+int main(int argc, char **argv)
+{
+	int ret;
+	int fd;
+	char *p;
+	unsigned long hpsize = default_hugepage_size();
+	fd = open(argv[1], O_RDWR|O_CREAT, 0644);
+	if (fd == -1)
+		err("open");
+	p = mmap(NULL, 10 * hpsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+	if (p == MAP_FAILED)
+		err("mmap");
+	memset(p, 'a', 3 * hpsize);
+	memset(p + 7 * hpsize, 'a', 3 * hpsize - 1);
+	ret = close(fd);
+	if (ret == -1)
+		err("close");
+	return 0;
+}
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/fincore.c v3.16-rc3/tools/testing/selftests/fincore/fincore.c
new file mode 100644
index 000000000000..5722622a3b75
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/fincore.c
@@ -0,0 +1,166 @@
+/*
+ * fincore(2) test program
+ */
+
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <getopt.h>
+#include <assert.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <uapi/linux/pagecache.h>
+
+#define err(x) (perror(x), exit(1))
+
+void usage(char *str)
+{
+	printf(
+		"Usage: %s [-s start] [-l len] [-m mode] [-p pagesize] file\n"
+		"  -s: start offset (in bytes)\n"
+		"  -l: length to scan (in bytes)\n"
+		"  -m: fincore mode\n"
+		"  -p: set page size (for hugepage)\n"
+		"  -h: show this message\n"
+		, str);
+	exit(EXIT_SUCCESS);
+}
+
+static void show_fincore_buffer(long start, long nr_pages, int records_per_page,
+				int mode, unsigned char *buf)
+{
+	int i, j;
+	unsigned char *curuc = (unsigned char *)buf;
+	unsigned long *curul = (unsigned long *)buf;
+
+	for (i = 0; i < nr_pages; i++) {
+		j = 0;
+		if (mode & FINCORE_BMAP)
+			printf("buffer: 0x%lx\t%d", start + i, curuc[i + j]);
+		else if (mode & (FINCORE_LONGENTRY_MASK)) {
+			if (mode & FINCORE_PGOFF)
+				printf("buffer: 0x%lx",
+				       curul[i * records_per_page + (j++)]);
+			else
+				printf("buffer: 0x%lx", start + i);
+			if (mode & FINCORE_PFN)
+				printf("\t0x%lx",
+				       curul[i * records_per_page + (j++)]);
+			if (mode & FINCORE_PAGE_FLAGS)
+				printf("\t0x%lx",
+				       curul[i * records_per_page + (j++)]);
+			if (mode & FINCORE_PAGECACHE_TAGS)
+				printf("\t0x%lx",
+				       curul[i * records_per_page + (j++)]);
+		}
+		printf("\n");
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	int c;
+	int fd;
+	int ret;
+	int mode = FINCORE_PFN;
+	int width = sizeof(unsigned char);
+	int records_per_page = 1;
+	long pagesize = sysconf(_SC_PAGESIZE);
+	long nr_pages;
+	unsigned long start = 0;
+	int len_not_given = 1;
+	long len = 0;
+	long buffer_size = 0;
+	unsigned char *buf;
+	struct stat stat;
+	int extra = 0;
+	struct fincore_extra fe = {
+		.tags = PAGECACHE_TAG_DIRTY|PAGECACHE_TAG_WRITEBACK|
+			PAGECACHE_TAG_TOWRITE,
+	};
+
+	while ((c = getopt(argc, argv, "s:l:m:p:et:b:h")) != -1) {
+		switch (c) {
+		case 's':
+			start = strtoul(optarg, NULL, 0);
+			break;
+		case 'l':
+			len_not_given = 0;
+			len = strtol(optarg, NULL, 0);
+			break;
+		case 'm':
+			mode = strtoul(optarg, NULL, 0);
+			break;
+		case 'p':
+			pagesize = strtoul(optarg, NULL, 0);
+			break;
+		case 'e':
+			extra = 1;
+			break;
+		case 't':
+			fe.tags = strtoul(optarg, NULL, 0);
+			break;
+		case 'b':
+			buffer_size = strtoul(optarg, NULL, 0);
+			break;
+		case 'h':
+		default:
+			usage(argv[0]);
+		}
+	}
+
+	fd = open(argv[optind], O_RDWR);
+	if (fd == -1)
+		err("open failed.");
+
+	/* scan to the end of file by default */
+	if (len_not_given) {
+		ret = fstat(fd, &stat);
+		if (ret == -1)
+			err("fstat failed.");
+		len = stat.st_size - start;
+	}
+
+	if (mode & FINCORE_LONGENTRY_MASK) {
+		records_per_page = ((mode & FINCORE_PGOFF ? 1 : 0) +
+				    (mode & FINCORE_PFN ? 1 : 0) +
+				    (mode & FINCORE_PAGE_FLAGS ? 1 : 0) +
+				    (mode & FINCORE_PAGECACHE_TAGS ? 1 : 0)
+			);
+		width = records_per_page * sizeof(unsigned long);
+	}
+
+	nr_pages = ((len + pagesize - 1) & (~(pagesize - 1))) / pagesize;
+	printf("start:0x%lx, len:%ld, mode:%d, pagesize:0x%lx, "
+	       "tags:0x%lx,\n buffer_size:0x%lx, nr_pages:0x%lx, width:%d\n",
+	       start, len, mode, pagesize, fe.tags,
+	       buffer_size, nr_pages, width);
+	buf = calloc(1, buffer_size > 0 ? buffer_size : nr_pages * width);
+	if (!buf)
+		err("calloc");
+
+	ret = syscall(__NR_fincore, fd, start, nr_pages, mode, buf,
+		      extra ? &fe : NULL);
+	if (ret < 0)
+		err("fincore");
+	/*
+	 * print buffer to stdout, and parse it later for validation check.
+	 * fincore() returns the number of entries written to the buffer.
+	 */
+	show_fincore_buffer(start / pagesize, nr_pages, records_per_page,
+			    mode, buf);
+
+	if (extra) {
+		printf("fincore_extra->nr_entries: %ld\n", fe.nr_entries);
+		printf("fincore_extra->tags: 0x%lx\n", fe.tags);
+	}
+
+	ret = close(fd);
+	if (ret < 0)
+		err("close");
+	return 0;
+}
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/run_fincoretests v3.16-rc3/tools/testing/selftests/fincore/run_fincoretests
new file mode 100644
index 000000000000..99c89f915b30
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/run_fincoretests
@@ -0,0 +1,401 @@
+#!/bin/bash
+
+WDIR=./fincore_work
+mkdir $WDIR 2> /dev/null
+TMPF=`mktemp --tmpdir=$WDIR -d`
+export LANG=C
+
+sysctl -q vm.nr_hugepages=50
+
+#
+# common routines
+#
+abort() {
+    echo "Test abort"
+    exit 1
+}
+
+create_small_file() {
+    dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 > /dev/null 2>&1
+    dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 seek=8 > /dev/null 2>&1
+    date >> $WDIR/smallfile
+    sync
+}
+
+create_large_file() {
+    dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 > /dev/null 2>&1
+    dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 seek=640 > /dev/null 2>&1
+    sync
+}
+
+create_tmpfs_file() {
+    dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 > /dev/null 2>&1
+    dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 seek=8 > /dev/null 2>&1
+    date >> /tmp/tmpfile
+    sync
+}
+
+create_hugetlb_file() {
+    if mount | grep $WDIR/hugepages > /dev/null ; then
+        echo "$WDIR/hugepages already mounted"
+    else
+        mkdir -p $WDIR/hugepages 2> /dev/null
+        mount -t hugetlbfs none $WDIR/hugepages 2> /dev/null
+        if [ $? -ne 0 ] ; then
+            echo "Failed to mount hugetlbfs" >&2
+            return 1
+        fi
+    fi
+    local hptotal=$(grep HugePages_Total: /proc/meminfo | tr -s ' ' | cut -f2 -d' ')
+    if [ "$hptotal" -lt 10 ] ; then
+        echo "Hugepage pool size needs to be >= 10" >&2
+        return 1
+    fi
+    ./create_hugetlbfs_file $WDIR/hugepages/file
+    if [ $? -ne 0 ] ; then
+        echo "Failed to create hugetlb file" >&2
+        return 1
+    fi
+    return 0;
+}
+
+get_buffer() {
+    cat "$1" | grep '^buffer:' | cut -f 2- -d ' '
+}
+
+get_fincore_extra_nr_entries() {
+    cat "$1" | grep '^fincore_extra->nr_entries' | cut -f 2 -d ' '
+}
+
+get_fincore_extra_tags() {
+    cat "$1" | grep '^fincore_extra->tags' | cut -f 2 -d ' '
+}
+
+nr_of_exist_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of on-memory pages should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+nr_of_nonexist_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of hole entries should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+nr_of_valid_entries_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of valid entries should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+check_einval() {
+    grep "fincore: Invalid argument" "$1" > /dev/null
+}
+
+#
+# Testcases
+#
+test_smallfile_bytemap() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x1 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 1 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_pfn() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x4 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_multientry() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x1c -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2,3,4 | grep -vP "0x0\t0x0\t0x0" | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2,3,4 | grep -P "0x0\t0x0\t0x0" | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_pfn_skiphole() {
+    local exist
+    local nonexist
+    local nr_entries
+    create_small_file
+
+    ./fincore -m 0x6 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 9 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_pagecache_tag() {
+    local nr_dirty
+    local fincore_extra_tags
+    create_small_file
+
+    # dirty one page
+    date >> $WDIR/smallfile
+
+    ./fincore -m 0x10 -e -t 0xff $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    nr_dirty=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x1 | wc -l)
+    fincore_extra_tags=$(get_fincore_extra_tags $TMPF/$FUNCNAME)
+    if [ "$nr_dirty" -ne 1 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of dirty bit should be 1, but got $nr_dirty"
+        return 1
+    fi
+    if [ "$fincore_extra_tags" != 0x7 ] ; then
+        echo "[FAIL] $FUNCNAME: unsupported PAGECACHE_TAG_* should be ignored."
+        return 1
+    fi
+
+    # ignore only PAGECACHE_TAG_DIRTY
+    ./fincore -m 0x10 -e -t 0x6 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    nr_dirty=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x1 | wc -l)
+    fincore_extra_tags=$(get_fincore_extra_tags $TMPF/$FUNCNAME)
+    if [ "$nr_dirty" -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: Number of dirty bit should be 0, but got $nr_dirty"
+        return 1
+    fi
+    if [ "$fincore_extra_tags" != 0x6 ] ; then
+        echo "[FAIL] $FUNCNAME: unsupported PAGECACHE_TAG_* should be ignored."
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+# The in-kernel function sys_fincore() repeats copy_to_user() every 256 entries,
+# so testing with a large file is a meaningful test case.
+test_largefile_pfn() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x4 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 768 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 256 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_offset() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x4 -s 0x80000 $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 640 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 256 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_overrun() {
+    local exist
+    local nonexist
+    local nr_entries
+    create_large_file
+
+    ./fincore -m 0x4 -s 0x80000 -l 0x400000 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 640 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 384 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 896 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_skiphole() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x6 -s 0x100000 -l 0x102000 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 258 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 0 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 258 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_tmpfs_pfn() {
+    local exist
+    local nonexist
+    create_tmpfs_file
+
+    ./fincore -m 0x4 /tmp/tmpfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_hugetlb_pfn() {
+    local exist
+    local nonexist
+    local nr_entries
+    create_hugetlb_file
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fail to create a file on hugetlbfs"
+        return 1
+    fi
+    local hugepagesize=$[$(cat /proc/meminfo  | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+    ./fincore -p $hugepagesize -m 0x4 -e $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 6 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 10 "$nr_entries" "$FUNCNAME" || return 1
+    rm -rf $WDIR/hugepages/file
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_start_address() {
+    create_small_file
+    ./fincore -m 0x4 -s -0x4000 -l 1 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: negative start is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -s 0x100000 -l 1 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: too large start is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -s 0x30 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: fincore should fail for unaligned start address"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_len() {
+    create_small_file
+    ./fincore -m 0x4 -l 0 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: zero len is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -l -10 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: negative len is invalid"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_mode() {
+    create_small_file
+    ./fincore -m 0x0 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == 0 is an invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x5 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_PFN) is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x3 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_PGOFF) is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x6 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_PGOFF|FINCORE_PFN) is valid mode"
+        return 1
+    fi
+    ./fincore -m 0x2 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_PGOFF) is valid mode"
+        return 1
+    fi
+    ./fincore -m 0x1004 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (Unknown|FINCORE_PFN) is invalid mode"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_unaligned_start_address_hugetlb() {
+    local exist
+    local nonexist
+    local exitcode=0
+    create_hugetlb_file
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fail to create a file on hugetlbfs"
+        return 1
+    fi
+    local hugepagesize=$[$(cat /proc/meminfo  | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+    ./fincore -p $hugepagesize -m 0x4 -s 0x1000 $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: fincore should fail for page-unaligned start address"
+        return 1
+    fi
+    ./fincore -p $hugepagesize -m 0x4 -s $hugepagesize $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fincore should pass for hugepage-aligned start address"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_bytemap                 || abort
+test_smallfile_pfn                     || abort
+test_smallfile_multientry              || abort
+test_smallfile_pfn_skiphole            || abort
+test_smallfile_pagecache_tag           || abort
+test_largefile_pfn                     || abort
+test_largefile_pfn_offset              || abort
+test_largefile_pfn_overrun             || abort
+test_largefile_pfn_skiphole            || abort
+test_tmpfs_pfn                         || abort
+test_hugetlb_pfn                       || abort
+test_invalid_start_address             || abort
+test_invalid_len                       || abort
+test_invalid_mode                      || abort
+test_unaligned_start_address_hugetlb   || abort
+
+# cleanup
+rm -rf $WDIR/hugepages/file
+umount $WDIR/hugepages > /dev/null 2>&1
+
+exit 0
-- 
1.9.3
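
The expected counts in the test_largefile_pfn* cases can be derived from the
layout written by create_large_file (pages 0..383 and 640..1023 written,
pages 384..639 left as a hole) and from the nr_pages rounding in fincore.c.
The sketch below models only the FINCORE_PFN mode (-m 0x4). It assumes a
4096-byte page size, that every written page is still in the page cache,
and that entries past the end of the file print as 0x0. The names
expected_counts(), page_is_resident() and struct scan_counts are
hypothetical.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096L	/* assumed page size */
#define SKETCH_FILE_PAGES	1024L	/* size of largefile in pages */

struct scan_counts {
	long exist;	/* printed entries with a non-zero PFN */
	long nonexist;	/* printed entries with PFN 0x0 */
	long valid;	/* entries inside the file (nr_entries) */
};

/* Same rounding as fincore.c: number of pages covering len bytes. */
long nr_pages_for(long len, long pagesize)
{
	return (len + pagesize - 1) / pagesize;
}

/* Layout written by create_large_file: two 384-page runs around a hole. */
int page_is_resident(long pgoff)
{
	return (pgoff >= 0 && pgoff < 384) || (pgoff >= 640 && pgoff < 1024);
}

/* Expected counts for "./fincore -m 0x4 -s start -l len largefile". */
struct scan_counts expected_counts(long start, long len)
{
	struct scan_counts c = { 0, 0, 0 };
	long first = start / SKETCH_PAGE_SIZE;
	long n = nr_pages_for(len, SKETCH_PAGE_SIZE);
	long i;

	for (i = 0; i < n; i++) {
		long pgoff = first + i;
		int in_file = pgoff < SKETCH_FILE_PAGES;

		if (in_file)
			c.valid++;
		if (in_file && page_is_resident(pgoff))
			c.exist++;
		else
			c.nonexist++;
	}
	return c;
}
```

With start 0x80000 and len 0x400000 (test_largefile_pfn_overrun), the scan
covers pages 128..1151, which gives 640 resident entries, 384 zero entries
(256 in the hole plus 128 past the end of the file) and 896 valid entries,
matching the values checked by the script.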


* [PATCH v2 4/4] man2/fincore.2: document general description about fincore(2)
  2014-07-03 21:52 ` Naoya Horiguchi
@ 2014-07-03 21:52   ` Naoya Horiguchi
  -1 siblings, 0 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2014-07-03 21:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Michael Kerrisk,
	Linux API, Naoya Horiguchi

This patch adds the man page for the new system call fincore(2).

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 man2/fincore.2 | 383 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 383 insertions(+)
 create mode 100644 man2/fincore.2
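
As a reading aid for the buffer layout described in the man page below, here
is a minimal sketch of the sizing rule (nr_pages bytes for FINCORE_BMAP,
8*n*nr_pages bytes otherwise) and of the field order within an entry. The
flag values are assumed from the bit numbers (0..4) given in the man page;
the authoritative definitions live in include/uapi/linux/pagecache.h. The
helpers count_bits(), fields_per_entry(), vec_size() and field_index() are
hypothetical names, not part of the proposed API.

```c
#include <stddef.h>

/* Assumed values, inferred from the bit numbers in the man page. */
#define FINCORE_BMAP		(1 << 0)
#define FINCORE_PGOFF		(1 << 1)
#define FINCORE_PFN		(1 << 2)
#define FINCORE_PAGE_FLAGS	(1 << 3)
#define FINCORE_PAGECACHE_TAGS	(1 << 4)
#define FINCORE_LONGENTRY_MASK	(FINCORE_PGOFF | FINCORE_PFN | \
				 FINCORE_PAGE_FLAGS | FINCORE_PAGECACHE_TAGS)

/* Count the set bits in v (v is small, so a simple loop is enough). */
int count_bits(unsigned int v)
{
	int n = 0;

	for (; v; v >>= 1)
		n += v & 1;
	return n;
}

/* n in the man page: the number of 8-byte fields per entry. */
int fields_per_entry(int mode)
{
	return count_bits(mode & FINCORE_LONGENTRY_MASK);
}

/* Minimum size of vec in bytes for nr_pages entries. */
size_t vec_size(int mode, size_t nr_pages)
{
	if (mode & FINCORE_BMAP)
		return nr_pages;
	return 8 * (size_t)fields_per_entry(mode) * nr_pages;
}

/* Index of flag's 8-byte field within an entry, or -1 if it is absent. */
int field_index(int mode, int flag)
{
	if (!(flag & FINCORE_LONGENTRY_MASK) || !(mode & flag))
		return -1;
	/* Fields follow bit order: count the lower-numbered flags in mode. */
	return count_bits(mode & FINCORE_LONGENTRY_MASK & (flag - 1));
}
```

For example, with mode FINCORE_PGOFF|FINCORE_PAGE_FLAGS and 13 pages,
vec_size() gives 8 * 2 * 13 = 208 bytes, the page offset is field 0 and the
page flags are field 1, which is the example given in the man page.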

diff --git v3.16-rc3.orig/man2/fincore.2 v3.16-rc3/man2/fincore.2
new file mode 100644
index 000000000000..dcc596db4fa0
--- /dev/null
+++ v3.16-rc3/man2/fincore.2
@@ -0,0 +1,383 @@
+.\" Copyright (C) 2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH FINCORE 2 2014-07-03 "Linux" "Linux Programmer's Manual"
+.SH NAME
+fincore \- get page cache information
+.SH SYNOPSIS
+.nf
+.B #include <linux/pagecache.h>
+.B #include <linux/kernel-page-flags.h>
+.sp
+.BI "int fincore(int " fd ", loff_t " start ", long " nr_pages ", int " mode ,
+.BI "            unsigned char *" vec ", struct fincore_extra *" extra );
+.fi
+.SH DESCRIPTION
+.BR fincore ()
+extracts information about the in-core data (i.e., the page cache) of
+the file referred to by the file descriptor
+.IR fd .
+The kernel scans over the page cache tree,
+starting at the in-file offset
+.I start
+(in bytes) until
+.I nr_pages
+entries in the userspace buffer pointed to by
+.IR vec
+are filled with the page cache's data,
+or until the scan reaches the end of the file.
+The format of each entry stored in
+.I vec
+depends on the
+.IR mode .
+The extra argument
+.I extra
+is used to pass additional data between the kernel and userspace.
+This argument is optional, so you may set
+.I extra
+to NULL if it is not needed.
+The structure
+.I fincore_extra
+is defined as:
+.in +4n
+.nf
+
+struct fincore_extra {
+        unsigned long nr_entries;
+        unsigned long tags;
+};
+
+.fi
+.in
+The field
+.I nr_entries
+is an output parameter, set to the number of valid entries stored in
+.IR vec
+by the kernel on return.
+The field
+.I tags
+is both an input and an output parameter, indicating the set of
+page cache tags the caller is interested in.
+For more detail,
+see the description of the FINCORE_PAGECACHE_TAGS flag below.
+
+The
+.I start
+argument must be aligned to the page cache size boundary.
+In most cases, it's the page size boundary,
+but if called for a hugetlbfs file,
+the page cache size is the size of the hugepage associated with the file,
+so
+.I start
+must be aligned to the hugepage size boundary.
+
+The
+.I mode
+argument determines the data format of each entry in the user buffer
+.IR vec :
+.TP
+.B FINCORE_BMAP (0)
+In this mode,
+a vector of 1-byte entries (one per page) is stored in
+.I vec
+on return.
+The least significant bit of each byte is set if the corresponding page
+is currently resident in memory, and is cleared otherwise.
+(The other bits in each byte are undefined and reserved for future use.)
+.LP
+Set any of the following flags to add an 8-byte field to each entry.
+These flags can be combined with each other, but none of them can be
+combined with FINCORE_BMAP.
+.TP
+.B FINCORE_PGOFF (1)
+This flag indicates that each entry contains a page offset field.
+Because each entry then identifies its own page, entries for holes
+(ranges with no page cache) are not stored in
+.I vec
+at all.
+Note that with this flag you cannot predict how many valid
+entries are stored in the buffer on return, so check the
+.I nr_entries
+field in
+.I struct fincore_extra
+to find out.
+.TP
+.B FINCORE_PFN (2)
+This flag indicates that each entry contains a page frame number
+(i.e., the physical address in units of the page size) field.
+.TP
+.B FINCORE_PAGE_FLAGS (3)
+This flag indicates that each entry contains a page flags field.
+See the KERNEL PAGE FLAGS section for more detail about each bit.
+.TP
+.B FINCORE_PAGECACHE_TAGS (4)
+This flag indicates that each entry contains a page cache tag field.
+See the PAGE CACHE TAGS section for more detail about each bit.
+Note that if you set this flag, you must pass a non-NULL
+.I extra
+and set its
+.I tags
+field to the set of page cache tags you are interested in.
+On return,
+.I tags
+is set by the kernel to the set of tags that were actually scanned.
+.LP
+The size of the buffer
+.I vec
+must be at least
+.I nr_pages
+bytes if FINCORE_BMAP is set,
+and
+.I (8*n*nr_pages)
+bytes if any of the 8-byte field flags are set,
+where
+.I n
+is the number of 8-byte field flags that are set.
+When multiple 8-byte field flags are set, the fields in each
+entry are stored in bit definition order (shown above
+as the numbers in parentheses).
+For example, when you set FINCORE_PGOFF (bit 1) and FINCORE_PAGE_FLAGS (bit 3),
+the first 8 bytes of an entry hold the page offset,
+and the second 8 bytes hold the page flags.
+
+Note that the information returned by the kernel is just a snapshot:
+pages which are not locked in memory can be freed at any moment, and
+the contents of
+.I vec
+may already be stale by the time the caller refers to the data.
+.SH KERNEL PAGE FLAGS
+.TP
+.B KPF_LOCKED (0)
+The lock on the page is held, suggesting that the kernel may be
+performing some sensitive operation on the page.
+.TP
+.B KPF_ERROR (1)
+The page was affected by an I/O error or memory error, so the data on the page
+might be lost.
+.TP
+.B KPF_REFERENCED (2)
+This page flag is used to control the page reclaim, combined with KPF_ACTIVE.
+.TP
+.B KPF_UPTODATE (3)
+The page has valid contents.
+.TP
+.B KPF_DIRTY (4)
+The data in the page is not synchronized with the copy on the backing storage.
+.TP
+.B KPF_LRU (5)
+The page is linked to one of the LRU (Least Recently Used) lists.
+.TP
+.B KPF_ACTIVE (6)
+The page is linked to one of the active LRU lists.
+.TP
+.B KPF_SLAB (7)
+The page is used to construct slabs, which are managed by the kernel
+to allocate various types of kernel objects.
+.TP
+.B KPF_WRITEBACK (8)
+The page is under the writeback operation.
+.TP
+.B KPF_RECLAIM (9)
+The page is under the page reclaim operation.
+.TP
+.B KPF_BUDDY (10)
+The page is held by the buddy allocator as a free page. Note that this flag
+is set only on the first page of the "buddy" (i.e., the chunk of free pages).
+.TP
+.B KPF_MMAP (11)
+The page is mapped into the virtual address space of at least one process.
+.TP
+.B KPF_ANON (12)
+The page is an anonymous page.
+.TP
+.B KPF_SWAPCACHE (13)
+The page has its own copy of the data on the swap device.
+.TP
+.B KPF_SWAPBACKED (14)
+The page can be swapped out. This flag is set on anonymous pages,
+tmpfs pages, and shmem pages.
+.TP
+.B KPF_COMPOUND_HEAD (15)
+The page belongs to a high-order page, and is its first page.
+.TP
+.B KPF_COMPOUND_TAIL (16)
+The page belongs to a high-order page, and is not its first page.
+.TP
+.B KPF_HUGE (17)
+The page is used to construct a hugepage.
+.TP
+.B KPF_UNEVICTABLE (18)
+The page is prevented from being freed.
+This is caused by
+.BR mlock (2)
+or shared memory with
+.BR SHM_LOCK .
+.TP
+.B KPF_HWPOISON (19)
+The page is affected by a hardware error on the memory.
+.TP
+.B KPF_NOPAGE (20)
+This is a pseudo page flag which indicates that the given address
+is not backed by a struct page.
+.TP
+.B KPF_KSM (21)
+The page is a shared page governed by KSM (Kernel Samepage Merging).
+.TP
+.B KPF_THP (22)
+The page is used to construct a transparent hugepage.
+.LP
+.SH PAGE CACHE TAGS
+.TP
+.B PAGECACHE_TAG_DIRTY
+The page is dirty.
+.TP
+.B PAGECACHE_TAG_WRITEBACK
+The page is under the writeback operation.
+.TP
+.B PAGECACHE_TAG_TOWRITE
+The writeback operation on the page will start soon.
+.LP
+.SH RETURN VALUE
+On success,
+.BR fincore ()
+returns 0.
+On error, \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EBADF
+.I fd
+is not a valid file descriptor.
+.TP
+.B EFAULT
+.I vec
+points to an invalid address.
+.TP
+.B EINVAL
+.I start
+is not aligned to the page cache size or is out of range
+(negative or larger than the file size).
+Or
+.I nr_pages
+is not a positive value.
+Or
+.I mode
+contained an undefined flag, or contained no flag,
+or contained both FINCORE_BMAP and one of the "8-byte field" flags.
+Or
+.I extra
+is NULL although the
+.I FINCORE_PAGECACHE_TAGS
+flag is set.
+.SH VERSIONS
+TBD
+.SH CONFORMING TO
+TBD
+
+.SH EXAMPLE
+.PP
+The following program prints the page cache information
+of the file specified in its first command-line argument to the standard
+output.
+
+.nf
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <linux/pagecache.h>
+
+#define err(msg) do { perror(msg); exit(1); } while (0)
+
+int main(int argc, char *argv[])
+{
+    int i, j;
+    int fd;
+    int ret;
+    long ps = sysconf(_SC_PAGESIZE);
+    long nr_pages;
+    unsigned char *buf;
+    struct stat st;
+    struct fincore_extra fe = {};
+
+    fd = open(argv[1], O_RDONLY);
+    if (fd == \-1)
+        err("open");
+
+    ret = fstat(fd, &st);
+    if (ret == \-1)
+        err("fstat");
+    nr_pages = (st.st_size + ps \- 1) / ps;
+
+    buf = malloc(nr_pages * 24);
+    if (!buf)
+        err("malloc");
+
+    /* byte map */
+    ret = fincore(fd, 0, nr_pages, FINCORE_BMAP, buf, NULL);
+    if (ret < 0)
+        err("fincore");
+    printf("Page residency:");
+    for (i = 0; i < nr_pages; i++)
+        printf("%d", buf[i] & 1);
+    printf("\\n\\n");
+
+    /* 8 byte entry */
+    ret = fincore(fd, 0, nr_pages,
+                  FINCORE_PFN|FINCORE_PAGE_FLAGS, buf, &fe);
+    if (ret < 0)
+        err("fincore");
+    printf("pfn\\tflags %lx\\n", fe.nr_entries);
+    for (i = 0; i < fe.nr_entries; i++) {
+        for (j = 0; j < 2; j++)
+            printf("0x%lx\\t", *(unsigned long *)(buf + (i*2+j)*8));
+        printf("\\n");
+    }
+    printf("\\n");
+
+    /* 8 byte entry with page offset (no hole scanned) */
+    ret = fincore(fd, 0, nr_pages,
+              FINCORE_PGOFF|FINCORE_PFN|FINCORE_PAGE_FLAGS, buf, &fe);
+    if (ret < 0)
+        err("fincore");
+    printf("pgoff\\tpfn\\tflags %lx\\n", fe.nr_entries);
+    for (i = 0; i < fe.nr_entries; i++) {
+        for (j = 0; j < 3; j++)
+            printf("0x%lx\\t", *(unsigned long *)(buf + (i*3+j)*8));
+        printf("\\n");
+    }
+
+    free(buf);
+
+    ret = close(fd);
+    if (ret < 0)
+        err("close");
+    return 0;
+}
+.fi
+.SH SEE ALSO
+.BR mincore (2),
+.BR fsync (2)
-- 
1.9.3
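
The buffer sizing and field ordering rules in the DESCRIPTION above can be
sketched in C as follows. This is a minimal sketch, not code from the patch:
the flag values assume that the numbers in parentheses in the man page are
bit positions in mode, and EIGHT_BYTE_FLAGS, fields_per_entry(), vec_size()
and field_index() are hypothetical names introduced only for illustration.

```c
#include <stddef.h>

/* Assumed values: bit positions taken from the numbers in parentheses
 * in the man page text, not from a released kernel header. */
#define FINCORE_BMAP            (1UL << 0)
#define FINCORE_PGOFF           (1UL << 1)
#define FINCORE_PFN             (1UL << 2)
#define FINCORE_PAGE_FLAGS      (1UL << 3)
#define FINCORE_PAGECACHE_TAGS  (1UL << 4)

/* Hypothetical mask of all flags that add an 8-byte field. */
#define EIGHT_BYTE_FLAGS (FINCORE_PGOFF | FINCORE_PFN | \
                          FINCORE_PAGE_FLAGS | FINCORE_PAGECACHE_TAGS)

/* Number of 8-byte fields per entry: one per 8-byte flag set in mode. */
size_t fields_per_entry(unsigned long mode)
{
    size_t n = 0;
    unsigned long m = mode & EIGHT_BYTE_FLAGS;

    while (m) {
        n += m & 1;
        m >>= 1;
    }
    return n;
}

/* Minimum size of vec in bytes, or 0 if mode is one the man page
 * lists under EINVAL (no flag, undefined flag, BMAP mixed with others). */
size_t vec_size(unsigned long mode, size_t nr_pages)
{
    if (mode == FINCORE_BMAP)
        return nr_pages;                 /* one byte per page */
    if (mode == 0 || (mode & ~EIGHT_BYTE_FLAGS))
        return 0;                        /* invalid combination */
    return 8 * fields_per_entry(mode) * nr_pages;
}

/* Position of the field for `flag` (a single bit) inside one entry,
 * counted in 8-byte units, or -1 if that flag is not set in mode. */
int field_index(unsigned long mode, unsigned long flag)
{
    if (!(mode & flag & EIGHT_BYTE_FLAGS))
        return -1;
    /* Fields follow bit order, so count the set flags below this one. */
    return (int)fields_per_entry(mode & (flag - 1));
}
```

With mode = FINCORE_PGOFF|FINCORE_PAGE_FLAGS and nr_pages = 4, vec_size()
gives 64 bytes, and field_index() puts the page offset in field 0 and the
page flags in field 1, matching the example in the man page text.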


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/4] define PAGECACHE_TAG_* as enumeration under include/uapi
  2014-07-03 21:52   ` Naoya Horiguchi
@ 2014-07-04  1:16     ` Dave Chinner
  -1 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-07-04  1:16 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Dave Hansen, Christoph Hellwig,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Thu, Jul 03, 2014 at 05:52:12PM -0400, Naoya Horiguchi wrote:
> We need the pagecache tags to be exported to userspace later in this
> series for fincore(2), so this patch moves the definition to the new
> include file for preparation. We also use the number of pagecache tags,
> so this patch also adds it.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

NACK.

The radix tree tags are deeply internal implementation details.
They are an artifact of the current mark-and-sweep writeback
algorithm, and as such should never, ever be exposed to userspace,
let alone fixed in an ABI we need to support forever more.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/4] define PAGECACHE_TAG_* as enumeration under include/uapi
  2014-07-04  1:16     ` Dave Chinner
@ 2014-07-04  1:41       ` Naoya Horiguchi
  -1 siblings, 0 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2014-07-04  1:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Dave Hansen, Christoph Hellwig,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Fri, Jul 04, 2014 at 11:16:39AM +1000, Dave Chinner wrote:
> On Thu, Jul 03, 2014 at 05:52:12PM -0400, Naoya Horiguchi wrote:
> > We need the pagecache tags to be exported to userspace later in this
> > series for fincore(2), so this patch moves the definition to the new
> > include file for preparation. We also use the number of pagecache tags,
> > so this patch also adds it.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> NACK.
> 
> The radix tree tags are deeply internal implementation details.
> They are an artifact of the current mark-and-sweep writeback
> algorithm, and as such should never, ever be exposed to userspace,
> let alone fixed in an ABI we need to support forever more.

Hm, OK, so I'll redo this whole series without the pagecache tag parts.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce fincore()
  2014-07-03 21:52   ` Naoya Horiguchi
@ 2014-07-04 10:12     ` Christoph Hellwig
  -1 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2014-07-04 10:12 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Dave Hansen, Christoph Hellwig,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Thu, Jul 03, 2014 at 05:52:13PM -0400, Naoya Horiguchi wrote:
> This patch provides a new system call fincore(2), which provides mincore()-
> like information, i.e. page residency of a given file. But unlike mincore(),
> fincore() has a mode flag which allows us to extract detailed information
> about page cache like pfn and page flag. This kind of information is very
> helpful, for example when applications want to know the file cache status
> to control the IO on their own way.

It's still a nasty multiplexer for multiple different reporting formats
in a single system call.  How about you really just do a fincore that
mirrors mincore instead of piggybacking exports of various internal
flags (tags and page flags) onto it?
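
For comparison, the mincore-style residency check that this suggestion
alludes to can already be approximated today by mapping the file and calling
mincore(2). A minimal sketch, assuming a regular file that can be mapped
read-only; count_resident_pages() is a hypothetical helper name and is not
part of the patch:

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count the resident pages of the file open on fd using mmap(2) and
 * mincore(2). Returns the number of resident pages, or -1 on error. */
long count_resident_pages(int fd)
{
    struct stat st;
    long ps = sysconf(_SC_PAGESIZE);

    if (ps <= 0 || fstat(fd, &st) == -1)
        return -1;
    if (st.st_size == 0)
        return 0;                      /* nothing to map */

    size_t len = (size_t)st.st_size;
    size_t nr_pages = (len + (size_t)ps - 1) / (size_t)ps;

    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    long resident = -1;
    unsigned char *vec = malloc(nr_pages);
    if (vec && mincore(map, len, vec) == 0) {
        resident = 0;
        /* Only the least significant bit of each byte is defined. */
        for (size_t i = 0; i < nr_pages; i++)
            resident += vec[i] & 1;
    }

    free(vec);
    munmap(map, len);
    return resident;
}
```

The cost of this approach is the temporary mapping of the whole file, which
is the overhead a file-descriptor-based call would avoid.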


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce fincore()
  2014-07-04 10:12     ` Christoph Hellwig
  (?)
@ 2014-07-04 15:15     ` Cédric Villemain
  2014-07-04 16:31         ` Naoya Horiguchi
  -1 siblings, 1 reply; 20+ messages in thread
From: Cédric Villemain @ 2014-07-04 15:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Naoya Horiguchi, Andrew Morton, Konstantin Khlebnikov,
	Wu Fengguang, Arnaldo Carvalho de Melo, Borislav Petkov,
	Kirill A. Shutemov, Johannes Weiner, Rusty Russell, David Miller,
	Andres Freund, linux-kernel, linux-mm, Dave Hansen,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Friday 4 July 2014 03:12:30, Christoph Hellwig wrote:
> On Thu, Jul 03, 2014 at 05:52:13PM -0400, Naoya Horiguchi wrote:
> > This patch provides a new system call fincore(2), which provides
> > mincore()- like information, i.e. page residency of a given file.
> > But unlike mincore(), fincore() has a mode flag which allows us to
> > extract detailed information about page cache like pfn and page
> > flag. This kind of information is very helpful, for example when
> > applications want to know the file cache status to control the IO
> > on their own way.
> 
> It's still a nasty multiplexer for multiple different reporting
> formats in a single system call.  How about you really just do a
> fincore that mirrors mincore instead of piggybacking exports of
> various internal flags (tags and page flags) onto it?

A mincore-style fincore has drawn some arguments against it too. This
implementation seems to try (I have not tested it or looked at it closely
yet) to answer both concerns: providing details while still allowing an
aggregation function that is not too expensive.

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: 24x7 Support - Development, Expertise and Training

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce fincore()
  2014-07-04 15:15     ` Cédric Villemain
  2014-07-04 16:31         ` Naoya Horiguchi
@ 2014-07-04 16:31         ` Naoya Horiguchi
  0 siblings, 0 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2014-07-04 16:31 UTC (permalink / raw)
  To: Cédric Villemain
  Cc: Christoph Hellwig, Andrew Morton, Konstantin Khlebnikov,
	Wu Fengguang, Arnaldo Carvalho de Melo, Borislav Petkov,
	Kirill A. Shutemov, Johannes Weiner, Rusty Russell, David Miller,
	Andres Freund, linux-kernel, linux-mm, Dave Hansen,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Fri, Jul 04, 2014 at 05:15:59PM +0200, Cédric Villemain wrote:
> On Friday 4 July 2014 03:12:30, Christoph Hellwig wrote:
> > On Thu, Jul 03, 2014 at 05:52:13PM -0400, Naoya Horiguchi wrote:
> > > This patch provides a new system call fincore(2), which provides
> > > mincore()- like information, i.e. page residency of a given file.
> > > But unlike mincore(), fincore() has a mode flag which allows us to
> > > extract detailed information about page cache like pfn and page
> > > flag. This kind of information is very helpful, for example when
> > > applications want to know the file cache status to control the IO
> > > on their own way.
> > 
> > It's still a nasty multiplexer for multiple different reporting
> > formats in a single system call.  How about you really just do a
> > fincore that mirrors mincore instead of piggybacking exports of
> > various internal flags (tags and page flags) onto it?

We can do it in a mincore-compatible way with FINCORE_BMAP mode.
If you choose it, you don't need to care about the details of the
other modes. I haven't defined a default mode, but if there is a good
reason, I'm OK with making FINCORE_BMAP the default mode.
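
For reference, the mincore()-style result that FINCORE_BMAP is meant to
mirror can be obtained today only by mapping the file first. The sketch
below uses the existing mmap(2) and mincore(2) calls, not the proposed
fincore(); the helper name count_resident_pages() is hypothetical and
only for illustration.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the pages of a file that are resident in
 * the page cache, using mmap() + mincore() (not the proposed fincore()).
 * Returns the number of resident pages, or -1 on error.
 * *total_pages receives the number of pages spanned by the file.
 */
long count_resident_pages(const char *path, long *total_pages)
{
    long resident = -1;
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    long page_size = sysconf(_SC_PAGESIZE);
    size_t npages = ((size_t)st.st_size + (size_t)page_size - 1) / (size_t)page_size;

    *total_pages = (long)npages;
    if (npages == 0) {          /* empty file: nothing to map */
        close(fd);
        return 0;
    }

    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* mincore() fills one byte per page; bit 0 set means "resident". */
    unsigned char *vec = malloc(npages);
    if (vec && mincore(map, (size_t)st.st_size, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
    }

    free(vec);
    munmap(map, (size_t)st.st_size);
    close(fd);
    return resident;
}
```

The extra mmap()/munmap() round trip is the cost a file-descriptor based
call would avoid. Note that recent kernels report every page as resident
when the caller neither owns the file nor can write to it.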

> A fincore à la mincore got some arguments against it too. It seems
> this implementation tries (I have not tested it or had a close look
> yet) to answer both concerns: providing details while keeping an
> aggregation function that is not too expensive.

Correct, that's the motivation for this non-trivial interface.
This could finally obsolete the messy /proc/kpage{flags,count} and
/proc/pid/pagemap kind of interfaces, and we would no longer have to
collect information across all of them (so it is less expensive).
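
To illustrate the multi-step collection referred to above, here is a
rough sketch of the first step with the existing interface: looking up
the /proc/self/pagemap entry for one virtual address. The helper names
are hypothetical. The bit layout used (bit 63 = present, bits 0-54 =
pfn) is the documented pagemap format; on recent kernels the pfn field
reads as zero without CAP_SYS_ADMIN.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the 64-bit /proc/self/pagemap entry for the
 * page containing addr. Returns 0 on success, -1 on error.
 */
int pagemap_read_entry(const void *addr, uint64_t *entry)
{
    long page_size = sysconf(_SC_PAGESIZE);
    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    off_t offset = (off_t)((uintptr_t)addr / (uintptr_t)page_size) *
                   (off_t)sizeof(*entry);
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, entry, sizeof(*entry), offset);

    close(fd);
    return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}

/* Bit 63: the page is present in RAM. */
int pagemap_present(uint64_t entry)
{
    return (int)((entry >> 63) & 1);
}

/* Bits 0-54: page frame number, meaningful only when the page is present. */
uint64_t pagemap_pfn(uint64_t entry)
{
    return entry & ((UINT64_C(1) << 55) - 1);
}
```

The pfn obtained this way would then be used as an index into
/proc/kpageflags and /proc/kpagecount (8 bytes per pfn, readable by root
only), which is the second and third lookup per page that a single call
would replace.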

Thanks,
Naoya Horiguchi

end of thread, other threads:[~2014-07-04 17:43 UTC | newest]

Thread overview: 20+ messages
2014-07-03 21:52 [PATCH 0/4] mm: introduce fincore() v2 Naoya Horiguchi
2014-07-03 21:52 ` Naoya Horiguchi
2014-07-03 21:52 ` [PATCH v2 1/4] define PAGECACHE_TAG_* as enumeration under include/uapi Naoya Horiguchi
2014-07-03 21:52   ` Naoya Horiguchi
2014-07-04  1:16   ` Dave Chinner
2014-07-04  1:16     ` Dave Chinner
2014-07-04  1:41     ` Naoya Horiguchi
2014-07-04  1:41       ` Naoya Horiguchi
2014-07-03 21:52 ` [PATCH v2 2/4] mm: introduce fincore() Naoya Horiguchi
2014-07-03 21:52   ` Naoya Horiguchi
2014-07-04 10:12   ` Christoph Hellwig
2014-07-04 10:12     ` Christoph Hellwig
2014-07-04 15:15     ` Cédric Villemain
2014-07-04 16:31       ` Naoya Horiguchi
2014-07-04 16:31         ` Naoya Horiguchi
2014-07-04 16:31         ` Naoya Horiguchi
2014-07-03 21:52 ` [PATCH v2 3/4] selftests/fincore: add test code for fincore() Naoya Horiguchi
2014-07-03 21:52   ` Naoya Horiguchi
2014-07-03 21:52 ` [PATCH v2 4/4] man2/fincore.2: document general description about fincore(2) Naoya Horiguchi
2014-07-03 21:52   ` Naoya Horiguchi
