All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/3] mm: introduce fincore() v3
@ 2014-07-07 18:00 ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 18:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

This is the 3rd version of fincore patchset.
I stop exporting PAGECACHE_TAG_* information to userspace via fincore().
Rather than that, no major change since v2.

Any comments/reviews are welcomed.

v2: http://lwn.net/Articles/604380/
v1: http://lwn.net/Articles/601020/

Thanks,
Naoya Horiguchi
---
Tree: git@github.com:Naoya-Horiguchi/linux.git
Branch: v3.16-rc3/fincore.ver3
---
Summary:

Naoya Horiguchi (3):
      mm: introduce fincore()
      selftests/fincore: add test code for fincore()
      man2/fincore.2: document general description about fincore(2)

 arch/x86/syscalls/syscall_64.tbl                   |   1 +
 include/linux/syscalls.h                           |   4 +
 include/uapi/linux/fincore.h                       |  84 +++++
 man2/fincore.2                                     | 348 ++++++++++++++++++++
 mm/Makefile                                        |   2 +-
 mm/fincore.c                                       | 286 ++++++++++++++++
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/fincore/Makefile           |  31 ++
 .../selftests/fincore/create_hugetlbfs_file.c      |  49 +++
 tools/testing/selftests/fincore/fincore.c          | 153 +++++++++
 tools/testing/selftests/fincore/run_fincoretests   | 361 +++++++++++++++++++++
 11 files changed, 1319 insertions(+), 1 deletion(-)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3 0/3] mm: introduce fincore() v3
@ 2014-07-07 18:00 ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 18:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

This is the 3rd version of fincore patchset.
I stop exporting PAGECACHE_TAG_* information to userspace via fincore().
Rather than that, no major change since v2.

Any comments/reviews are welcomed.

v2: http://lwn.net/Articles/604380/
v1: http://lwn.net/Articles/601020/

Thanks,
Naoya Horiguchi
---
Tree: git@github.com:Naoya-Horiguchi/linux.git
Branch: v3.16-rc3/fincore.ver3
---
Summary:

Naoya Horiguchi (3):
      mm: introduce fincore()
      selftests/fincore: add test code for fincore()
      man2/fincore.2: document general description about fincore(2)

 arch/x86/syscalls/syscall_64.tbl                   |   1 +
 include/linux/syscalls.h                           |   4 +
 include/uapi/linux/fincore.h                       |  84 +++++
 man2/fincore.2                                     | 348 ++++++++++++++++++++
 mm/Makefile                                        |   2 +-
 mm/fincore.c                                       | 286 ++++++++++++++++
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/fincore/Makefile           |  31 ++
 .../selftests/fincore/create_hugetlbfs_file.c      |  49 +++
 tools/testing/selftests/fincore/fincore.c          | 153 +++++++++
 tools/testing/selftests/fincore/run_fincoretests   | 361 +++++++++++++++++++++
 11 files changed, 1319 insertions(+), 1 deletion(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3 1/3] mm: introduce fincore()
  2014-07-07 18:00 ` Naoya Horiguchi
@ 2014-07-07 18:00   ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 18:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

This patch provides a new system call fincore(), which extracts mincore()-
like information from the kernel, i.e. page residency of a given file.
But unlike mincore(), fincore() has a mode flag which allows us to extract
more detailed information like pfn and page flag. This kind of information
is helpful, for example when applications want to know the file cache status
to control the IO on their own way.
There was some complaint around current mincore() API, where we have only
one byte for each page and mincore() is not extensible without breaking
existing applications, so the mode flag is to avoid the same problem.

The details about the data format being passed to userspace are explained
in inline comment, but generally in long entry format, we can choose which
information is extraced flexibly, so you don't have to waste memory by
extracting unnecessary information. And with FINCORE_PGOFF flag, we can skip
hole pages (not on memory,) which makes us avoid a flood of meaningless
zero entries when calling on extremely large (but only few pages of it
are loaded on memory) file.

Basic testset is added in the next patch on tools/testing/selftests/fincore/.

ChangeLog v3:
- remove pagecache tag things
- rename include/uapi/linux/pagecache.h to include/uapi/linux/fincore.h

ChangeLog v2:
- move definition of FINCORE_* to include/uapi/linux/pagecache.h
- add another parameter fincore_extra to sys_fincore()
- rename FINCORE_SKIP_HOLE to FINCORE_PGOFF and change bit order.
- add valid argument check (start should be inside file address range,
  nr_pages should be positive)
- add end-of-file check (scan to the end of file even if the last page
  is a hole)
- add access_ok(VERIFY_WIRTE) (copied from mincore())
- update inline comments

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/syscalls.h         |   4 +
 include/uapi/linux/fincore.h     |  84 ++++++++++++
 mm/Makefile                      |   2 +-
 mm/fincore.c                     | 286 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 376 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/fincore.h
 create mode 100644 mm/fincore.c

diff --git v3.16-rc3.orig/arch/x86/syscalls/syscall_64.tbl v3.16-rc3/arch/x86/syscalls/syscall_64.tbl
index ec255a1646d2..9d291b7081ca 100644
--- v3.16-rc3.orig/arch/x86/syscalls/syscall_64.tbl
+++ v3.16-rc3/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	fincore			sys_fincore
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git v3.16-rc3.orig/include/linux/syscalls.h v3.16-rc3/include/linux/syscalls.h
index b0881a0ed322..60795ee8f9ee 100644
--- v3.16-rc3.orig/include/linux/syscalls.h
+++ v3.16-rc3/include/linux/syscalls.h
@@ -65,6 +65,7 @@ struct old_linux_dirent;
 struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
+struct fincore_extra;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -866,4 +867,7 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
 asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
 			 unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_fincore(int fd, loff_t start, long nr_pages,
+			int mode, unsigned char __user *vec,
+			struct fincore_extra __user *extra);
 #endif
diff --git v3.16-rc3.orig/include/uapi/linux/fincore.h v3.16-rc3/include/uapi/linux/fincore.h
new file mode 100644
index 000000000000..30797e68a8d4
--- /dev/null
+++ v3.16-rc3/include/uapi/linux/fincore.h
@@ -0,0 +1,84 @@
+#ifndef _UAPI_LINUX_FINCORE_H
+#define _UAPI_LINUX_FINCORE_H
+
+/*
+ * You can control how the buffer in userspace is filled with this mode
+ * parameters:
+ *
+ * - FINCORE_BMAP:
+ *     the page status is returned in a vector of bytes.
+ *     The least significant bit of each byte is 1 if the referenced page
+ *     is in memory, otherwise it is zero.
+ *
+ * - FINCORE_PGOFF:
+ *     if this flag is set, fincore() doesn't store any information about
+ *     holes. Instead each records per page has the entry of page offset,
+ *     using 8 bytes. This mode is useful if we handle a large file and
+ *     only few pages are on memory.
+ *
+ * - FINCORE_PFN:
+ *     stores pfn, using 8 bytes.
+ *
+ * - FINCORE_PAGEFLAGS:
+ *     stores page flags, using 8 bytes. See definition of KPF_* for
+ *     details of each bit.
+ *
+ * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
+ * data in this mode is like this:
+ *
+ *   page offset  0   1   2   3   4
+ *              +---+---+---+---+---+
+ *              | 1 | 0 | 0 | 1 | 1 | ...
+ *              +---+---+---+---+---+
+ *               <->
+ *              1 byte
+ *
+ * For FINCORE_PFN, page data is formatted like this:
+ *
+ *   page offset    0       1       2       3       4
+ *              +-------+-------+-------+-------+-------+
+ *              |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
+ *              +-------+-------+-------+-------+-------+
+ *               <----->
+ *               8 byte
+ *
+ * We can use multiple flags among the flags in FINCORE_LONGENTRY_MASK.
+ * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
+ * information is stored like this:
+ *
+ *    page offset 0    page offset 1   page offset 2   page offset 3
+ *                                        (hole)
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *   |  pfn  | flags |  pfn  | flags |   0   |   0   |  pfn  | flags | ...
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *    <-------------> <-------------> <-------------> <------------->
+ *       16 bytes        16 bytes        16 bytes        16 bytes
+ *
+ * When FINCORE_PGOFF is set, we store page offset entry and ignore holes
+ * For example, the data format of mode FINCORE_PGOFF|FINCORE_PFN|
+ * FINCORE_PAGEFLAGS is like follows:
+ *
+ *   +-------+-------+-------+-------+-------+-------+
+ *   | pgoff |  pfn  | flags | pgoff |  pfn  | flags | ...
+ *   +-------+-------+-------+-------+-------+-------+
+ *    <---------------------> <--------------------->
+ *           24 bytes                24 bytes
+ */
+#define FINCORE_BMAP		0x01	/* bytemap mode */
+#define FINCORE_PGOFF		0x02
+#define FINCORE_PFN		0x04
+#define FINCORE_PAGE_FLAGS	0x08
+
+#define FINCORE_MODE_MASK	0x1f
+#define FINCORE_LONGENTRY_MASK	(FINCORE_PGOFF | FINCORE_PFN | \
+				 FINCORE_PAGE_FLAGS)
+
+struct fincore_extra {
+	/*
+	 * (output) the number of entries with valid data, this is useful
+	 * if you set FINCORE_PGOFF and want to know the end of filled data.
+	 */
+	unsigned long nr_entries;
+};
+
+#endif /* _UAPI_LINUX_FINCORE_H */
diff --git v3.16-rc3.orig/mm/Makefile v3.16-rc3/mm/Makefile
index 4064f3ec145e..cc9420221afd 100644
--- v3.16-rc3.orig/mm/Makefile
+++ v3.16-rc3/mm/Makefile
@@ -18,7 +18,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   iov_iter.o $(mmu-y)
+			   iov_iter.o fincore.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git v3.16-rc3.orig/mm/fincore.c v3.16-rc3/mm/fincore.c
new file mode 100644
index 000000000000..2f20ac3569df
--- /dev/null
+++ v3.16-rc3/mm/fincore.c
@@ -0,0 +1,286 @@
+/*
+ * fincore(2) system call
+ *
+ * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
+ */
+
+#include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <uapi/linux/fincore.h>
+
+struct fincore_control {
+	int mode;
+	int width;		/* width of each entry (in bytes) */
+	unsigned char *buffer;
+	long buffer_size;
+	void *cursor;		/* current position on the buffer */
+	pgoff_t pgstart;	/* start point of page cache scan in each run
+				 * of the while loop */
+	long nr_pages;		/* number of pages to be copied to userspace
+				 * (decreasing while scan proceeds) */
+	long scanned_offset;	/* page offset of the lastest scanned page */
+	struct address_space *mapping;
+};
+
+#define store_entry(fc, type, data) ({		\
+	*(type *)fc->cursor = (type)data;	\
+	fc->cursor += sizeof(type);		\
+})
+
+/*
+ * Store page cache data to temporal buffer in the specified format depending
+ * on fincore mode.
+ */
+static void __do_fincore(struct fincore_control *fc, struct page *page,
+			 unsigned long index)
+{
+	VM_BUG_ON(!page);
+	VM_BUG_ON((unsigned long)fc->cursor - (unsigned long)fc->buffer
+		  >= fc->buffer_size);
+	if (fc->mode & FINCORE_BMAP)
+		store_entry(fc, unsigned char, PageUptodate(page));
+	else if (fc->mode & (FINCORE_LONGENTRY_MASK)) {
+		if (fc->mode & FINCORE_PGOFF)
+			store_entry(fc, unsigned long, index);
+		if (fc->mode & FINCORE_PFN)
+			store_entry(fc, unsigned long, page_to_pfn(page));
+		if (fc->mode & FINCORE_PAGE_FLAGS)
+			store_entry(fc, unsigned long, stable_page_flags(page));
+	}
+}
+
+/*
+ * Traverse page cache tree. It's assumed that temporal buffer are zeroed
+ * in advance. Due to this, we don't have to store zero entry explicitly
+ * one-by-one and we just set fc->cursor to the position of the next
+ * on-memory page.
+ *
+ * Return value is the number of pages whose data is stored in fc->buffer.
+ */
+static long do_fincore(struct fincore_control *fc, int nr_pages)
+{
+	pgoff_t pgend = fc->pgstart + nr_pages;
+	struct radix_tree_iter iter;
+	void **slot;
+	long nr = 0;
+
+	fc->cursor = fc->buffer;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &fc->mapping->page_tree, &iter,
+				 fc->pgstart) {
+		long jump;
+		struct page *page;
+
+		fc->scanned_offset = iter.index;
+		/* Handle holes */
+		jump = iter.index - fc->pgstart - nr;
+		if (jump) {
+			if (!(fc->mode & FINCORE_PGOFF)) {
+				if (iter.index < pgend) {
+					fc->cursor += jump * fc->width;
+					nr = iter.index - fc->pgstart;
+				} else {
+					/*
+					 * Fill remaining buffer as hole. Next
+					 * call should start at offset pgend.
+					 */
+					nr = nr_pages;
+					fc->scanned_offset = pgend - 1;
+					break;
+				}
+			}
+		}
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			/*
+			 * No need to increment nr and fc->cursor, because next
+			 * iteration should detect hole and update them there.
+			 */
+			continue;
+		else if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				WARN_ON(iter.index);
+				goto restart;
+			}
+			__do_fincore(fc, page, iter.index);
+		} else {
+			if (!page_cache_get_speculative(page))
+				goto repeat;
+
+			/* Has the page moved? */
+			if (unlikely(page != *slot)) {
+				page_cache_release(page);
+				goto repeat;
+			}
+
+			__do_fincore(fc, page, iter.index);
+			page_cache_release(page);
+		}
+
+		if (++nr == nr_pages)
+			break;
+	}
+
+	if (!(fc->mode & FINCORE_PGOFF)) {
+		nr = nr_pages;
+		fc->scanned_offset = pgend - 1;
+	}
+
+	rcu_read_unlock();
+
+	return nr;
+}
+
+static inline bool fincore_validate_mode(int mode)
+{
+	if (mode & ~FINCORE_MODE_MASK)
+		return false;
+	if (!(!!(mode & FINCORE_BMAP) ^ !!(mode & FINCORE_LONGENTRY_MASK)))
+		return false;
+	return true;
+}
+
+#define FINCORE_LOOP_STEP	256L
+
+/*
+ * The fincore(2) system call
+ *
+ *  @fd:        file descriptor of the target file
+ *  @start:     starting address offset of the target file (in byte).
+ *              This should be aligned to page cache size.
+ *  @nr_pages:  the number of pages whose data is passed to userspace.
+ *  @mode       fincore mode flags to determine the entry's format
+ *  @vec        pointer of the userspace buffer. The size must be equal to or
+ *              larger than (@nr_pages * width), where width is the size of
+ *              each entry.
+ *  @extra      used to input/output additional information from/to userspace
+ *
+ * fincore() returns the memory residency status and additional info (like
+ * pfn and page flags) of the given file's pages.
+ *
+ * Depending on the fincore mode, caller can receive the different formatted
+ * information. See the comment on definition of FINCORE_*.
+ *
+ * Because the status of a page can change after fincore() checks it once,
+ * the returned vector may contain stale information.
+ *
+ * return values:
+ *  -EBADF:   @fd isn't a valid open file descriptor
+ *  -EFAULT:  @vec points to an illegal address
+ *  -EINVAL:  @start is unaligned to page cache size or is out of file range.
+ *            Or @nr_pages is non-positive. Or @mode is invalid.
+ *  0:        fincore() is successfully done
+ */
+SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
+		int, mode, unsigned char __user *, vec,
+		struct fincore_extra __user *, extra)
+{
+	long ret = 0;
+	long step;
+	long nr = 0;
+	long pages_to_eof;
+	int pc_shift = PAGE_CACHE_SHIFT;
+	struct fd f;
+
+	struct fincore_control fc = {
+		.mode	= mode,
+		.width	= sizeof(unsigned char),
+	};
+
+	if (start < 0 || nr_pages <= 0)
+		return -EINVAL;
+
+	if (!fincore_validate_mode(mode))
+		return -EINVAL;
+
+	f = fdget(fd);
+
+	if (is_file_hugepages(f.file))
+		pc_shift = huge_page_shift(hstate_file(f.file));
+
+	if (!IS_ALIGNED(start, 1 << pc_shift)) {
+		ret = -EINVAL;
+		goto fput;
+	}
+
+	/*
+	 * TODO: support /dev/mem, /proc/pid/mem for system/process wide
+	 * page survey, which would obsolete /proc/kpageflags, and
+	 * /proc/pid/pagemap.
+	 */
+	if (!S_ISREG(file_inode(f.file)->i_mode)) {
+		ret = -EBADF;
+		goto fput;
+	}
+
+	fc.pgstart = start >> pc_shift;
+	pages_to_eof = DIV_ROUND_UP(i_size_read(file_inode(f.file)),
+				    1UL << pc_shift) - fc.pgstart;
+	/* start is too large */
+	if (pages_to_eof <= 0) {
+		ret = -EINVAL;
+		goto fput;
+	}
+	/* Never go beyond the end of file */
+	fc.nr_pages = min(pages_to_eof, nr_pages);
+	fc.mapping = f.file->f_mapping;
+	if (mode & FINCORE_LONGENTRY_MASK)
+		fc.width = ((mode & FINCORE_PGOFF ? 1 : 0) +
+			    (mode & FINCORE_PFN ? 1 : 0) +
+			    (mode & FINCORE_PAGE_FLAGS ? 1 : 0)
+			) * sizeof(unsigned long);
+
+	if (!access_ok(VERIFY_WRITE, vec, nr_pages * fc.width)) {
+		ret = -EFAULT;
+		goto fput;
+	}
+
+	step = min(fc.nr_pages, FINCORE_LOOP_STEP);
+
+	fc.buffer_size = step * fc.width;
+	fc.buffer = kmalloc(fc.buffer_size, GFP_TEMPORARY);
+	if (!fc.buffer) {
+		ret = -ENOMEM;
+		goto fput;
+	}
+
+	while (fc.nr_pages > 0) {
+		memset(fc.buffer, 0, fc.buffer_size);
+		ret = do_fincore(&fc, min(step, fc.nr_pages));
+		/* Reached the end of the file */
+		if (ret == 0)
+			break;
+		if (ret < 0)
+			break;
+		if (copy_to_user(vec + nr * fc.width,
+				 fc.buffer, ret * fc.width)) {
+			ret = -EFAULT;
+			break;
+		}
+		fc.nr_pages -= ret;
+		fc.pgstart = fc.scanned_offset + 1;
+		nr += ret;
+	}
+
+	kfree(fc.buffer);
+
+	if (extra)
+		__put_user(nr, &extra->nr_entries);
+
+fput:
+	fdput(f);
+	return ret;
+}
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-07 18:00   ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 18:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

This patch provides a new system call fincore(), which extracts mincore()-
like information from the kernel, i.e. page residency of a given file.
But unlike mincore(), fincore() has a mode flag which allows us to extract
more detailed information like pfn and page flag. This kind of information
is helpful, for example when applications want to know the file cache status
to control the IO on their own way.
There was some complaint around current mincore() API, where we have only
one byte for each page and mincore() is not extensible without breaking
existing applications, so the mode flag is to avoid the same problem.

The details about the data format being passed to userspace are explained
in inline comment, but generally in long entry format, we can choose which
information is extraced flexibly, so you don't have to waste memory by
extracting unnecessary information. And with FINCORE_PGOFF flag, we can skip
hole pages (not on memory,) which makes us avoid a flood of meaningless
zero entries when calling on extremely large (but only few pages of it
are loaded on memory) file.

Basic testset is added in the next patch on tools/testing/selftests/fincore/.

ChangeLog v3:
- remove pagecache tag things
- rename include/uapi/linux/pagecache.h to include/uapi/linux/fincore.h

ChangeLog v2:
- move definition of FINCORE_* to include/uapi/linux/pagecache.h
- add another parameter fincore_extra to sys_fincore()
- rename FINCORE_SKIP_HOLE to FINCORE_PGOFF and change bit order.
- add valid argument check (start should be inside file address range,
  nr_pages should be positive)
- add end-of-file check (scan to the end of file even if the last page
  is a hole)
- add access_ok(VERIFY_WIRTE) (copied from mincore())
- update inline comments

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/syscalls.h         |   4 +
 include/uapi/linux/fincore.h     |  84 ++++++++++++
 mm/Makefile                      |   2 +-
 mm/fincore.c                     | 286 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 376 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/fincore.h
 create mode 100644 mm/fincore.c

diff --git v3.16-rc3.orig/arch/x86/syscalls/syscall_64.tbl v3.16-rc3/arch/x86/syscalls/syscall_64.tbl
index ec255a1646d2..9d291b7081ca 100644
--- v3.16-rc3.orig/arch/x86/syscalls/syscall_64.tbl
+++ v3.16-rc3/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	fincore			sys_fincore
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git v3.16-rc3.orig/include/linux/syscalls.h v3.16-rc3/include/linux/syscalls.h
index b0881a0ed322..60795ee8f9ee 100644
--- v3.16-rc3.orig/include/linux/syscalls.h
+++ v3.16-rc3/include/linux/syscalls.h
@@ -65,6 +65,7 @@ struct old_linux_dirent;
 struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
+struct fincore_extra;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -866,4 +867,7 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
 asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
 			 unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_fincore(int fd, loff_t start, long nr_pages,
+			int mode, unsigned char __user *vec,
+			struct fincore_extra __user *extra);
 #endif
diff --git v3.16-rc3.orig/include/uapi/linux/fincore.h v3.16-rc3/include/uapi/linux/fincore.h
new file mode 100644
index 000000000000..30797e68a8d4
--- /dev/null
+++ v3.16-rc3/include/uapi/linux/fincore.h
@@ -0,0 +1,84 @@
+#ifndef _UAPI_LINUX_FINCORE_H
+#define _UAPI_LINUX_FINCORE_H
+
+/*
+ * You can control how the buffer in userspace is filled with this mode
+ * parameters:
+ *
+ * - FINCORE_BMAP:
+ *     the page status is returned in a vector of bytes.
+ *     The least significant bit of each byte is 1 if the referenced page
+ *     is in memory, otherwise it is zero.
+ *
+ * - FINCORE_PGOFF:
+ *     if this flag is set, fincore() doesn't store any information about
+ *     holes. Instead each records per page has the entry of page offset,
+ *     using 8 bytes. This mode is useful if we handle a large file and
+ *     only few pages are on memory.
+ *
+ * - FINCORE_PFN:
+ *     stores pfn, using 8 bytes.
+ *
+ * - FINCORE_PAGEFLAGS:
+ *     stores page flags, using 8 bytes. See definition of KPF_* for
+ *     details of each bit.
+ *
+ * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
+ * data in this mode is like this:
+ *
+ *   page offset  0   1   2   3   4
+ *              +---+---+---+---+---+
+ *              | 1 | 0 | 0 | 1 | 1 | ...
+ *              +---+---+---+---+---+
+ *               <->
+ *              1 byte
+ *
+ * For FINCORE_PFN, page data is formatted like this:
+ *
+ *   page offset    0       1       2       3       4
+ *              +-------+-------+-------+-------+-------+
+ *              |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
+ *              +-------+-------+-------+-------+-------+
+ *               <----->
+ *               8 byte
+ *
+ * We can use multiple flags among the flags in FINCORE_LONGENTRY_MASK.
+ * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
+ * information is stored like this:
+ *
+ *    page offset 0    page offset 1   page offset 2   page offset 3
+ *                                        (hole)
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *   |  pfn  | flags |  pfn  | flags |   0   |   0   |  pfn  | flags | ...
+ *   +-------+-------+-------+-------+-------+-------+-------+-------+
+ *    <-------------> <-------------> <-------------> <------------->
+ *       16 bytes        16 bytes        16 bytes        16 bytes
+ *
+ * When FINCORE_PGOFF is set, we store page offset entry and ignore holes
+ * For example, the data format of mode FINCORE_PGOFF|FINCORE_PFN|
+ * FINCORE_PAGEFLAGS is like follows:
+ *
+ *   +-------+-------+-------+-------+-------+-------+
+ *   | pgoff |  pfn  | flags | pgoff |  pfn  | flags | ...
+ *   +-------+-------+-------+-------+-------+-------+
+ *    <---------------------> <--------------------->
+ *           24 bytes                24 bytes
+ */
+#define FINCORE_BMAP		0x01	/* bytemap mode */
+#define FINCORE_PGOFF		0x02
+#define FINCORE_PFN		0x04
+#define FINCORE_PAGE_FLAGS	0x08
+
+#define FINCORE_MODE_MASK	0x1f
+#define FINCORE_LONGENTRY_MASK	(FINCORE_PGOFF | FINCORE_PFN | \
+				 FINCORE_PAGE_FLAGS)
+
+struct fincore_extra {
+	/*
+	 * (output) the number of entries with valid data, this is useful
+	 * if you set FINCORE_PGOFF and want to know the end of filled data.
+	 */
+	unsigned long nr_entries;
+};
+
+#endif /* _UAPI_LINUX_FINCORE_H */
diff --git v3.16-rc3.orig/mm/Makefile v3.16-rc3/mm/Makefile
index 4064f3ec145e..cc9420221afd 100644
--- v3.16-rc3.orig/mm/Makefile
+++ v3.16-rc3/mm/Makefile
@@ -18,7 +18,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   iov_iter.o $(mmu-y)
+			   iov_iter.o fincore.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git v3.16-rc3.orig/mm/fincore.c v3.16-rc3/mm/fincore.c
new file mode 100644
index 000000000000..2f20ac3569df
--- /dev/null
+++ v3.16-rc3/mm/fincore.c
@@ -0,0 +1,286 @@
+/*
+ * fincore(2) system call
+ *
+ * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
+ */
+
+#include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <uapi/linux/fincore.h>
+
+struct fincore_control {
+	int mode;
+	int width;		/* width of each entry (in bytes) */
+	unsigned char *buffer;
+	long buffer_size;
+	void *cursor;		/* current position on the buffer */
+	pgoff_t pgstart;	/* start point of page cache scan in each run
+				 * of the while loop */
+	long nr_pages;		/* number of pages to be copied to userspace
+				 * (decreasing while scan proceeds) */
+	long scanned_offset;	/* page offset of the lastest scanned page */
+	struct address_space *mapping;
+};
+
+#define store_entry(fc, type, data) ({		\
+	*(type *)fc->cursor = (type)data;	\
+	fc->cursor += sizeof(type);		\
+})
+
+/*
+ * Store page cache data to temporal buffer in the specified format depending
+ * on fincore mode.
+ */
+static void __do_fincore(struct fincore_control *fc, struct page *page,
+			 unsigned long index)
+{
+	VM_BUG_ON(!page);
+	VM_BUG_ON((unsigned long)fc->cursor - (unsigned long)fc->buffer
+		  >= fc->buffer_size);
+	if (fc->mode & FINCORE_BMAP)
+		store_entry(fc, unsigned char, PageUptodate(page));
+	else if (fc->mode & (FINCORE_LONGENTRY_MASK)) {
+		if (fc->mode & FINCORE_PGOFF)
+			store_entry(fc, unsigned long, index);
+		if (fc->mode & FINCORE_PFN)
+			store_entry(fc, unsigned long, page_to_pfn(page));
+		if (fc->mode & FINCORE_PAGE_FLAGS)
+			store_entry(fc, unsigned long, stable_page_flags(page));
+	}
+}
+
+/*
+ * Traverse page cache tree. It's assumed that temporal buffer are zeroed
+ * in advance. Due to this, we don't have to store zero entry explicitly
+ * one-by-one and we just set fc->cursor to the position of the next
+ * on-memory page.
+ *
+ * Return value is the number of pages whose data is stored in fc->buffer.
+ */
+static long do_fincore(struct fincore_control *fc, int nr_pages)
+{
+	pgoff_t pgend = fc->pgstart + nr_pages;
+	struct radix_tree_iter iter;
+	void **slot;
+	long nr = 0;
+
+	fc->cursor = fc->buffer;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &fc->mapping->page_tree, &iter,
+				 fc->pgstart) {
+		long jump;
+		struct page *page;
+
+		fc->scanned_offset = iter.index;
+		/* Handle holes */
+		jump = iter.index - fc->pgstart - nr;
+		if (jump) {
+			if (!(fc->mode & FINCORE_PGOFF)) {
+				if (iter.index < pgend) {
+					fc->cursor += jump * fc->width;
+					nr = iter.index - fc->pgstart;
+				} else {
+					/*
+					 * Fill remaining buffer as hole. Next
+					 * call should start at offset pgend.
+					 */
+					nr = nr_pages;
+					fc->scanned_offset = pgend - 1;
+					break;
+				}
+			}
+		}
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			/*
+			 * No need to increment nr and fc->cursor, because next
+			 * iteration should detect hole and update them there.
+			 */
+			continue;
+		else if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				WARN_ON(iter.index);
+				goto restart;
+			}
+			__do_fincore(fc, page, iter.index);
+		} else {
+			if (!page_cache_get_speculative(page))
+				goto repeat;
+
+			/* Has the page moved? */
+			if (unlikely(page != *slot)) {
+				page_cache_release(page);
+				goto repeat;
+			}
+
+			__do_fincore(fc, page, iter.index);
+			page_cache_release(page);
+		}
+
+		if (++nr == nr_pages)
+			break;
+	}
+
+	if (!(fc->mode & FINCORE_PGOFF)) {
+		nr = nr_pages;
+		fc->scanned_offset = pgend - 1;
+	}
+
+	rcu_read_unlock();
+
+	return nr;
+}
+
+static inline bool fincore_validate_mode(int mode)
+{
+	if (mode & ~FINCORE_MODE_MASK)
+		return false;
+	if (!(!!(mode & FINCORE_BMAP) ^ !!(mode & FINCORE_LONGENTRY_MASK)))
+		return false;
+	return true;
+}
+
+#define FINCORE_LOOP_STEP	256L
+
+/*
+ * The fincore(2) system call
+ *
+ *  @fd:        file descriptor of the target file
+ *  @start:     starting address offset of the target file (in byte).
+ *              This should be aligned to page cache size.
+ *  @nr_pages:  the number of pages whose data is passed to userspace.
+ *  @mode       fincore mode flags to determine the entry's format
+ *  @vec        pointer of the userspace buffer. The size must be equal to or
+ *              larger than (@nr_pages * width), where width is the size of
+ *              each entry.
+ *  @extra      used to input/output additional information from/to userspace
+ *
+ * fincore() returns the memory residency status and additional info (like
+ * pfn and page flags) of the given file's pages.
+ *
+ * Depending on the fincore mode, caller can receive the different formatted
+ * information. See the comment on definition of FINCORE_*.
+ *
+ * Because the status of a page can change after fincore() checks it once,
+ * the returned vector may contain stale information.
+ *
+ * return values:
+ *  -EBADF:   @fd isn't a valid open file descriptor
+ *  -EFAULT:  @vec points to an illegal address
+ *  -EINVAL:  @start is unaligned to page cache size or is out of file range.
+ *            Or @nr_pages is non-positive. Or @mode is invalid.
+ *  0:        fincore() is successfully done
+ */
+SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
+		int, mode, unsigned char __user *, vec,
+		struct fincore_extra __user *, extra)
+{
+	long ret = 0;
+	long step;
+	long nr = 0;
+	long pages_to_eof;
+	int pc_shift = PAGE_CACHE_SHIFT;
+	struct fd f;
+
+	struct fincore_control fc = {
+		.mode	= mode,
+		.width	= sizeof(unsigned char),
+	};
+
+	if (start < 0 || nr_pages <= 0)
+		return -EINVAL;
+
+	if (!fincore_validate_mode(mode))
+		return -EINVAL;
+
+	f = fdget(fd);
+
+	if (is_file_hugepages(f.file))
+		pc_shift = huge_page_shift(hstate_file(f.file));
+
+	if (!IS_ALIGNED(start, 1 << pc_shift)) {
+		ret = -EINVAL;
+		goto fput;
+	}
+
+	/*
+	 * TODO: support /dev/mem, /proc/pid/mem for system/process wide
+	 * page survey, which would obsolete /proc/kpageflags, and
+	 * /proc/pid/pagemap.
+	 */
+	if (!S_ISREG(file_inode(f.file)->i_mode)) {
+		ret = -EBADF;
+		goto fput;
+	}
+
+	fc.pgstart = start >> pc_shift;
+	pages_to_eof = DIV_ROUND_UP(i_size_read(file_inode(f.file)),
+				    1UL << pc_shift) - fc.pgstart;
+	/* start is too large */
+	if (pages_to_eof <= 0) {
+		ret = -EINVAL;
+		goto fput;
+	}
+	/* Never go beyond the end of file */
+	fc.nr_pages = min(pages_to_eof, nr_pages);
+	fc.mapping = f.file->f_mapping;
+	if (mode & FINCORE_LONGENTRY_MASK)
+		fc.width = ((mode & FINCORE_PGOFF ? 1 : 0) +
+			    (mode & FINCORE_PFN ? 1 : 0) +
+			    (mode & FINCORE_PAGE_FLAGS ? 1 : 0)
+			) * sizeof(unsigned long);
+
+	if (!access_ok(VERIFY_WRITE, vec, nr_pages * fc.width)) {
+		ret = -EFAULT;
+		goto fput;
+	}
+
+	step = min(fc.nr_pages, FINCORE_LOOP_STEP);
+
+	fc.buffer_size = step * fc.width;
+	fc.buffer = kmalloc(fc.buffer_size, GFP_TEMPORARY);
+	if (!fc.buffer) {
+		ret = -ENOMEM;
+		goto fput;
+	}
+
+	while (fc.nr_pages > 0) {
+		memset(fc.buffer, 0, fc.buffer_size);
+		ret = do_fincore(&fc, min(step, fc.nr_pages));
+		/* Reached the end of the file */
+		if (ret == 0)
+			break;
+		if (ret < 0)
+			break;
+		if (copy_to_user(vec + nr * fc.width,
+				 fc.buffer, ret * fc.width)) {
+			ret = -EFAULT;
+			break;
+		}
+		fc.nr_pages -= ret;
+		fc.pgstart = fc.scanned_offset + 1;
+		nr += ret;
+	}
+
+	kfree(fc.buffer);
+
+	if (extra)
+		__put_user(nr, &extra->nr_entries);
+
+fput:
+	fdput(f);
+	return ret;
+}
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 2/3] selftests/fincore: add test code for fincore()
  2014-07-07 18:00 ` Naoya Horiguchi
@ 2014-07-07 18:00   ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 18:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

This patch adds simple test programs for fincore(), which contains the
following testcase:
  - test_smallfile_bytemap
  - test_smallfile_pfn
  - test_smallfile_multientry
  - test_smallfile_pfn_skiphole
  - test_largefile_pfn
  - test_largefile_pfn_offset
  - test_largefile_pfn_overrun
  - test_largefile_pfn_skiphole
  - test_tmpfs_pfn
  - test_hugetlb_pfn
  - test_invalid_start_address
  - test_invalid_len
  - test_invalid_mode
  - test_unaligned_start_address_hugetlb

ChangeLog v3:
- remove pagecache tag stuff

ChangeLog v2:
- include uapi/linux/pagecache.h
- add testcase test_invalid_start_address and test_invalid_len
- other small changes to adjust for the kernel's changes

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/fincore/Makefile           |  31 ++
 .../selftests/fincore/create_hugetlbfs_file.c      |  49 +++
 tools/testing/selftests/fincore/fincore.c          | 153 +++++++++
 tools/testing/selftests/fincore/run_fincoretests   | 361 +++++++++++++++++++++
 5 files changed, 595 insertions(+)
 create mode 100644 tools/testing/selftests/fincore/Makefile
 create mode 100644 tools/testing/selftests/fincore/create_hugetlbfs_file.c
 create mode 100644 tools/testing/selftests/fincore/fincore.c
 create mode 100644 tools/testing/selftests/fincore/run_fincoretests

diff --git v3.16-rc3.orig/tools/testing/selftests/Makefile v3.16-rc3/tools/testing/selftests/Makefile
index e66e710cc595..91e817b87a9e 100644
--- v3.16-rc3.orig/tools/testing/selftests/Makefile
+++ v3.16-rc3/tools/testing/selftests/Makefile
@@ -11,6 +11,7 @@ TARGETS += vm
 TARGETS += powerpc
 TARGETS += user
 TARGETS += sysctl
+TARGETS += fincore
 
 all:
 	for TARGET in $(TARGETS); do \
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/Makefile v3.16-rc3/tools/testing/selftests/fincore/Makefile
new file mode 100644
index 000000000000..ab4361c70da5
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/Makefile
@@ -0,0 +1,31 @@
+# Makefile for vm selftests
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+        ARCH := X86
+        CFLAGS := -DCONFIG_X86_32 -D__i386__
+endif
+ifeq ($(ARCH),x86_64)
+        ARCH := X86
+        CFLAGS := -DCONFIG_X86_64 -D__x86_64__
+endif
+
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+CFLAGS += -I../../../../usr/include/
+CFLAGS += -I../../../../arch/x86/include/
+
+BINARIES = fincore create_hugetlbfs_file
+
+all: $(BINARIES)
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+	@/bin/sh ./run_fincoretests || (echo "fincoretests: [FAIL]"; exit 1)
+
+clean:
+	$(RM) $(BINARIES)
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/create_hugetlbfs_file.c v3.16-rc3/tools/testing/selftests/fincore/create_hugetlbfs_file.c
new file mode 100644
index 000000000000..a46ccf0af5f2
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/create_hugetlbfs_file.c
@@ -0,0 +1,49 @@
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdlib.h>
+
+#define err(x) (perror(x), exit(1))
+
+unsigned long default_hugepage_size(void)
+{
+	unsigned long hps = 0;
+	char *line = NULL;
+	size_t linelen = 0;
+	FILE *f = fopen("/proc/meminfo", "r");
+	if (!f)
+		err("open /proc/meminfo");
+	while (getline(&line, &linelen, f) > 0) {
+		if (sscanf(line, "Hugepagesize:	%lu kB", &hps) == 1) {
+			hps <<= 10;
+			break;
+		}
+	}
+	free(line);
+	return hps;
+}
+
+int main(int argc, char **argv)
+{
+	int ret;
+	int fd;
+	char *p;
+	unsigned long hpsize = default_hugepage_size();
+	fd = open(argv[1], O_RDWR|O_CREAT);
+	if (fd == -1)
+		err("open");
+	p = mmap(NULL, 10 * hpsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+	if (p == (void *)-1)
+		err("mmap");
+	memset(p, 'a', 3 * hpsize);
+	memset(p + 7 * hpsize, 'a', 3 * hpsize - 1);
+	ret = close(fd);
+	if (ret == -1)
+		err("close");
+	return 0;
+}
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/fincore.c v3.16-rc3/tools/testing/selftests/fincore/fincore.c
new file mode 100644
index 000000000000..786a985ba5bd
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/fincore.c
@@ -0,0 +1,153 @@
+/*
+ * fincore(2) test program
+ */
+
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <getopt.h>
+#include <assert.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <uapi/linux/fincore.h>
+
+#define err(x) (perror(x), exit(1))
+
+void usage(char *str)
+{
+	printf(
+		"Usage: %s [-s start] [-l len] [-m mode] [-p pagesize] file\n"
+		"  -s: start offset (in bytes)\n"
+		"  -l: length to scan (in bytes)\n"
+		"  -m: fincore mode\n"
+		"  -p: set page size (for hugepage)\n"
+		"  -h: show this message\n"
+		, str);
+	exit(EXIT_SUCCESS);
+}
+
+static void show_fincore_buffer(long start, long nr_pages, int records_per_page,
+				int mode, unsigned char *buf)
+{
+	int i, j;
+	unsigned char *curuc = (unsigned char *)buf;
+	unsigned long *curul = (unsigned long *)buf;
+
+	for (i = 0; i < nr_pages; i++) {
+		j = 0;
+		if (mode & FINCORE_BMAP)
+			printf("buffer: 0x%lx\t%d", start + i, curuc[i + j]);
+		else if (mode & (FINCORE_LONGENTRY_MASK)) {
+			if (mode & FINCORE_PGOFF)
+				printf("buffer: 0x%lx",
+				       curul[i * records_per_page + (j++)]);
+			else
+				printf("buffer: 0x%lx", start + i);
+			if (mode & FINCORE_PFN)
+				printf("\t0x%lx",
+				       curul[i * records_per_page + (j++)]);
+			if (mode & FINCORE_PAGE_FLAGS)
+				printf("\t0x%lx",
+				       curul[i * records_per_page + (j++)]);
+		}
+		printf("\n");
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	char c;
+	int fd;
+	int ret;
+	int mode = FINCORE_PFN;
+	int width = sizeof(unsigned char);
+	int records_per_page = 1;
+	long pagesize = sysconf(_SC_PAGESIZE);
+	long nr_pages;
+	unsigned long start = 0;
+	int len_not_given = 1;
+	long len = 0;
+	long buffer_size = 0;
+	unsigned char *buf;
+	struct stat stat;
+	int extra = 0;
+	struct fincore_extra fe = {};
+
+	while ((c = getopt(argc, argv, "s:l:m:p:et:b:h")) != -1) {
+		switch (c) {
+		case 's':
+			start = strtoul(optarg, NULL, 0);
+			break;
+		case 'l':
+			len_not_given = 0;
+			len = strtol(optarg, NULL, 0);
+			break;
+		case 'm':
+			mode = strtoul(optarg, NULL, 0);
+			break;
+		case 'p':
+			pagesize = strtoul(optarg, NULL, 0);
+			break;
+		case 'e':
+			extra = 1;
+			break;
+		case 'b':
+			buffer_size = strtoul(optarg, NULL, 0);
+			break;
+		case 'h':
+		default:
+			usage(argv[0]);
+		}
+	}
+
+	fd = open(argv[optind], O_RDWR);
+	if (fd == -1)
+		err("open failed.");
+
+	/* scan to the end of file by default */
+	if (len_not_given) {
+		ret = fstat(fd, &stat);
+		if (ret == -1)
+			err("fstat failed.");
+		len = stat.st_size - start;
+	}
+
+	if (mode & FINCORE_LONGENTRY_MASK) {
+		records_per_page = ((mode & FINCORE_PGOFF ? 1 : 0) +
+				    (mode & FINCORE_PFN ? 1 : 0) +
+				    (mode & FINCORE_PAGE_FLAGS ? 1 : 0)
+			);
+		width = records_per_page * sizeof(unsigned long);
+	}
+
+	nr_pages = ((len + pagesize - 1) & (~(pagesize - 1))) / pagesize;
+	printf("start:0x%lx, len:%ld, mode:%d, pagesize:0x%lx,\n"
+	       "buffer_size:0x%lx, nr_pages:0x%lx, width:%d\n",
+	       start, len, mode, pagesize, buffer_size, nr_pages, width);
+	buf = malloc(buffer_size > 0 ? buffer_size : nr_pages * width);
+	if (!buf)
+		err("malloc");
+
+	ret = syscall(__NR_fincore, fd, start, nr_pages, mode, buf,
+		      extra ? &fe : NULL);
+	if (ret < 0)
+		err("fincore");
+	/*
+	 * print buffer to stdout, and parse it later for validation check.
+	 * fincore() returns the number of entries written to the buffer.
+	 */
+	show_fincore_buffer(start / pagesize, nr_pages, records_per_page,
+			    mode, buf);
+
+	if (extra)
+		printf("fincore_extra->nr_entries: %ld\n", fe.nr_entries);
+
+	ret = close(fd);
+	if (ret < 0)
+		err("close");
+	return 0;
+}
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/run_fincoretests v3.16-rc3/tools/testing/selftests/fincore/run_fincoretests
new file mode 100644
index 000000000000..881a1ce1fbee
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/run_fincoretests
@@ -0,0 +1,361 @@
+#!/bin/bash
+
+WDIR=./fincore_work
+mkdir $WDIR 2> /dev/null
+TMPF=`mktemp --tmpdir=$WDIR -d`
+export LANG=C
+
+sysctl -q vm.nr_hugepages=50
+
+#
+# common routines
+#
+abort() {
+    echo "Test abort"
+    exit 1
+}
+
+create_small_file() {
+    dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 > /dev/null 2>&1
+    dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 seek=8> /dev/null 2>&1
+    date >> $WDIR/smallfile
+    sync
+}
+
+create_large_file() {
+    dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 > /dev/null 2>&1
+    dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 seek=640> /dev/null 2>&1
+    sync
+}
+
+create_tmpfs_file() {
+    dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 > /dev/null 2>&1
+    dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 seek=8> /dev/null 2>&1
+    date >> /tmp/tmpfile
+    sync
+}
+
+create_hugetlb_file() {
+    if mount | grep $WDIR/hugepages > /dev/null ; then
+        echo "$WDIR/hugepages already mounted"
+    else
+        mkdir -p $WDIR/hugepages 2> /dev/null
+        mount -t hugetlbfs none $WDIR/hugepages 2> /dev/null
+        if [ $? -ne 0 ] ; then
+            echo "Failed to mount hugetlbfs" >&2
+            return 1
+        fi
+    fi
+    local hptotal=$(grep HugePages_Total: /proc/meminfo | tr -s ' ' | cut -f2 -d' ')
+    if [ "$hptotal" -lt 10 ] ; then
+        echo "Hugepage pool size need to be >= 10" >&2
+        return 1
+    fi
+    ./create_hugetlbfs_file $WDIR/hugepages/file
+    if [ $? -ne 0 ] ; then
+        echo "Failed to create hugetlb file" >&2
+        return 1
+    fi
+    return 0;
+}
+
+get_buffer() {
+    cat "$1" | grep '^buffer:' | cut -f 2- -d ' '
+}
+
+get_fincore_extra_nr_entries() {
+    cat "$1" | grep '^fincore_extra->nr_entries' | cut -f 2 -d ' '
+}
+
+nr_of_exist_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of on-memory pages should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+nr_of_nonexist_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of hole entries should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+nr_of_valid_entries_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of valid entries should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+check_einval() {
+    grep "fincore: Invalid argument" "$1" > /dev/null
+}
+
+#
+# Testcases
+#
+test_smallfile_bytemap() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x1 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 1 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_pfn() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x4 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_multientry() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0xc -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2,3 | grep -vP "0x0\t0x0" | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2,3 | grep -P "0x0\t0x0" | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_pfn_skiphole() {
+    local exist
+    local nonexist
+    local nr_entries
+    create_small_file
+
+    ./fincore -m 0x6 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 9 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+# in-kernel function sys_fincore() repeat copy_to_user() per 256 entries,
+# so testing for large file is meaningful testcase.
+test_largefile_pfn() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x4 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 768 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 256 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_offset() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x4 -s 0x80000 $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 640 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 256 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_overrun() {
+    local exist
+    local nonexist
+    local nr_entries
+    create_large_file
+
+    ./fincore -m 0x4 -s 0x80000 -l 0x400000 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 640 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 384 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 896 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_skiphole() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x6 -s 0x100000 -l 0x102000 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 258 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 0 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 258 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_tmpfs_pfn() {
+    local exist
+    local nonexist
+    create_tmpfs_file
+
+    ./fincore -m 0x4 /tmp/tmpfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_hugetlb_pfn() {
+    local exist
+    local nonexist
+    local exitcode=0
+    create_hugetlb_file
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fail to create a file on hugetlbfs"
+        return 1
+    fi
+    local hugepagesize=$[$(cat /proc/meminfo  | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+    ./fincore -p $hugepagesize -m 0x4 -e $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 6 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 10 "$nr_entries" "$FUNCNAME" || return 1
+    rm -rf $WDIR/hugepages/file
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_start_address() {
+    create_small_file
+    ./fincore -m 0x4 -s -0x4000 -l 1 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: negative start is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -s 0x100000 -l 1 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: too large start is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -s 0x30 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: fincore should fail for unaligned start address"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_len() {
+    create_small_file
+    ./fincore -m 0x4 -l 0 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: zero len is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -l -10 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: negative len is invalid"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_mode() {
+    create_small_file
+    ./fincore -m 0x0 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == NULL is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x5 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_PFN) is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x3 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_PGOFF) is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x6 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_PGOFF|FINCORE_PFN) is valid mode"
+        return 1
+    fi
+    ./fincore -m 0x2 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_PGOFF) is valid mode"
+        return 1
+    fi
+    ./fincore -m 0x1004 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (Unknown|FINCORE_PFN) is invalid mode"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_unaligned_start_address_hugetlb() {
+    local exist
+    local nonexist
+    local exitcode=0
+    create_hugetlb_file
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fail to create a file on hugetlbfs"
+        return 1
+    fi
+    local hugepagesize=$[$(cat /proc/meminfo  | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+    ./fincore -p $hugepagesize -m 0x4 -s 0x1000 $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: fincore should fail for page-unaligned start address"
+        return 1
+    fi
+    ./fincore -p $hugepagesize -m 0x4 -s $hugepagesize $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fincore should pass for hugepage-aligned start address"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_bytemap                 || abort
+test_smallfile_pfn                     || abort
+test_smallfile_multientry              || abort
+test_smallfile_pfn_skiphole            || abort
+test_largefile_pfn                     || abort
+test_largefile_pfn_offset              || abort
+test_largefile_pfn_overrun             || abort
+test_largefile_pfn_skiphole            || abort
+test_tmpfs_pfn                         || abort
+test_hugetlb_pfn                       || abort
+test_invalid_start_address             || abort
+test_invalid_len                       || abort
+test_invalid_mode                      || abort
+test_unaligned_start_address_hugetlb   || abort
+
+# cleanup
+rm -rf $WDIR/hugepages/file
+umount $WDIR/hugepages > /dev/null 2>&1
+
+exit 0
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 2/3] selftests/fincore: add test code for fincore()
@ 2014-07-07 18:00   ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 18:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

This patch adds simple test programs for fincore(), which contains the
following testcase:
  - test_smallfile_bytemap
  - test_smallfile_pfn
  - test_smallfile_multientry
  - test_smallfile_pfn_skiphole
  - test_largefile_pfn
  - test_largefile_pfn_offset
  - test_largefile_pfn_overrun
  - test_largefile_pfn_skiphole
  - test_tmpfs_pfn
  - test_hugetlb_pfn
  - test_invalid_start_address
  - test_invalid_len
  - test_invalid_mode
  - test_unaligned_start_address_hugetlb

ChangeLog v3:
- remove pagecache tag stuff

ChangeLog v2:
- include uapi/linux/pagecache.h
- add testcase test_invalid_start_address and test_invalid_len
- other small changes to adjust for the kernel's changes

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/fincore/Makefile           |  31 ++
 .../selftests/fincore/create_hugetlbfs_file.c      |  49 +++
 tools/testing/selftests/fincore/fincore.c          | 153 +++++++++
 tools/testing/selftests/fincore/run_fincoretests   | 361 +++++++++++++++++++++
 5 files changed, 595 insertions(+)
 create mode 100644 tools/testing/selftests/fincore/Makefile
 create mode 100644 tools/testing/selftests/fincore/create_hugetlbfs_file.c
 create mode 100644 tools/testing/selftests/fincore/fincore.c
 create mode 100644 tools/testing/selftests/fincore/run_fincoretests

diff --git v3.16-rc3.orig/tools/testing/selftests/Makefile v3.16-rc3/tools/testing/selftests/Makefile
index e66e710cc595..91e817b87a9e 100644
--- v3.16-rc3.orig/tools/testing/selftests/Makefile
+++ v3.16-rc3/tools/testing/selftests/Makefile
@@ -11,6 +11,7 @@ TARGETS += vm
 TARGETS += powerpc
 TARGETS += user
 TARGETS += sysctl
+TARGETS += fincore
 
 all:
 	for TARGET in $(TARGETS); do \
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/Makefile v3.16-rc3/tools/testing/selftests/fincore/Makefile
new file mode 100644
index 000000000000..ab4361c70da5
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/Makefile
@@ -0,0 +1,31 @@
+# Makefile for vm selftests
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+        ARCH := X86
+        CFLAGS := -DCONFIG_X86_32 -D__i386__
+endif
+ifeq ($(ARCH),x86_64)
+        ARCH := X86
+        CFLAGS := -DCONFIG_X86_64 -D__x86_64__
+endif
+
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+CFLAGS += -I../../../../usr/include/
+CFLAGS += -I../../../../arch/x86/include/
+
+BINARIES = fincore create_hugetlbfs_file
+
+all: $(BINARIES)
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+	@/bin/sh ./run_fincoretests || (echo "fincoretests: [FAIL]"; exit 1)
+
+clean:
+	$(RM) $(BINARIES)
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/create_hugetlbfs_file.c v3.16-rc3/tools/testing/selftests/fincore/create_hugetlbfs_file.c
new file mode 100644
index 000000000000..a46ccf0af5f2
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/create_hugetlbfs_file.c
@@ -0,0 +1,49 @@
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdlib.h>
+
+#define err(x) (perror(x), exit(1))
+
+unsigned long default_hugepage_size(void)
+{
+	unsigned long hps = 0;
+	char *line = NULL;
+	size_t linelen = 0;
+	FILE *f = fopen("/proc/meminfo", "r");
+	if (!f)
+		err("open /proc/meminfo");
+	while (getline(&line, &linelen, f) > 0) {
+		if (sscanf(line, "Hugepagesize:	%lu kB", &hps) == 1) {
+			hps <<= 10;
+			break;
+		}
+	}
+	free(line);
+	return hps;
+}
+
+int main(int argc, char **argv)
+{
+	int ret;
+	int fd;
+	char *p;
+	unsigned long hpsize = default_hugepage_size();
+	fd = open(argv[1], O_RDWR|O_CREAT);
+	if (fd == -1)
+		err("open");
+	p = mmap(NULL, 10 * hpsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+	if (p == (void *)-1)
+		err("mmap");
+	memset(p, 'a', 3 * hpsize);
+	memset(p + 7 * hpsize, 'a', 3 * hpsize - 1);
+	ret = close(fd);
+	if (ret == -1)
+		err("close");
+	return 0;
+}
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/fincore.c v3.16-rc3/tools/testing/selftests/fincore/fincore.c
new file mode 100644
index 000000000000..786a985ba5bd
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/fincore.c
@@ -0,0 +1,153 @@
+/*
+ * fincore(2) test program
+ */
+
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <getopt.h>
+#include <assert.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <uapi/linux/fincore.h>
+
+#define err(x) (perror(x), exit(1))
+
+void usage(char *str)
+{
+	printf(
+		"Usage: %s [-s start] [-l len] [-m mode] [-p pagesize] file\n"
+		"  -s: start offset (in bytes)\n"
+		"  -l: length to scan (in bytes)\n"
+		"  -m: fincore mode\n"
+		"  -p: set page size (for hugepage)\n"
+		"  -h: show this message\n"
+		, str);
+	exit(EXIT_SUCCESS);
+}
+
+static void show_fincore_buffer(long start, long nr_pages, int records_per_page,
+				int mode, unsigned char *buf)
+{
+	int i, j;
+	unsigned char *curuc = (unsigned char *)buf;
+	unsigned long *curul = (unsigned long *)buf;
+
+	for (i = 0; i < nr_pages; i++) {
+		j = 0;
+		if (mode & FINCORE_BMAP)
+			printf("buffer: 0x%lx\t%d", start + i, curuc[i + j]);
+		else if (mode & (FINCORE_LONGENTRY_MASK)) {
+			if (mode & FINCORE_PGOFF)
+				printf("buffer: 0x%lx",
+				       curul[i * records_per_page + (j++)]);
+			else
+				printf("buffer: 0x%lx", start + i);
+			if (mode & FINCORE_PFN)
+				printf("\t0x%lx",
+				       curul[i * records_per_page + (j++)]);
+			if (mode & FINCORE_PAGE_FLAGS)
+				printf("\t0x%lx",
+				       curul[i * records_per_page + (j++)]);
+		}
+		printf("\n");
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	char c;
+	int fd;
+	int ret;
+	int mode = FINCORE_PFN;
+	int width = sizeof(unsigned char);
+	int records_per_page = 1;
+	long pagesize = sysconf(_SC_PAGESIZE);
+	long nr_pages;
+	unsigned long start = 0;
+	int len_not_given = 1;
+	long len = 0;
+	long buffer_size = 0;
+	unsigned char *buf;
+	struct stat stat;
+	int extra = 0;
+	struct fincore_extra fe = {};
+
+	while ((c = getopt(argc, argv, "s:l:m:p:et:b:h")) != -1) {
+		switch (c) {
+		case 's':
+			start = strtoul(optarg, NULL, 0);
+			break;
+		case 'l':
+			len_not_given = 0;
+			len = strtol(optarg, NULL, 0);
+			break;
+		case 'm':
+			mode = strtoul(optarg, NULL, 0);
+			break;
+		case 'p':
+			pagesize = strtoul(optarg, NULL, 0);
+			break;
+		case 'e':
+			extra = 1;
+			break;
+		case 'b':
+			buffer_size = strtoul(optarg, NULL, 0);
+			break;
+		case 'h':
+		default:
+			usage(argv[0]);
+		}
+	}
+
+	fd = open(argv[optind], O_RDWR);
+	if (fd == -1)
+		err("open failed.");
+
+	/* scan to the end of file by default */
+	if (len_not_given) {
+		ret = fstat(fd, &stat);
+		if (ret == -1)
+			err("fstat failed.");
+		len = stat.st_size - start;
+	}
+
+	if (mode & FINCORE_LONGENTRY_MASK) {
+		records_per_page = ((mode & FINCORE_PGOFF ? 1 : 0) +
+				    (mode & FINCORE_PFN ? 1 : 0) +
+				    (mode & FINCORE_PAGE_FLAGS ? 1 : 0)
+			);
+		width = records_per_page * sizeof(unsigned long);
+	}
+
+	nr_pages = ((len + pagesize - 1) & (~(pagesize - 1))) / pagesize;
+	printf("start:0x%lx, len:%ld, mode:%d, pagesize:0x%lx,\n"
+	       "buffer_size:0x%lx, nr_pages:0x%lx, width:%d\n",
+	       start, len, mode, pagesize, buffer_size, nr_pages, width);
+	buf = malloc(buffer_size > 0 ? buffer_size : nr_pages * width);
+	if (!buf)
+		err("malloc");
+
+	ret = syscall(__NR_fincore, fd, start, nr_pages, mode, buf,
+		      extra ? &fe : NULL);
+	if (ret < 0)
+		err("fincore");
+	/*
+	 * print buffer to stdout, and parse it later for validation check.
+	 * fincore() returns the number of entries written to the buffer.
+	 */
+	show_fincore_buffer(start / pagesize, nr_pages, records_per_page,
+			    mode, buf);
+
+	if (extra)
+		printf("fincore_extra->nr_entries: %ld\n", fe.nr_entries);
+
+	ret = close(fd);
+	if (ret < 0)
+		err("close");
+	return 0;
+}
diff --git v3.16-rc3.orig/tools/testing/selftests/fincore/run_fincoretests v3.16-rc3/tools/testing/selftests/fincore/run_fincoretests
new file mode 100644
index 000000000000..881a1ce1fbee
--- /dev/null
+++ v3.16-rc3/tools/testing/selftests/fincore/run_fincoretests
@@ -0,0 +1,361 @@
+#!/bin/bash
+
+WDIR=./fincore_work
+mkdir $WDIR 2> /dev/null
+TMPF=`mktemp --tmpdir=$WDIR -d`
+export LANG=C
+
+sysctl -q vm.nr_hugepages=50
+
+#
+# common routines
+#
+abort() {
+    echo "Test abort"
+    exit 1
+}
+
+create_small_file() {
+    dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 > /dev/null 2>&1
+    dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 seek=8> /dev/null 2>&1
+    date >> $WDIR/smallfile
+    sync
+}
+
+create_large_file() {
+    dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 > /dev/null 2>&1
+    dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 seek=640> /dev/null 2>&1
+    sync
+}
+
+create_tmpfs_file() {
+    dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 > /dev/null 2>&1
+    dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 seek=8> /dev/null 2>&1
+    date >> /tmp/tmpfile
+    sync
+}
+
+create_hugetlb_file() {
+    if mount | grep $WDIR/hugepages > /dev/null ; then
+        echo "$WDIR/hugepages already mounted"
+    else
+        mkdir -p $WDIR/hugepages 2> /dev/null
+        mount -t hugetlbfs none $WDIR/hugepages 2> /dev/null
+        if [ $? -ne 0 ] ; then
+            echo "Failed to mount hugetlbfs" >&2
+            return 1
+        fi
+    fi
+    local hptotal=$(grep HugePages_Total: /proc/meminfo | tr -s ' ' | cut -f2 -d' ')
+    if [ "$hptotal" -lt 10 ] ; then
+        echo "Hugepage pool size need to be >= 10" >&2
+        return 1
+    fi
+    ./create_hugetlbfs_file $WDIR/hugepages/file
+    if [ $? -ne 0 ] ; then
+        echo "Failed to create hugetlb file" >&2
+        return 1
+    fi
+    return 0;
+}
+
+get_buffer() {
+    cat "$1" | grep '^buffer:' | cut -f 2- -d ' '
+}
+
+get_fincore_extra_nr_entries() {
+    cat "$1" | grep '^fincore_extra->nr_entries' | cut -f 2 -d ' '
+}
+
+nr_of_exist_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of on-memory pages should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+nr_of_nonexist_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of hole entries should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+nr_of_valid_entries_should_be() {
+    if [ "$1" -ne "$2" ] ; then
+        echo "[FAIL] $3: Number of valid entries should be $1, but got $2"
+        return 1
+    fi
+    return 0
+}
+
+check_einval() {
+    grep "fincore: Invalid argument" "$1" > /dev/null
+}
+
+#
+# Testcases
+#
+test_smallfile_bytemap() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x1 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 1 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_pfn() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0x4 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_multientry() {
+    local exist
+    local nonexist
+    create_small_file
+
+    ./fincore -m 0xc -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2,3 | grep -vP "0x0\t0x0" | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2,3 | grep -P "0x0\t0x0" | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_pfn_skiphole() {
+    local exist
+    local nonexist
+    local nr_entries
+    create_small_file
+
+    ./fincore -m 0x6 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 9 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+# in-kernel function sys_fincore() repeat copy_to_user() per 256 entries,
+# so testing for large file is meaningful testcase.
+test_largefile_pfn() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x4 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 768 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 256 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_offset() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x4 -s 0x80000 $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 640 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 256 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_overrun() {
+    local exist
+    local nonexist
+    local nr_entries
+    create_large_file
+
+    ./fincore -m 0x4 -s 0x80000 -l 0x400000 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 640 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 384 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 896 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_largefile_pfn_skiphole() {
+    local exist
+    local nonexist
+    create_large_file
+
+    ./fincore -m 0x6 -s 0x100000 -l 0x102000 -e $WDIR/largefile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 258 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 0 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 258 "$nr_entries" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_tmpfs_pfn() {
+    local exist
+    local nonexist
+    create_tmpfs_file
+
+    ./fincore -m 0x4 /tmp/tmpfile > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_of_exist_should_be 9 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    echo "[PASS] $FUNCNAME"
+}
+
+test_hugetlb_pfn() {
+    local exist
+    local nonexist
+    local exitcode=0
+    create_hugetlb_file
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fail to create a file on hugetlbfs"
+        return 1
+    fi
+    local hugepagesize=$[$(cat /proc/meminfo  | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+    ./fincore -p $hugepagesize -m 0x4 -e $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    exist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+    nonexist=$(get_buffer $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+    nr_entries=$(get_fincore_extra_nr_entries $TMPF/$FUNCNAME)
+    nr_of_exist_should_be 6 "$exist" "$FUNCNAME" || return 1
+    nr_of_nonexist_should_be 4 "$nonexist" "$FUNCNAME" || return 1
+    nr_of_valid_entries_should_be 10 "$nr_entries" "$FUNCNAME" || return 1
+    rm -rf $WDIR/hugepages/file
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_start_address() {
+    create_small_file
+    ./fincore -m 0x4 -s -0x4000 -l 1 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: negative start is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -s 0x100000 -l 1 -e $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: too large start is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -s 0x30 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: fincore should fail for unaligned start address"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_len() {
+    create_small_file
+    ./fincore -m 0x4 -l 0 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: zero len is invalid"
+        return 1
+    fi
+    ./fincore -m 0x4 -l -10 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: negative len is invalid"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_mode() {
+    create_small_file
+    ./fincore -m 0x0 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == NULL is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x5 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_PFN) is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x3 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_PGOFF) is invalid mode"
+        return 1
+    fi
+    ./fincore -m 0x6 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_PGOFF|FINCORE_PFN) is valid mode"
+        return 1
+    fi
+    ./fincore -m 0x2 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: mode == (FINCORE_PGOFF) is valid mode"
+        return 1
+    fi
+    ./fincore -m 0x1004 $WDIR/smallfile > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: mode == (Unknown|FINCORE_PFN) is invalid mode"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_unaligned_start_address_hugetlb() {
+    local exist
+    local nonexist
+    local exitcode=0
+    create_hugetlb_file
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fail to create a file on hugetlbfs"
+        return 1
+    fi
+    local hugepagesize=$[$(cat /proc/meminfo  | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+    ./fincore -p $hugepagesize -m 0x4 -s 0x1000 $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    if [ $? -eq 0 ] || ! check_einval $TMPF/$FUNCNAME ; then
+        echo "[FAIL] $FUNCNAME: fincore should fail for page-unaligned start address"
+        return 1
+    fi
+    ./fincore -p $hugepagesize -m 0x4 -s $hugepagesize $WDIR/hugepages/file > $TMPF/$FUNCNAME 2>&1
+    if [ $? -ne 0 ] ; then
+        echo "[FAIL] $FUNCNAME: fincore should pass for hugepage-aligned start address"
+        return 1
+    fi
+    echo "[PASS] $FUNCNAME"
+}
+
+test_smallfile_bytemap                 || abort
+test_smallfile_pfn                     || abort
+test_smallfile_multientry              || abort
+test_smallfile_pfn_skiphole            || abort
+test_largefile_pfn                     || abort
+test_largefile_pfn_offset              || abort
+test_largefile_pfn_overrun             || abort
+test_largefile_pfn_skiphole            || abort
+test_tmpfs_pfn                         || abort
+test_hugetlb_pfn                       || abort
+test_invalid_start_address             || abort
+test_invalid_len                       || abort
+test_invalid_mode                      || abort
+test_unaligned_start_address_hugetlb   || abort
+
+# cleanup
+rm -rf $WDIR/hugepages/file
+umount $WDIR/hugepages > /dev/null 2>&1
+
+exit 0
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2)
  2014-07-07 18:00 ` Naoya Horiguchi
@ 2014-07-07 18:00   ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 18:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

This patch adds the man page for the new system call fincore(2).

ChangeLog v3:
- remove page cache tag stuff

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 man2/fincore.2 | 348 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 348 insertions(+)
 create mode 100644 man2/fincore.2

diff --git v3.16-rc3.orig/man2/fincore.2 v3.16-rc3/man2/fincore.2
new file mode 100644
index 000000000000..b4893084cd36
--- /dev/null
+++ v3.16-rc3/man2/fincore.2
@@ -0,0 +1,348 @@
+.\" Copyright (C) 2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH FINCORE 2 2014-07-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+fincore \- get page cache information
+.SH SYNOPSIS
+.nf
+.B #include <sys/fincore.h>
+.B #include <sys/kernel-page-flags.h>
+.sp
+.BI "int fincore(int " fd ", loff_t " start ", long " nr_pages ", int " mode ,
+.BI "            unsigned char *" vec ", struct fincore_extra *" extra );
+.fi
+.SH DESCRIPTION
+.BR fincore ()
+extracts information of in-core data of (i.e., page caches for)
+the file referred to by the file descriptor
+.IR fd .
+The kernel scans over the page cache tree,
+starting at the in-file offset
+.I start
+(in bytes) until
+.I nr_pages
+entries in the userspace buffer pointed to by
+.IR vec
+are filled with the page cache's data,
+or until the scan reached the end of the file.
+The format of each entry stored in
+.I vec
+depends on the
+.IR mode .
+The extra argument
+.I extra
+is used to pass the additional data between the kernel and the userspace.
+This is optional, so you may set
+.I extra
+to NULL if unnecessary.
+The structure
+.I fincore_extra
+is defined like:
+.in +4n
+.nf
+
+struct fincore_extra {
+        unsigned long nr_entries;
+};
+
+.fi
+.in
+The field
+.I nr_entries
+is an output parameter, set to the number of valid entries stored in
+.IR vec
+by the kernel on return.
+
+The
+.I start
+argument must be aligned to the page cache size boundary.
+In most cases, it's the page size boundary,
+but if called for a hugetlbfs file,
+the page cache size is the size of the hugepage associated with the file,
+so
+.I start
+must be aligned to the hugepage size boundary.
+
+The
+.I mode
+argument determines the data format of each entry in the user buffer
+.IR vec :
+.TP
+.B FINCORE_BMAP (0)
+In this mode,
+1 byte vector is stored in
+.I vec
+on return.
+The least significant bit of each byte is set if the corresponding page
+is currently resident in memory, and is cleared otherwise.
+(The other bits in each byte are undefined and reserved for future use.)
+.LP
+Any of the following flags are to be set to add an 8 byte field in each entry.
+You can set any of these flags at the same time, although you can't set
+FINCORE_BMAP combined with these 8 byte field flags.
+.TP
+.B FINCORE_PGOFF (1)
+This flag indicates that each entry contains a page offset field.
+With this information, you don't have to get data for hole range,
+so they are not stored in
+.I vec
+any longer.
+Note that if you call with this flag, you can't predict how many valid
+entries are stored in the buffer on return. So the
+.I nr_entries
+field in
+.I struct fincore_extra
+is useful if you want it.
+.TP
+.B FINCORE_PFN (2)
+This flag indicates that each entry contains a page frame number
+(i.e., physical address in page size unit) field.
+.TP
+.B FINCORE_PAGE_FLAGS (3)
+This flag indicates that each entry contains a page flags field.
+See KERNEL PAGE FLAGS section for more detail about each bit.
+.LP
+The size of the buffer
+.I vec
+must be at least
+.I nr_pages
+bytes if FINCORE_BMAP is set,
+and
+.I (8*n*nr_pages)
+bytes if some of the 8 byte field flags are set,
+where
+.I n
+means the number of 8 byte field flags being set.
+When multiple 8 byte field flags are set, the order of data in each
+entry is the same as one in the bit definition order (shown above
+as the numbers in parentheses.)
+For example, when you set FINCORE_PGOFF (bit 1) and FINCORE_PAGE_FLAGS (bit 3,)
+the first 8 bytes in an entry is the page offset,
+and the second 8 bytes is the page flags.
+
+Note that the information returned by the kernel is just a snapshot:
+pages which are not locked in memory can be freed at any moment, and
+the contents of
+.I vec
+may already be stale by the time the caller refers to the data.
+.SH KERNEL PAGE FLAGS
+.TP
+.B KPF_LOCKED (0)
+The lock on the page is held, suggesting that the kernel may be
+doing some page-related sensitive operation.
+.TP
+.B KPF_ERROR (1)
+The page was affected by IO error or memory error, so the data on the page
+might be lost.
+.TP
+.B KPF_REFERENCED (2)
+This page flag is used to control the page reclaim, combined with KPF_ACTIVE.
+.TP
+.B KPF_UPTODATE (3)
+The page has valid contents.
+.TP
+.B KPF_DIRTY (4)
+The data of the page is not synchronized with one on the backing storage.
+.TP
+.B KPF_LRU (5)
+The page is linked to one of the LRU (Least Recently Update) lists.
+.TP
+.B KPF_ACTIVE (6)
+The page is linked to one of the active LRU lists.
+.TP
+.B KPF_SLAB (7)
+The page is used to construct slabs, which is managed by the kernel
+to allocate various types of kernel objects.
+.TP
+.B KPF_WRITEBACK (8)
+The page is under the writeback operation.
+.TP
+.B KPF_RECLAIM (9)
+The page is under the page reclaim operation.
+.TP
+.B KPF_BUDDY (10)
+The page is under the buddy allocator as a free page. Note that this flag
+is only set to the first page of the "buddy" (i.e., the chunk of free pages.)
+.TP
+.B KPF_MMAP (11)
+The page is mapped to the virtual address space of some processes.
+.TP
+.B KPF_ANON (12)
+The page is anonymous page.
+.TP
+.B KPF_SWAPCACHE (13)
+The page has its own copy of the data on the swap device.
+.TP
+.B KPF_SWAPBACKED (14)
+The page can be swapped out. This flag is set on anonymous pages,
+tmpfs pages, or shmem page.
+.TP
+.B KPF_COMPOUND_HEAD (15)
+The page belongs to a high-order page, and is its first page.
+.TP
+.B KPF_COMPOUND_TAIL (16)
+The page belongs to a high-order page, and is not its first page.
+.TP
+.B KPF_HUGE (17)
+The page is used to construct a hugepage.
+.TP
+.B KPF_UNEVICTABLE (18)
+The page is prevented from being freed.
+This is caused by
+.BR mlock (2)
+or shared memory with
+.BR SHM_LOCK .
+.TP
+.B KPF_HWPOISON (19)
+The page is affected by a hardware error on the memory.
+.TP
+.B KPF_NOPAGE (20)
+This is a pseudo page flag which indicates that the given address
+has no struct page backed.
+.TP
+.B KPF_KSM (21)
+The page is a shared page governed by KSM (Kernel Shared Merging.)
+.TP
+.B KPF_THP (22)
+The page is used to construct a transparent hugepage.
+.LP
+.SH RETURN VALUE
+On success,
+.BR fincore ()
+returns 0.
+On error, \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EBADF
+.I fd
+is not a valid file descriptor.
+.TP
+.B EFAULT
+.I vec
+points to an invalid address.
+.TP
+.B EINVAL
+.I start
+is unaligned to page cache size or is out-of-range
+(negative or larger than the file size.)
+Or
+.I nr_pages
+is not a positive value.
+Or
+.I mode
+contained a undefined flag, or contained no flag,
+or contained both of FINCORE_BMAP and one of the "8 byte field" flags.
+.SH VERSIONS
+TBD
+.SH CONFORMING TO
+TBD
+
+.SH EXAMPLE
+.PP
+The following program is an example that shows the page cache information
+of the file specified in its first command-line argument to the standard
+output.
+
+.nf
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <sys/fincore.h>
+
+#define err(msg) do { perror(msg); exit(1); } while (0)
+
+int main(int argc, char *argv[])
+{
+    int i, j;
+    int fd;
+    int ret;
+    long ps = sysconf(_SC_PAGESIZE);
+    long nr_pages;
+    unsigned char *buf;
+    struct stat stat;
+    struct fincore_extra fe = {};
+
+    fd = open(argv[1], O_RDWR);
+    if (fd == \-1)
+        err("open");
+
+    ret = fstat(fd, &stat);
+    if (ret == \-1)
+        err("fstat");
+    nr_pages = ((stat.st_size + ps \- 1) & (~(ps \- 1))) / ps;
+
+    buf = malloc(nr_pages * 24);
+    if (!buf)
+        err("malloc");
+
+    /* byte map */
+    ret = fincore(fd, 0, nr_pages, FINCORE_BMAP, buf, NULL);
+    if (ret < 0)
+        err("fincore");
+    printf("Page residency:");
+    for (i = 0; i < nr_pages; i++)
+        printf("%d", buf[i]);
+    printf("\\n\\n");
+
+    /* 8 byte entry */
+    ret = fincore(fd, 0, nr_pages,
+                  FINCORE_PFN|FINCORE_PAGE_FLAGS, buf, &fe);
+    if (ret < 0)
+        err("fincore");
+    printf("pfn\\tflags %lx\\n", fe.nr_entries);
+    for (i = 0; i < fe.nr_entries; i++) {
+        for (j = 0; j < 2; j++)
+            printf("0x%lx\\t", *(unsigned long *)(buf + (i*2+j)*8));
+        printf("\\n");
+    }
+    printf("\\n");
+
+    /* 8 byte entry with page offset (no hole scanned) */
+    ret = fincore(fd, 0, nr_pages,
+              FINCORE_PGOFF|FINCORE_PFN|FINCORE_PAGE_FLAGS, buf, &fe);
+    if (ret < 0)
+        err("fincore");
+    printf("pgoff\\tpfn\\tflags %lx\\n", fe.nr_entries);
+    for (i = 0; i < fe.nr_entries; i++) {
+        for (j = 0; j < 3; j++)
+            printf("0x%lx\\t", *(unsigned long *)(buf + (i*3+j)*8));
+        printf("\\n");
+    }
+
+    free(buf);
+
+    ret = close(fd);
+    if (ret < 0)
+        err("close");
+    return 0;
+}
+.fi
+.SH SEE ALSO
+.BR mincore (2),
+.BR fsync (2)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2)
@ 2014-07-07 18:00   ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 18:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Dave Hansen, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

This patch adds the man page for the new system call fincore(2).

ChangeLog v3:
- remove page cache tag stuff

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 man2/fincore.2 | 348 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 348 insertions(+)
 create mode 100644 man2/fincore.2

diff --git v3.16-rc3.orig/man2/fincore.2 v3.16-rc3/man2/fincore.2
new file mode 100644
index 000000000000..b4893084cd36
--- /dev/null
+++ v3.16-rc3/man2/fincore.2
@@ -0,0 +1,348 @@
+.\" Copyright (C) 2014 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH FINCORE 2 2014-07-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+fincore \- get page cache information
+.SH SYNOPSIS
+.nf
+.B #include <sys/fincore.h>
+.B #include <sys/kernel-page-flags.h>
+.sp
+.BI "int fincore(int " fd ", loff_t " start ", long " nr_pages ", int " mode ,
+.BI "            unsigned char *" vec ", struct fincore_extra *" extra );
+.fi
+.SH DESCRIPTION
+.BR fincore ()
+extracts information of in-core data of (i.e., page caches for)
+the file referred to by the file descriptor
+.IR fd .
+The kernel scans over the page cache tree,
+starting at the in-file offset
+.I start
+(in bytes) until
+.I nr_pages
+entries in the userspace buffer pointed to by
+.IR vec
+are filled with the page cache's data,
+or until the scan reached the end of the file.
+The format of each entry stored in
+.I vec
+depends on the
+.IR mode .
+The extra argument
+.I extra
+is used to pass the additional data between the kernel and the userspace.
+This is optional, so you may set
+.I extra
+to NULL if unnecessary.
+The structure
+.I fincore_extra
+is defined like:
+.in +4n
+.nf
+
+struct fincore_extra {
+        unsigned long nr_entries;
+};
+
+.fi
+.in
+The field
+.I nr_entries
+is an output parameter, set to the number of valid entries stored in
+.IR vec
+by the kernel on return.
+
+The
+.I start
+argument must be aligned to the page cache size boundary.
+In most cases, it's the page size boundary,
+but if called for a hugetlbfs file,
+the page cache size is the size of the hugepage associated with the file,
+so
+.I start
+must be aligned to the hugepage size boundary.
+
+The
+.I mode
+argument determines the data format of each entry in the user buffer
+.IR vec :
+.TP
+.B FINCORE_BMAP (0)
+In this mode,
+1 byte vector is stored in
+.I vec
+on return.
+The least significant bit of each byte is set if the corresponding page
+is currently resident in memory, and is cleared otherwise.
+(The other bits in each byte are undefined and reserved for future use.)
+.LP
+Any of the following flags are to be set to add an 8 byte field in each entry.
+You can set any of these flags at the same time, although you can't set
+FINCORE_BMAP combined with these 8 byte field flags.
+.TP
+.B FINCORE_PGOFF (1)
+This flag indicates that each entry contains a page offset field.
+With this information, you don't have to get data for hole range,
+so they are not stored in
+.I vec
+any longer.
+Note that if you call with this flag, you can't predict how many valid
+entries are stored in the buffer on return. So the
+.I nr_entries
+field in
+.I struct fincore_extra
+is useful if you want it.
+.TP
+.B FINCORE_PFN (2)
+This flag indicates that each entry contains a page frame number
+(i.e., physical address in page size unit) field.
+.TP
+.B FINCORE_PAGE_FLAGS (3)
+This flag indicates that each entry contains a page flags field.
+See KERNEL PAGE FLAGS section for more detail about each bit.
+.LP
+The size of the buffer
+.I vec
+must be at least
+.I nr_pages
+bytes if FINCORE_BMAP is set,
+and
+.I (8*n*nr_pages)
+bytes if some of the 8 byte field flags are set,
+where
+.I n
+means the number of 8 byte field flags being set.
+When multiple 8 byte field flags are set, the order of data in each
+entry is the same as one in the bit definition order (shown above
+as the numbers in parentheses.)
+For example, when you set FINCORE_PGOFF (bit 1) and FINCORE_PAGE_FLAGS (bit 3,)
+the first 8 bytes in an entry is the page offset,
+and the second 8 bytes is the page flags.
+
+Note that the information returned by the kernel is just a snapshot:
+pages which are not locked in memory can be freed at any moment, and
+the contents of
+.I vec
+may already be stale by the time the caller refers to the data.
+.SH KERNEL PAGE FLAGS
+.TP
+.B KPF_LOCKED (0)
+The lock on the page is held, suggesting that the kernel may be
+doing some page-related sensitive operation.
+.TP
+.B KPF_ERROR (1)
+The page was affected by IO error or memory error, so the data on the page
+might be lost.
+.TP
+.B KPF_REFERENCED (2)
+This page flag is used to control the page reclaim, combined with KPF_ACTIVE.
+.TP
+.B KPF_UPTODATE (3)
+The page has valid contents.
+.TP
+.B KPF_DIRTY (4)
+The data of the page is not synchronized with one on the backing storage.
+.TP
+.B KPF_LRU (5)
+The page is linked to one of the LRU (Least Recently Update) lists.
+.TP
+.B KPF_ACTIVE (6)
+The page is linked to one of the active LRU lists.
+.TP
+.B KPF_SLAB (7)
+The page is used to construct slabs, which is managed by the kernel
+to allocate various types of kernel objects.
+.TP
+.B KPF_WRITEBACK (8)
+The page is under the writeback operation.
+.TP
+.B KPF_RECLAIM (9)
+The page is under the page reclaim operation.
+.TP
+.B KPF_BUDDY (10)
+The page is under the buddy allocator as a free page. Note that this flag
+is only set to the first page of the "buddy" (i.e., the chunk of free pages.)
+.TP
+.B KPF_MMAP (11)
+The page is mapped to the virtual address space of some processes.
+.TP
+.B KPF_ANON (12)
+The page is anonymous page.
+.TP
+.B KPF_SWAPCACHE (13)
+The page has its own copy of the data on the swap device.
+.TP
+.B KPF_SWAPBACKED (14)
+The page can be swapped out. This flag is set on anonymous pages,
+tmpfs pages, or shmem page.
+.TP
+.B KPF_COMPOUND_HEAD (15)
+The page belongs to a high-order page, and is its first page.
+.TP
+.B KPF_COMPOUND_TAIL (16)
+The page belongs to a high-order page, and is not its first page.
+.TP
+.B KPF_HUGE (17)
+The page is used to construct a hugepage.
+.TP
+.B KPF_UNEVICTABLE (18)
+The page is prevented from being freed.
+This is caused by
+.BR mlock (2)
+or shared memory with
+.BR SHM_LOCK .
+.TP
+.B KPF_HWPOISON (19)
+The page is affected by a hardware error on the memory.
+.TP
+.B KPF_NOPAGE (20)
+This is a pseudo page flag which indicates that the given address
+has no struct page backed.
+.TP
+.B KPF_KSM (21)
+The page is a shared page governed by KSM (Kernel Shared Merging.)
+.TP
+.B KPF_THP (22)
+The page is used to construct a transparent hugepage.
+.LP
+.SH RETURN VALUE
+On success,
+.BR fincore ()
+returns 0.
+On error, \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EBADF
+.I fd
+is not a valid file descriptor.
+.TP
+.B EFAULT
+.I vec
+points to an invalid address.
+.TP
+.B EINVAL
+.I start
+is unaligned to page cache size or is out-of-range
+(negative or larger than the file size.)
+Or
+.I nr_pages
+is not a positive value.
+Or
+.I mode
+contained a undefined flag, or contained no flag,
+or contained both of FINCORE_BMAP and one of the "8 byte field" flags.
+.SH VERSIONS
+TBD
+.SH CONFORMING TO
+TBD
+
+.SH EXAMPLE
+.PP
+The following program is an example that shows the page cache information
+of the file specified in its first command-line argument to the standard
+output.
+
+.nf
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <sys/fincore.h>
+
+#define err(msg) do { perror(msg); exit(1); } while (0)
+
+int main(int argc, char *argv[])
+{
+    int i, j;
+    int fd;
+    int ret;
+    long ps = sysconf(_SC_PAGESIZE);
+    long nr_pages;
+    unsigned char *buf;
+    struct stat stat;
+    struct fincore_extra fe = {};
+
+    fd = open(argv[1], O_RDWR);
+    if (fd == \-1)
+        err("open");
+
+    ret = fstat(fd, &stat);
+    if (ret == \-1)
+        err("fstat");
+    nr_pages = ((stat.st_size + ps \- 1) & (~(ps \- 1))) / ps;
+
+    buf = malloc(nr_pages * 24);
+    if (!buf)
+        err("malloc");
+
+    /* byte map */
+    ret = fincore(fd, 0, nr_pages, FINCORE_BMAP, buf, NULL);
+    if (ret < 0)
+        err("fincore");
+    printf("Page residency:");
+    for (i = 0; i < nr_pages; i++)
+        printf("%d", buf[i]);
+    printf("\\n\\n");
+
+    /* 8 byte entry */
+    ret = fincore(fd, 0, nr_pages,
+                  FINCORE_PFN|FINCORE_PAGE_FLAGS, buf, &fe);
+    if (ret < 0)
+        err("fincore");
+    printf("pfn\\tflags %lx\\n", fe.nr_entries);
+    for (i = 0; i < fe.nr_entries; i++) {
+        for (j = 0; j < 2; j++)
+            printf("0x%lx\\t", *(unsigned long *)(buf + (i*2+j)*8));
+        printf("\\n");
+    }
+    printf("\\n");
+
+    /* 8 byte entry with page offset (no hole scanned) */
+    ret = fincore(fd, 0, nr_pages,
+              FINCORE_PGOFF|FINCORE_PFN|FINCORE_PAGE_FLAGS, buf, &fe);
+    if (ret < 0)
+        err("fincore");
+    printf("pgoff\\tpfn\\tflags %lx\\n", fe.nr_entries);
+    for (i = 0; i < fe.nr_entries; i++) {
+        for (j = 0; j < 3; j++)
+            printf("0x%lx\\t", *(unsigned long *)(buf + (i*3+j)*8));
+        printf("\\n");
+    }
+
+    free(buf);
+
+    ret = close(fd);
+    if (ret < 0)
+        err("close");
+    return 0;
+}
+.fi
+.SH SEE ALSO
+.BR mincore (2),
+.BR fsync (2)
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
  2014-07-07 18:00   ` Naoya Horiguchi
@ 2014-07-07 19:01     ` Dave Hansen
  -1 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-07 19:01 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Christoph Hellwig, Dave Chinner, Michael Kerrisk,
	Linux API, Naoya Horiguchi, Kees Cook

> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:

I agree that we don't have any good mechanisms for looking at the page
cache from userspace.  I've hacked some things up using mincore() and
they weren't pretty, so I welcome _something_ like this.

But, is this trying to do too many things at once?  Do we have solid use
cases spelled out for each of these modes?  Have we thought out how they
will be used in practice?

The biggest question for me, though, is whether we want to start
designing these per-page interfaces to consider different page sizes, or
whether we're going to just continue to pretend that the entire world is
4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
silly, for instance.

> + * - FINCORE_BMAP:
> + *     the page status is returned in a vector of bytes.
> + *     The least significant bit of each byte is 1 if the referenced page
> + *     is in memory, otherwise it is zero.

I know this is consistent with mincore(), but it did always bother me
that mincore() was so sparse.  Seems like it is wasting 7/8 of its bits.

> + * - FINCORE_PGOFF:
> + *     if this flag is set, fincore() doesn't store any information about
> + *     holes. Instead each records per page has the entry of page offset,
> + *     using 8 bytes. This mode is useful if we handle a large file and
> + *     only few pages are on memory.

This bothers me a bit.  How would someone know how sparse file was
before calling this?  If it's not sparse, and they use this, they'll end
up using 8x the memory they would have using FINCORE_BMAP.  If it *is*
sparse, and they use FINCORE_BMAP, they will either waste tons of memory
on buffers, or have to make a ton of calls.

I guess this could also be used to do *searches*, which would let you
search out holes.  Let's say you have a 2TB file.  You could call this
with a buffer size of 1 entry and do searches, say 0->1TB.  If you get
your one entry back, you know it's not completely sparse.

But, that wouldn't work with it as-designed.  The length of the buffer
and the range of the file being checked are coupled together, so you
can't say:

	vec = malloc(sizeof(long));
	fincore(fd, 0, 1TB, FINCORE_PGOFF, vec, extra);

without overflowing vec.

Is it really right to say this is going to be 8 bytes?  Would we want it
to share types with something else, like be an loff_t?

> + * - FINCORE_PFN:
> + *     stores pfn, using 8 bytes.

These are all an unprivileged operations from what I can tell.  I know
we're going to a lot of trouble to hide kernel addresses from being seen
in userspace.  This seems like it would be undesirable for the folks
that care about not leaking kernel addresses, especially for
unprivileged users.

This would essentially tell userspace where in the kernel's address
space some user-controlled data will be.

> + * We can use multiple flags among the flags in FINCORE_LONGENTRY_MASK.
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:

Instead of specifying the ordering in the manpages alone, would it be
smarter to just say that the ordering of the items is dependent on the
ordering of the flags?  In other words if FINCORE_PFN <
FINCORE_PAGEFLAGS, then its field comes first?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-07 19:01     ` Dave Hansen
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-07 19:01 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Christoph Hellwig, Dave Chinner, Michael Kerrisk,
	Linux API, Naoya Horiguchi, Kees Cook

> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:

I agree that we don't have any good mechanisms for looking at the page
cache from userspace.  I've hacked some things up using mincore() and
they weren't pretty, so I welcome _something_ like this.

But, is this trying to do too many things at once?  Do we have solid use
cases spelled out for each of these modes?  Have we thought out how they
will be used in practice?

The biggest question for me, though, is whether we want to start
designing these per-page interfaces to consider different page sizes, or
whether we're going to just continue to pretend that the entire world is
4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
silly, for instance.

> + * - FINCORE_BMAP:
> + *     the page status is returned in a vector of bytes.
> + *     The least significant bit of each byte is 1 if the referenced page
> + *     is in memory, otherwise it is zero.

I know this is consistent with mincore(), but it did always bother me
that mincore() was so sparse.  Seems like it is wasting 7/8 of its bits.

> + * - FINCORE_PGOFF:
> + *     if this flag is set, fincore() doesn't store any information about
> + *     holes. Instead each records per page has the entry of page offset,
> + *     using 8 bytes. This mode is useful if we handle a large file and
> + *     only few pages are on memory.

This bothers me a bit.  How would someone know how sparse file was
before calling this?  If it's not sparse, and they use this, they'll end
up using 8x the memory they would have using FINCORE_BMAP.  If it *is*
sparse, and they use FINCORE_BMAP, they will either waste tons of memory
on buffers, or have to make a ton of calls.

I guess this could also be used to do *searches*, which would let you
search out holes.  Let's say you have a 2TB file.  You could call this
with a buffer size of 1 entry and do searches, say 0->1TB.  If you get
your one entry back, you know it's not completely sparse.

But, that wouldn't work with it as-designed.  The length of the buffer
and the range of the file being checked are coupled together, so you
can't say:

	vec = malloc(sizeof(long));
	fincore(fd, 0, 1TB, FINCORE_PGOFF, vec, extra);

without overflowing vec.

Is it really right to say this is going to be 8 bytes?  Would we want it
to share types with something else, like be an loff_t?

> + * - FINCORE_PFN:
> + *     stores pfn, using 8 bytes.

These are all an unprivileged operations from what I can tell.  I know
we're going to a lot of trouble to hide kernel addresses from being seen
in userspace.  This seems like it would be undesirable for the folks
that care about not leaking kernel addresses, especially for
unprivileged users.

This would essentially tell userspace where in the kernel's address
space some user-controlled data will be.

> + * We can use multiple flags among the flags in FINCORE_LONGENTRY_MASK.
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:

Instead of specifying the ordering in the manpages alone, would it be
smarter to just say that the ordering of the items is dependent on the
ordering of the flags?  In other words if FINCORE_PFN <
FINCORE_PAGEFLAGS, then its field comes first?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2)
@ 2014-07-07 19:08     ` Dave Hansen
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-07 19:08 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Christoph Hellwig, Dave Chinner, Michael Kerrisk,
	Linux API, Naoya Horiguchi

On 07/07/2014 11:00 AM, Naoya Horiguchi wrote:
> +.SH RETURN VALUE
> +On success,
> +.BR fincore ()
> +returns 0.
> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.

Is this accurate?  From reading the syscall itself, it looked like it
did this:

> + * Return value is the number of pages whose data is stored in fc->buffer.
> + */
> +static long do_fincore(struct fincore_control *fc, int nr_pages)

and:

> +SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
...
> +	while (fc.nr_pages > 0) {
> +		memset(fc.buffer, 0, fc.buffer_size);
> +		ret = do_fincore(&fc, min(step, fc.nr_pages));
> +		/* Reached the end of the file */
> +		if (ret == 0)
> +			break;
> +		if (ret < 0)
> +			break;
...
> +	}
...
> +	return ret;
> +}

Which seems that for a given loop of do_fincore(), you might end up
returning the result of that *single* iteration of do_fincore() instead
of the aggregate of the entire syscall.

So, it can return <0 on failure, 0 on success, or also an essentially
random >0 number on success too.

Why not just use the return value for something useful instead of
hacking in the extras->nr_entries stuff?  Oh, and what if that

> +	if (extra)
> +		__put_user(nr, &extra->nr_entries);

fails?  It seems like we might silently forget to tell userspace how
many entries we filled.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2)
@ 2014-07-07 19:08     ` Dave Hansen
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-07 19:08 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On 07/07/2014 11:00 AM, Naoya Horiguchi wrote:
> +.SH RETURN VALUE
> +On success,
> +.BR fincore ()
> +returns 0.
> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.

Is this accurate?  From reading the syscall itself, it looked like it
did this:

> + * Return value is the number of pages whose data is stored in fc->buffer.
> + */
> +static long do_fincore(struct fincore_control *fc, int nr_pages)

and:

> +SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
...
> +	while (fc.nr_pages > 0) {
> +		memset(fc.buffer, 0, fc.buffer_size);
> +		ret = do_fincore(&fc, min(step, fc.nr_pages));
> +		/* Reached the end of the file */
> +		if (ret == 0)
> +			break;
> +		if (ret < 0)
> +			break;
...
> +	}
...
> +	return ret;
> +}

Which seems that for a given loop of do_fincore(), you might end up
returning the result of that *single* iteration of do_fincore() instead
of the aggregate of the entire syscall.

So, it can return <0 on failure, 0 on success, or also an essentially
random >0 number on success too.

Why not just use the return value for something useful instead of
hacking in the extras->nr_entries stuff?  Oh, and what if that

> +	if (extra)
> +		__put_user(nr, &extra->nr_entries);

fails?  It seems like we might silently forget to tell userspace how
many entries we filled.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2)
@ 2014-07-07 19:08     ` Dave Hansen
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-07 19:08 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: Konstantin Khlebnikov, Wu Fengguang, Arnaldo Carvalho de Melo,
	Borislav Petkov, Kirill A. Shutemov, Johannes Weiner,
	Rusty Russell, David Miller, Andres Freund, linux-kernel,
	linux-mm, Christoph Hellwig, Dave Chinner, Michael Kerrisk,
	Linux API, Naoya Horiguchi

On 07/07/2014 11:00 AM, Naoya Horiguchi wrote:
> +.SH RETURN VALUE
> +On success,
> +.BR fincore ()
> +returns 0.
> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.

Is this accurate?  From reading the syscall itself, it looked like it
did this:

> + * Return value is the number of pages whose data is stored in fc->buffer.
> + */
> +static long do_fincore(struct fincore_control *fc, int nr_pages)

and:

> +SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
...
> +	while (fc.nr_pages > 0) {
> +		memset(fc.buffer, 0, fc.buffer_size);
> +		ret = do_fincore(&fc, min(step, fc.nr_pages));
> +		/* Reached the end of the file */
> +		if (ret == 0)
> +			break;
> +		if (ret < 0)
> +			break;
...
> +	}
...
> +	return ret;
> +}

Which seems that for a given loop of do_fincore(), you might end up
returning the result of that *single* iteration of do_fincore() instead
of the aggregate of the entire syscall.

So, it can return <0 on failure, 0 on success, or also an essentially
random >0 number on success too.

Why not just use the return value for something useful instead of
hacking in the extras->nr_entries stuff?  Oh, and what if that

> +	if (extra)
> +		__put_user(nr, &extra->nr_entries);

fails?  It seems like we might silently forget to tell userspace how
many entries we filled.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
  2014-07-07 19:01     ` Dave Hansen
@ 2014-07-07 20:21       ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 20:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

Hi Dave,

Thank you for the comments.

On Mon, Jul 07, 2014 at 12:01:41PM -0700, Dave Hansen wrote:
> > +/*
> > + * You can control how the buffer in userspace is filled with this mode
> > + * parameters:
> 
> I agree that we don't have any good mechanisms for looking at the page
> cache from userspace.  I've hacked some things up using mincore() and
> they weren't pretty, so I welcome _something_ like this.
> 
> But, is this trying to do too many things at once?  Do we have solid use
> cases spelled out for each of these modes?  Have we thought out how they
> will be used in practice?

tools/vm/page-types.c will be an in-kernel user after this base code is
accepted. The idea of doing fincore() thing comes up during the discussion
with Konstantin over file cache mode of this tool.
pfn and page flag are needed there, so I think it's one clear usecase.

> The biggest question for me, though, is whether we want to start
> designing these per-page interfaces to consider different page sizes, or
> whether we're going to just continue to pretend that the entire world is
> 4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
> silly, for instance.
> 
> > + * - FINCORE_BMAP:
> > + *     the page status is returned in a vector of bytes.
> > + *     The least significant bit of each byte is 1 if the referenced page
> > + *     is in memory, otherwise it is zero.
> 
> I know this is consistent with mincore(), but it did always bother me
> that mincore() was so sparse.  Seems like it is wasting 7/8 of its bits.

Yes, I got the same comment in previous round. So, OK, not a few people
seem to think that space efficiency is more important than the consistency,
so I'm OK to do it.

We have an idea of making fincore() cover the whole mincore()'s feature
by letting fincore() handle /proc/pid/mem. So mincore() will be obsolete,
and no one has to care about consistency beteen mincore and fincore.
That might be another reason justifying the idea above.

> > + * - FINCORE_PGOFF:
> > + *     if this flag is set, fincore() doesn't store any information about
> > + *     holes. Instead each records per page has the entry of page offset,
> > + *     using 8 bytes. This mode is useful if we handle a large file and
> > + *     only few pages are on memory.
> 
> This bothers me a bit.  How would someone know how sparse file was
> before calling this?  If it's not sparse, and they use this, they'll end
> up using 8x the memory they would have using FINCORE_BMAP.  If it *is*
> sparse, and they use FINCORE_BMAP, they will either waste tons of memory
> on buffers, or have to make a ton of calls.

Yes, that's the hard point.
Some new mode (FINCORE_SUM for example) to get how many pages of a file
is in memory might be helpful to choose which mode, although we need 2 calls.

> I guess this could also be used to do *searches*, which would let you
> search out holes.  Let's say you have a 2TB file.  You could call this
> with a buffer size of 1 entry and do searches, say 0->1TB.  If you get
> your one entry back, you know it's not completely sparse.
> 
> But, that wouldn't work with it as-designed.  The length of the buffer
> and the range of the file being checked are coupled together,

This is only correct for !FINCORE_PGOFF.

> so you
> can't say:
> 
> 	vec = malloc(sizeof(long));
> 	fincore(fd, 0, 1TB, FINCORE_PGOFF, vec, extra);
> 
> without overflowing vec.

The 3rd parameter is the number of pages whose data is passed to userspace,
so we expect userspace to set it according to the buffer size.

But yes, I still have a problem. In FINCORE_PGOFF mode we only scan until
the buffer becomes full, but userspace doesn't know at which point the
scan stopped. It can guess the end point from the pgoff of the last buffer,
but it might not be straightforward or well-designed.
And I should describe this behavior more.

> Is it really right to say this is going to be 8 bytes?  Would we want it
> to share types with something else, like be an loff_t?

Could you elaborate it more?

> > + * - FINCORE_PFN:
> > + *     stores pfn, using 8 bytes.
> 
> These are all an unprivileged operations from what I can tell.  I know
> we're going to a lot of trouble to hide kernel addresses from being seen
> in userspace.  This seems like it would be undesirable for the folks
> that care about not leaking kernel addresses, especially for
> unprivileged users.
> 
> This would essentially tell userspace where in the kernel's address
> space some user-controlled data will be.

OK, so this and FINCORE_PAGEFLAGS will be limited for privileged users.

> > + * We can use multiple flags among the flags in FINCORE_LONGENTRY_MASK.
> > + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> > + * information is stored like this:
> 
> Instead of specifying the ordering in the manpages alone, would it be
> smarter to just say that the ordering of the items is dependent on the
> ordering of the flags?  In other words if FINCORE_PFN <
> FINCORE_PAGEFLAGS, then its field comes first?

Ah, right. I should've referred to the ordering here also.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-07 20:21       ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 20:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

Hi Dave,

Thank you for the comments.

On Mon, Jul 07, 2014 at 12:01:41PM -0700, Dave Hansen wrote:
> > +/*
> > + * You can control how the buffer in userspace is filled with this mode
> > + * parameters:
> 
> I agree that we don't have any good mechanisms for looking at the page
> cache from userspace.  I've hacked some things up using mincore() and
> they weren't pretty, so I welcome _something_ like this.
> 
> But, is this trying to do too many things at once?  Do we have solid use
> cases spelled out for each of these modes?  Have we thought out how they
> will be used in practice?

tools/vm/page-types.c will be an in-kernel user after this base code is
accepted. The idea of doing fincore() thing comes up during the discussion
with Konstantin over file cache mode of this tool.
pfn and page flag are needed there, so I think it's one clear usecase.

> The biggest question for me, though, is whether we want to start
> designing these per-page interfaces to consider different page sizes, or
> whether we're going to just continue to pretend that the entire world is
> 4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
> silly, for instance.
> 
> > + * - FINCORE_BMAP:
> > + *     the page status is returned in a vector of bytes.
> > + *     The least significant bit of each byte is 1 if the referenced page
> > + *     is in memory, otherwise it is zero.
> 
> I know this is consistent with mincore(), but it did always bother me
> that mincore() was so sparse.  Seems like it is wasting 7/8 of its bits.

Yes, I got the same comment in previous round. So, OK, not a few people
seem to think that space efficiency is more important than the consistency,
so I'm OK to do it.

We have an idea of making fincore() cover the whole mincore()'s feature
by letting fincore() handle /proc/pid/mem. So mincore() will be obsolete,
and no one has to care about consistency beteen mincore and fincore.
That might be another reason justifying the idea above.

> > + * - FINCORE_PGOFF:
> > + *     if this flag is set, fincore() doesn't store any information about
> > + *     holes. Instead each records per page has the entry of page offset,
> > + *     using 8 bytes. This mode is useful if we handle a large file and
> > + *     only few pages are on memory.
> 
> This bothers me a bit.  How would someone know how sparse file was
> before calling this?  If it's not sparse, and they use this, they'll end
> up using 8x the memory they would have using FINCORE_BMAP.  If it *is*
> sparse, and they use FINCORE_BMAP, they will either waste tons of memory
> on buffers, or have to make a ton of calls.

Yes, that's the hard point.
Some new mode (FINCORE_SUM for example) to get how many pages of a file
is in memory might be helpful to choose which mode, although we need 2 calls.

> I guess this could also be used to do *searches*, which would let you
> search out holes.  Let's say you have a 2TB file.  You could call this
> with a buffer size of 1 entry and do searches, say 0->1TB.  If you get
> your one entry back, you know it's not completely sparse.
> 
> But, that wouldn't work with it as-designed.  The length of the buffer
> and the range of the file being checked are coupled together,

This is only correct for !FINCORE_PGOFF.

> so you
> can't say:
> 
> 	vec = malloc(sizeof(long));
> 	fincore(fd, 0, 1TB, FINCORE_PGOFF, vec, extra);
> 
> without overflowing vec.

The 3rd parameter is the number of pages whose data is passed to userspace,
so we expect userspace to set it according to the buffer size.

But yes, I still have a problem. In FINCORE_PGOFF mode we only scan until
the buffer becomes full, but userspace doesn't know at which point the
scan stopped. It can guess the end point from the pgoff of the last buffer,
but it might not be straightforward or well-designed.
And I should describe this behavior more.

> Is it really right to say this is going to be 8 bytes?  Would we want it
> to share types with something else, like be an loff_t?

Could you elaborate it more?

> > + * - FINCORE_PFN:
> > + *     stores pfn, using 8 bytes.
> 
> These are all an unprivileged operations from what I can tell.  I know
> we're going to a lot of trouble to hide kernel addresses from being seen
> in userspace.  This seems like it would be undesirable for the folks
> that care about not leaking kernel addresses, especially for
> unprivileged users.
> 
> This would essentially tell userspace where in the kernel's address
> space some user-controlled data will be.

OK, so this and FINCORE_PAGEFLAGS will be limited for privileged users.

> > + * We can use multiple flags among the flags in FINCORE_LONGENTRY_MASK.
> > + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> > + * information is stored like this:
> 
> Instead of specifying the ordering in the manpages alone, would it be
> smarter to just say that the ordering of the items is dependent on the
> ordering of the flags?  In other words if FINCORE_PFN <
> FINCORE_PAGEFLAGS, then its field comes first?

Ah, right. I should've referred to the ordering here also.

Thanks,
Naoya Horiguchi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
  2014-07-07 20:21       ` Naoya Horiguchi
@ 2014-07-07 20:43         ` Dave Hansen
  -1 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-07 20:43 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On 07/07/2014 01:21 PM, Naoya Horiguchi wrote:
> On Mon, Jul 07, 2014 at 12:01:41PM -0700, Dave Hansen wrote:
>> But, is this trying to do too many things at once?  Do we have solid use
>> cases spelled out for each of these modes?  Have we thought out how they
>> will be used in practice?
> 
> tools/vm/page-types.c will be an in-kernel user after this base code is
> accepted. The idea of doing fincore() thing comes up during the discussion
> with Konstantin over file cache mode of this tool.
> pfn and page flag are needed there, so I think it's one clear usecase.

I'm going to take that as a no. :)

The whole FINCORE_PGOFF vs. FINCORE_BMAP issue is something that will
come up in practice.  We just don't have the interfaces for an end user
to pick which one they want to use.
>> Is it really right to say this is going to be 8 bytes?  Would we want it
>> to share types with something else, like be an loff_t?
> 
> Could you elaborate it more?

We specify file offsets in other system calls, like the lseek family.  I
was just thinking that this type should match up with those calls since
they are expressing the same data type with the same ranges and limitations.

>>> + * - FINCORE_PFN:
>>> + *     stores pfn, using 8 bytes.
>>
>> These are all an unprivileged operations from what I can tell.  I know
>> we're going to a lot of trouble to hide kernel addresses from being seen
>> in userspace.  This seems like it would be undesirable for the folks
>> that care about not leaking kernel addresses, especially for
>> unprivileged users.
>>
>> This would essentially tell userspace where in the kernel's address
>> space some user-controlled data will be.
> 
> OK, so this and FINCORE_PAGEFLAGS will be limited for privileged users.

Then I'd just question their usefulness outside of a debugging
environment, especially when you can get at them in other (more
roundabout) ways in a debugging environment.

This is really looking to me like two system calls.  The bitmap-based
one, and another more extensible one.  I don't think there's any harm in
having two system calls, especially when they're trying to glue together
two disparate interfaces.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-07 20:43         ` Dave Hansen
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-07 20:43 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On 07/07/2014 01:21 PM, Naoya Horiguchi wrote:
> On Mon, Jul 07, 2014 at 12:01:41PM -0700, Dave Hansen wrote:
>> But, is this trying to do too many things at once?  Do we have solid use
>> cases spelled out for each of these modes?  Have we thought out how they
>> will be used in practice?
> 
> tools/vm/page-types.c will be an in-kernel user after this base code is
> accepted. The idea of doing fincore() thing comes up during the discussion
> with Konstantin over file cache mode of this tool.
> pfn and page flag are needed there, so I think it's one clear usecase.

I'm going to take that as a no. :)

The whole FINCORE_PGOFF vs. FINCORE_BMAP issue is something that will
come up in practice.  We just don't have the interfaces for an end user
to pick which one they want to use.
>> Is it really right to say this is going to be 8 bytes?  Would we want it
>> to share types with something else, like be an loff_t?
> 
> Could you elaborate it more?

We specify file offsets in other system calls, like the lseek family.  I
was just thinking that this type should match up with those calls since
they are expressing the same data type with the same ranges and limitations.

>>> + * - FINCORE_PFN:
>>> + *     stores pfn, using 8 bytes.
>>
>> These are all an unprivileged operations from what I can tell.  I know
>> we're going to a lot of trouble to hide kernel addresses from being seen
>> in userspace.  This seems like it would be undesirable for the folks
>> that care about not leaking kernel addresses, especially for
>> unprivileged users.
>>
>> This would essentially tell userspace where in the kernel's address
>> space some user-controlled data will be.
> 
> OK, so this and FINCORE_PAGEFLAGS will be limited for privileged users.

Then I'd just question their usefulness outside of a debugging
environment, especially when you can get at them in other (more
roundabout) ways in a debugging environment.

This is really looking to me like two system calls.  The bitmap-based
one, and another more extensible one.  I don't think there's any harm in
having two system calls, especially when they're trying to glue together
two disparate interfaces.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2)
  2014-07-07 19:08     ` Dave Hansen
@ 2014-07-07 20:59       ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 20:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Mon, Jul 07, 2014 at 12:08:12PM -0700, Dave Hansen wrote:
> On 07/07/2014 11:00 AM, Naoya Horiguchi wrote:
> > +.SH RETURN VALUE
> > +On success,
> > +.BR fincore ()
> > +returns 0.
> > +On error, \-1 is returned, and
> > +.I errno
> > +is set appropriately.
> 
> Is this accurate?  From reading the syscall itself, it looked like it
> did this:
> 
> > + * Return value is the number of pages whose data is stored in fc->buffer.
> > + */
> > +static long do_fincore(struct fincore_control *fc, int nr_pages)
> 
> and:
> 
> > +SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
> ...
> > +	while (fc.nr_pages > 0) {
> > +		memset(fc.buffer, 0, fc.buffer_size);
> > +		ret = do_fincore(&fc, min(step, fc.nr_pages));
> > +		/* Reached the end of the file */
> > +		if (ret == 0)
> > +			break;
> > +		if (ret < 0)
> > +			break;
> ...
> > +	}
> ...
> > +	return ret;
> > +}
> 
> Which seems that for a given loop of do_fincore(), you might end up
> returning the result of that *single* iteration of do_fincore() instead
> of the aggregate of the entire syscall.
> 
> So, it can return <0 on failure, 0 on success, or also an essentially
> random >0 number on success too.

We don't break this while loop if do_fincore() returned a positive value
unless copy_to_user() fails. And in that case ret is set to -EFAULT.
So I think sys_fincore() never returns a positive value.

BTW, we don't have to check "if (ret == 0)" and "if (ret < 0)" separately,
I'll fix it.

> Why not just use the return value for something useful instead of
> hacking in the extras->nr_entries stuff?

Hmm, I got the opposite complaint previously, where we shouldn't
interpret the return value differently depending on the flag.
And I'd like to keep the extra argument for future extensibility.
For example, if we want to collect pages only with a specific
set of page flags, this extra argument will be necessary.

>  Oh, and what if that
> 
> > +	if (extra)
> > +		__put_user(nr, &extra->nr_entries);
> 
> fails?  It seems like we might silently forget to tell userspace how
> many entries we filled.

Oh, I forget to check it.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2)
@ 2014-07-07 20:59       ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 20:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Mon, Jul 07, 2014 at 12:08:12PM -0700, Dave Hansen wrote:
> On 07/07/2014 11:00 AM, Naoya Horiguchi wrote:
> > +.SH RETURN VALUE
> > +On success,
> > +.BR fincore ()
> > +returns 0.
> > +On error, \-1 is returned, and
> > +.I errno
> > +is set appropriately.
> 
> Is this accurate?  From reading the syscall itself, it looked like it
> did this:
> 
> > + * Return value is the number of pages whose data is stored in fc->buffer.
> > + */
> > +static long do_fincore(struct fincore_control *fc, int nr_pages)
> 
> and:
> 
> > +SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
> ...
> > +	while (fc.nr_pages > 0) {
> > +		memset(fc.buffer, 0, fc.buffer_size);
> > +		ret = do_fincore(&fc, min(step, fc.nr_pages));
> > +		/* Reached the end of the file */
> > +		if (ret == 0)
> > +			break;
> > +		if (ret < 0)
> > +			break;
> ...
> > +	}
> ...
> > +	return ret;
> > +}
> 
> Which seems that for a given loop of do_fincore(), you might end up
> returning the result of that *single* iteration of do_fincore() instead
> of the aggregate of the entire syscall.
> 
> So, it can return <0 on failure, 0 on success, or also an essentially
> random >0 number on success too.

We don't break this while loop if do_fincore() returned a positive value
unless copy_to_user() fails. And in that case ret is set to -EFAULT.
So I think sys_fincore() never returns a positive value.

BTW, we don't have to check "if (ret == 0)" and "if (ret < 0)" separately,
I'll fix it.

> Why not just use the return value for something useful instead of
> hacking in the extras->nr_entries stuff?

Hmm, I got the opposite complaint previously, where we shouldn't
interpret the return value differently depending on the flag.
And I'd like to keep the extra argument for future extensibility.
For example, if we want to collect pages only with a specific
set of page flags, this extra argument will be necessary.

>  Oh, and what if that
> 
> > +	if (extra)
> > +		__put_user(nr, &extra->nr_entries);
> 
> fails?  It seems like we might silently forget to tell userspace how
> many entries we filled.

Oh, I forget to check it.

Thanks,
Naoya Horiguchi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
  2014-07-07 20:43         ` Dave Hansen
@ 2014-07-07 21:48           ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 21:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On Mon, Jul 07, 2014 at 01:43:31PM -0700, Dave Hansen wrote:
> On 07/07/2014 01:21 PM, Naoya Horiguchi wrote:
> > On Mon, Jul 07, 2014 at 12:01:41PM -0700, Dave Hansen wrote:
> >> But, is this trying to do too many things at once?  Do we have solid use
> >> cases spelled out for each of these modes?  Have we thought out how they
> >> will be used in practice?
> > 
> > tools/vm/page-types.c will be an in-kernel user after this base code is
> > accepted. The idea of doing fincore() thing comes up during the discussion
> > with Konstantin over file cache mode of this tool.
> > pfn and page flag are needed there, so I think it's one clear usecase.
> 
> I'm going to take that as a no. :)

As for other usecases, database developers should have some demand for
physical addresses (especially numa node?) or page flags (especially
page reclaim or writeback related ones).
But I'm not a database expert so can't say how, sorry.

> The whole FINCORE_PGOFF vs. FINCORE_BMAP issue is something that will
> come up in practice.  We just don't have the interfaces for an end user
> to pick which one they want to use.
> 
> >> Is it really right to say this is going to be 8 bytes?  Would we want it
> >> to share types with something else, like be an loff_t?
> > 
> > Could you elaborate it more?
> 
> We specify file offsets in other system calls, like the lseek family.  I
> was just thinking that this type should match up with those calls since
> they are expressing the same data type with the same ranges and limitations.

The 2nd parameter is loff_t, do we already do this?

> >>> + * - FINCORE_PFN:
> >>> + *     stores pfn, using 8 bytes.
> >>
> >> These are all an unprivileged operations from what I can tell.  I know
> >> we're going to a lot of trouble to hide kernel addresses from being seen
> >> in userspace.  This seems like it would be undesirable for the folks
> >> that care about not leaking kernel addresses, especially for
> >> unprivileged users.
> >>
> >> This would essentially tell userspace where in the kernel's address
> >> space some user-controlled data will be.
> > 
> > OK, so this and FINCORE_PAGEFLAGS will be limited for privileged users.

Sorry, this statement of mine might a bit short-sighted, and I'd like
to revoke it.
I think that some page flags and/or numa info should be useful outside
the debugging environment, and safe to expose to userspace. So limiting
to bitmap-one for unprivileged users is too strict.

> Then I'd just question their usefulness outside of a debugging
> environment, especially when you can get at them in other (more
> roundabout) ways in a debugging environment.
> 
> This is really looking to me like two system calls.  The bitmap-based
> one, and another more extensible one.  I don't think there's any harm in
> having two system calls, especially when they're trying to glue together
> two disparate interfaces.

I think that if separating syscall into two, one for privileged users
and one for unprivileged users migth be fine (rather than bitmap-based
one and extensible one.)

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-07 21:48           ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-07 21:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On Mon, Jul 07, 2014 at 01:43:31PM -0700, Dave Hansen wrote:
> On 07/07/2014 01:21 PM, Naoya Horiguchi wrote:
> > On Mon, Jul 07, 2014 at 12:01:41PM -0700, Dave Hansen wrote:
> >> But, is this trying to do too many things at once?  Do we have solid use
> >> cases spelled out for each of these modes?  Have we thought out how they
> >> will be used in practice?
> > 
> > tools/vm/page-types.c will be an in-kernel user after this base code is
> > accepted. The idea of doing fincore() thing comes up during the discussion
> > with Konstantin over file cache mode of this tool.
> > pfn and page flag are needed there, so I think it's one clear usecase.
> 
> I'm going to take that as a no. :)

As for other usecases, database developers should have some demand for
physical addresses (especially numa node?) or page flags (especially
page reclaim or writeback related ones).
But I'm not a database expert so can't say how, sorry.

> The whole FINCORE_PGOFF vs. FINCORE_BMAP issue is something that will
> come up in practice.  We just don't have the interfaces for an end user
> to pick which one they want to use.
> 
> >> Is it really right to say this is going to be 8 bytes?  Would we want it
> >> to share types with something else, like be an loff_t?
> > 
> > Could you elaborate it more?
> 
> We specify file offsets in other system calls, like the lseek family.  I
> was just thinking that this type should match up with those calls since
> they are expressing the same data type with the same ranges and limitations.

The 2nd parameter is loff_t, do we already do this?

> >>> + * - FINCORE_PFN:
> >>> + *     stores pfn, using 8 bytes.
> >>
> >> These are all an unprivileged operations from what I can tell.  I know
> >> we're going to a lot of trouble to hide kernel addresses from being seen
> >> in userspace.  This seems like it would be undesirable for the folks
> >> that care about not leaking kernel addresses, especially for
> >> unprivileged users.
> >>
> >> This would essentially tell userspace where in the kernel's address
> >> space some user-controlled data will be.
> > 
> > OK, so this and FINCORE_PAGEFLAGS will be limited for privileged users.

Sorry, this statement of mine might a bit short-sighted, and I'd like
to revoke it.
I think that some page flags and/or numa info should be useful outside
the debugging environment, and safe to expose to userspace. So limiting
to bitmap-one for unprivileged users is too strict.

> Then I'd just question their usefulness outside of a debugging
> environment, especially when you can get at them in other (more
> roundabout) ways in a debugging environment.
> 
> This is really looking to me like two system calls.  The bitmap-based
> one, and another more extensible one.  I don't think there's any harm in
> having two system calls, especially when they're trying to glue together
> two disparate interfaces.

I think that if separating syscall into two, one for privileged users
and one for unprivileged users migth be fine (rather than bitmap-based
one and extensible one.)

Thanks,
Naoya Horiguchi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2)
  2014-07-07 20:59       ` Naoya Horiguchi
@ 2014-07-07 22:34         ` Dave Hansen
  -1 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-07 22:34 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On 07/07/2014 01:59 PM, Naoya Horiguchi wrote:
> On Mon, Jul 07, 2014 at 12:08:12PM -0700, Dave Hansen wrote:
>> On 07/07/2014 11:00 AM, Naoya Horiguchi wrote:
>>> +.SH RETURN VALUE
>>> +On success,
>>> +.BR fincore ()
>>> +returns 0.
>>> +On error, \-1 is returned, and
>>> +.I errno
>>> +is set appropriately.
>>
>> Is this accurate?  From reading the syscall itself, it looked like it
>> did this:
>>
>>> + * Return value is the number of pages whose data is stored in fc->buffer.
>>> + */
>>> +static long do_fincore(struct fincore_control *fc, int nr_pages)
>>
>> and:
>>
>>> +SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
>> ...
>>> +	while (fc.nr_pages > 0) {
>>> +		memset(fc.buffer, 0, fc.buffer_size);
>>> +		ret = do_fincore(&fc, min(step, fc.nr_pages));
>>> +		/* Reached the end of the file */
>>> +		if (ret == 0)
>>> +			break;
>>> +		if (ret < 0)
>>> +			break;
>> ...
>>> +	}
>> ...
>>> +	return ret;
>>> +}
>>
>> Which seems that for a given loop of do_fincore(), you might end up
>> returning the result of that *single* iteration of do_fincore() instead
>> of the aggregate of the entire syscall.
>>
>> So, it can return <0 on failure, 0 on success, or also an essentially
>> random >0 number on success too.
> 
> We don't break this while loop if do_fincore() returned a positive value
> unless copy_to_user() fails. And in that case ret is set to -EFAULT.
> So I think sys_fincore() never returns a positive value.

OK, that makes sense as I'm reading it again.

>> Why not just use the return value for something useful instead of
>> hacking in the extras->nr_entries stuff?
> 
> Hmm, I got the opposite complaint previously, where we shouldn't
> interpret the return value differently depending on the flag.
> And I'd like to keep the extra argument for future extensibility.
> For example, if we want to collect pages only with a specific
> set of page flags, this extra argument will be necessary.

Couldn't it simply be the number of elements that it wrote in to the
buffer, or even the number of bytes?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2)
@ 2014-07-07 22:34         ` Dave Hansen
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-07 22:34 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On 07/07/2014 01:59 PM, Naoya Horiguchi wrote:
> On Mon, Jul 07, 2014 at 12:08:12PM -0700, Dave Hansen wrote:
>> On 07/07/2014 11:00 AM, Naoya Horiguchi wrote:
>>> +.SH RETURN VALUE
>>> +On success,
>>> +.BR fincore ()
>>> +returns 0.
>>> +On error, \-1 is returned, and
>>> +.I errno
>>> +is set appropriately.
>>
>> Is this accurate?  From reading the syscall itself, it looked like it
>> did this:
>>
>>> + * Return value is the number of pages whose data is stored in fc->buffer.
>>> + */
>>> +static long do_fincore(struct fincore_control *fc, int nr_pages)
>>
>> and:
>>
>>> +SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
>> ...
>>> +	while (fc.nr_pages > 0) {
>>> +		memset(fc.buffer, 0, fc.buffer_size);
>>> +		ret = do_fincore(&fc, min(step, fc.nr_pages));
>>> +		/* Reached the end of the file */
>>> +		if (ret == 0)
>>> +			break;
>>> +		if (ret < 0)
>>> +			break;
>> ...
>>> +	}
>> ...
>>> +	return ret;
>>> +}
>>
>> Which seems that for a given loop of do_fincore(), you might end up
>> returning the result of that *single* iteration of do_fincore() instead
>> of the aggregate of the entire syscall.
>>
>> So, it can return <0 on failure, 0 on success, or also an essentially
>> random >0 number on success too.
> 
> We don't break this while loop if do_fincore() returned a positive value
> unless copy_to_user() fails. And in that case ret is set to -EFAULT.
> So I think sys_fincore() never returns a positive value.

OK, that makes sense as I'm reading it again.

>> Why not just use the return value for something useful instead of
>> hacking in the extras->nr_entries stuff?
> 
> Hmm, I got the opposite complaint previously, where we shouldn't
> interpret the return value differently depending on the flag.
> And I'd like to keep the extra argument for future extensibility.
> For example, if we want to collect pages only with a specific
> set of page flags, this extra argument will be necessary.

Couldn't it simply be the number of elements that it wrote in to the
buffer, or even the number of bytes?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
  2014-07-07 21:48           ` Naoya Horiguchi
@ 2014-07-07 22:44             ` Dave Hansen
  -1 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-07 22:44 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On 07/07/2014 02:48 PM, Naoya Horiguchi wrote:
> On Mon, Jul 07, 2014 at 01:43:31PM -0700, Dave Hansen wrote:
>> The whole FINCORE_PGOFF vs. FINCORE_BMAP issue is something that will
>> come up in practice.  We just don't have the interfaces for an end user
>> to pick which one they want to use.
>>
>>>> Is it really right to say this is going to be 8 bytes?  Would we want it
>>>> to share types with something else, like be an loff_t?
>>>
>>> Could you elaborate it more?
>>
>> We specify file offsets in other system calls, like the lseek family.  I
>> was just thinking that this type should match up with those calls since
>> they are expressing the same data type with the same ranges and limitations.
> 
> The 2nd parameter is loff_t, do we already do this?

I mean the fields in the buffer, like:

> +Any of the following flags are to be set to add an 8 byte field in each entry.
> +You can set any of these flags at the same time, although you can't set
> +FINCORE_BMAP combined with these 8 byte field flags.


>>>> This would essentially tell userspace where in the kernel's address
>>>> space some user-controlled data will be.
>>>
>>> OK, so this and FINCORE_PAGEFLAGS will be limited for privileged users.
> 
> Sorry, this statement of mine might a bit short-sighted, and I'd like
> to revoke it.
> I think that some page flags and/or numa info should be useful outside
> the debugging environment, and safe to expose to userspace. So limiting
> to bitmap-one for unprivileged users is too strict.

The PFN is not the same as NUMA information, and the PFN is insufficient
to describe the NUMA node on all systems that Linux supports.

Trying to get NUMA information back out is a good goal, but doing it
with PFNs is a bad idea since they have so many consequences.

I'm also bummed exporting NUMA information was a design goal of these
patches, but they weren't mentioned in any of the patch descriptions.

>> Then I'd just question their usefulness outside of a debugging
>> environment, especially when you can get at them in other (more
>> roundabout) ways in a debugging environment.
>>
>> This is really looking to me like two system calls.  The bitmap-based
>> one, and another more extensible one.  I don't think there's any harm in
>> having two system calls, especially when they're trying to glue together
>> two disparate interfaces.
> 
> I think that if separating syscall into two, one for privileged users
> and one for unprivileged users migth be fine (rather than bitmap-based
> one and extensible one.)

The problem as I see it is shoehorning two interfaces in to the same
syscall.  If there are privileged and unprivileged operations that use
the same _interfaces_ I think they should share a syscall.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-07 22:44             ` Dave Hansen
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-07 22:44 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On 07/07/2014 02:48 PM, Naoya Horiguchi wrote:
> On Mon, Jul 07, 2014 at 01:43:31PM -0700, Dave Hansen wrote:
>> The whole FINCORE_PGOFF vs. FINCORE_BMAP issue is something that will
>> come up in practice.  We just don't have the interfaces for an end user
>> to pick which one they want to use.
>>
>>>> Is it really right to say this is going to be 8 bytes?  Would we want it
>>>> to share types with something else, like be an loff_t?
>>>
>>> Could you elaborate it more?
>>
>> We specify file offsets in other system calls, like the lseek family.  I
>> was just thinking that this type should match up with those calls since
>> they are expressing the same data type with the same ranges and limitations.
> 
> The 2nd parameter is loff_t, do we already do this?

I mean the fields in the buffer, like:

> +Any of the following flags are to be set to add an 8 byte field in each entry.
> +You can set any of these flags at the same time, although you can't set
> +FINCORE_BMAP combined with these 8 byte field flags.


>>>> This would essentially tell userspace where in the kernel's address
>>>> space some user-controlled data will be.
>>>
>>> OK, so this and FINCORE_PAGEFLAGS will be limited for privileged users.
> 
> Sorry, this statement of mine might a bit short-sighted, and I'd like
> to revoke it.
> I think that some page flags and/or numa info should be useful outside
> the debugging environment, and safe to expose to userspace. So limiting
> to bitmap-one for unprivileged users is too strict.

The PFN is not the same as NUMA information, and the PFN is insufficient
to describe the NUMA node on all systems that Linux supports.

Trying to get NUMA information back out is a good goal, but doing it
with PFNs is a bad idea since they have so many consequences.

I'm also bummed exporting NUMA information was a design goal of these
patches, but they weren't mentioned in any of the patch descriptions.

>> Then I'd just question their usefulness outside of a debugging
>> environment, especially when you can get at them in other (more
>> roundabout) ways in a debugging environment.
>>
>> This is really looking to me like two system calls.  The bitmap-based
>> one, and another more extensible one.  I don't think there's any harm in
>> having two system calls, especially when they're trying to glue together
>> two disparate interfaces.
> 
> I think that if separating syscall into two, one for privileged users
> and one for unprivileged users migth be fine (rather than bitmap-based
> one and extensible one.)

The problem as I see it is shoehorning two interfaces in to the same
syscall.  If there are privileged and unprivileged operations that use
the same _interfaces_ I think they should share a syscall.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 0/3] mm: introduce fincore() v3
  2014-07-07 18:00 ` Naoya Horiguchi
@ 2014-07-08 12:16   ` Christoph Hellwig
  -1 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2014-07-08 12:16 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Dave Hansen, Christoph Hellwig,
	Dave Chinner, Michael Kerrisk, Linux API, Naoya Horiguchi

Still a hard NAK for exposing page flags in a syscall ABI.  These are
way to volatile to go into an application interface.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 0/3] mm: introduce fincore() v3
@ 2014-07-08 12:16   ` Christoph Hellwig
  0 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2014-07-08 12:16 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Dave Hansen, Christoph Hellwig,
	Dave Chinner, Michael Kerrisk, Linux API, Naoya Horiguchi

Still a hard NAK for exposing page flags in a syscall ABI.  These are
way to volatile to go into an application interface.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 0/3] mm: introduce fincore() v3
  2014-07-08 12:16   ` Christoph Hellwig
@ 2014-07-08 13:27     ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-08 13:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Dave Hansen, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Tue, Jul 08, 2014 at 05:16:18AM -0700, Christoph Hellwig wrote:
> Still a hard NAK for exposing page flags in a syscall ABI.  These are
> way to volatile to go into an application interface.

Is there any specific reason that exporting via syscall ABI is more
volatile than exporting via procfs as /proc/kpageflags alreadly does?

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 0/3] mm: introduce fincore() v3
@ 2014-07-08 13:27     ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-08 13:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Dave Hansen, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Tue, Jul 08, 2014 at 05:16:18AM -0700, Christoph Hellwig wrote:
> Still a hard NAK for exposing page flags in a syscall ABI.  These are
> way to volatile to go into an application interface.

Is there any specific reason that exporting via syscall ABI is more
volatile than exporting via procfs as /proc/kpageflags alreadly does?

Thanks,
Naoya Horiguchi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
  2014-07-07 22:44             ` Dave Hansen
@ 2014-07-08 15:35               ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-08 15:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On Mon, Jul 07, 2014 at 03:44:22PM -0700, Dave Hansen wrote:
> On 07/07/2014 02:48 PM, Naoya Horiguchi wrote:
> > On Mon, Jul 07, 2014 at 01:43:31PM -0700, Dave Hansen wrote:
> >> The whole FINCORE_PGOFF vs. FINCORE_BMAP issue is something that will
> >> come up in practice.  We just don't have the interfaces for an end user
> >> to pick which one they want to use.
> >>
> >>>> Is it really right to say this is going to be 8 bytes?  Would we want it
> >>>> to share types with something else, like be an loff_t?
> >>>
> >>> Could you elaborate it more?
> >>
> >> We specify file offsets in other system calls, like the lseek family.  I
> >> was just thinking that this type should match up with those calls since
> >> they are expressing the same data type with the same ranges and limitations.
> > 
> > The 2nd parameter is loff_t, do we already do this?
> 
> I mean the fields in the buffer, like:
> 
> > +Any of the following flags are to be set to add an 8 byte field in each entry.
> > +You can set any of these flags at the same time, although you can't set
> > +FINCORE_BMAP combined with these 8 byte field flags.

Thanks. And OK, we can make it depending on arch or config
(although in currnet version only x86_64 is supported.)

> 
> >>>> This would essentially tell userspace where in the kernel's address
> >>>> space some user-controlled data will be.
> >>>
> >>> OK, so this and FINCORE_PAGEFLAGS will be limited for privileged users.
> > 
> > Sorry, this statement of mine might a bit short-sighted, and I'd like
> > to revoke it.
> > I think that some page flags and/or numa info should be useful outside
> > the debugging environment, and safe to expose to userspace. So limiting
> > to bitmap-one for unprivileged users is too strict.
> 
> The PFN is not the same as NUMA information, and the PFN is insufficient
> to describe the NUMA node on all systems that Linux supports.

Agree.

> Trying to get NUMA information back out is a good goal, but doing it
> with PFNs is a bad idea since they have so many consequences.

Yes, so a separate field for NUMA node is helpful. PFN is purely for
debugging.

> I'm also bummed exporting NUMA information was a design goal of these
> patches, but they weren't mentioned in any of the patch descriptions.

OK, I'll add it with some documentation in the next post.

> >> Then I'd just question their usefulness outside of a debugging
> >> environment, especially when you can get at them in other (more
> >> roundabout) ways in a debugging environment.
> >>
> >> This is really looking to me like two system calls.  The bitmap-based
> >> one, and another more extensible one.  I don't think there's any harm in
> >> having two system calls, especially when they're trying to glue together
> >> two disparate interfaces.
> > 
> > I think that if separating syscall into two, one for privileged users
> > and one for unprivileged users migth be fine (rather than bitmap-based
> > one and extensible one.)
> 
> The problem as I see it is shoehorning two interfaces in to the same
> syscall.  If there are privileged and unprivileged operations that use
> the same _interfaces_ I think they should share a syscall.

Hmm, if we think that bitmap one and extensible one are using different
interfaces, should we also consider that different modes in extensible
one are using different interfaces (whose entry per page is variable in
length)?
It seems to me just a problem about how differently we use the user buffer,
rather than about different interfaces.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-08 15:35               ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-08 15:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On Mon, Jul 07, 2014 at 03:44:22PM -0700, Dave Hansen wrote:
> On 07/07/2014 02:48 PM, Naoya Horiguchi wrote:
> > On Mon, Jul 07, 2014 at 01:43:31PM -0700, Dave Hansen wrote:
> >> The whole FINCORE_PGOFF vs. FINCORE_BMAP issue is something that will
> >> come up in practice.  We just don't have the interfaces for an end user
> >> to pick which one they want to use.
> >>
> >>>> Is it really right to say this is going to be 8 bytes?  Would we want it
> >>>> to share types with something else, like be an loff_t?
> >>>
> >>> Could you elaborate it more?
> >>
> >> We specify file offsets in other system calls, like the lseek family.  I
> >> was just thinking that this type should match up with those calls since
> >> they are expressing the same data type with the same ranges and limitations.
> > 
> > The 2nd parameter is loff_t, do we already do this?
> 
> I mean the fields in the buffer, like:
> 
> > +Any of the following flags are to be set to add an 8 byte field in each entry.
> > +You can set any of these flags at the same time, although you can't set
> > +FINCORE_BMAP combined with these 8 byte field flags.

Thanks. And OK, we can make it depending on arch or config
(although in currnet version only x86_64 is supported.)

> 
> >>>> This would essentially tell userspace where in the kernel's address
> >>>> space some user-controlled data will be.
> >>>
> >>> OK, so this and FINCORE_PAGEFLAGS will be limited for privileged users.
> > 
> > Sorry, this statement of mine might a bit short-sighted, and I'd like
> > to revoke it.
> > I think that some page flags and/or numa info should be useful outside
> > the debugging environment, and safe to expose to userspace. So limiting
> > to bitmap-one for unprivileged users is too strict.
> 
> The PFN is not the same as NUMA information, and the PFN is insufficient
> to describe the NUMA node on all systems that Linux supports.

Agree.

> Trying to get NUMA information back out is a good goal, but doing it
> with PFNs is a bad idea since they have so many consequences.

Yes, so a separate field for NUMA node is helpful. PFN is purely for
debugging.

> I'm also bummed exporting NUMA information was a design goal of these
> patches, but they weren't mentioned in any of the patch descriptions.

OK, I'll add it with some documentation in the next post.

> >> Then I'd just question their usefulness outside of a debugging
> >> environment, especially when you can get at them in other (more
> >> roundabout) ways in a debugging environment.
> >>
> >> This is really looking to me like two system calls.  The bitmap-based
> >> one, and another more extensible one.  I don't think there's any harm in
> >> having two system calls, especially when they're trying to glue together
> >> two disparate interfaces.
> > 
> > I think that if separating syscall into two, one for privileged users
> > and one for unprivileged users migth be fine (rather than bitmap-based
> > one and extensible one.)
> 
> The problem as I see it is shoehorning two interfaces in to the same
> syscall.  If there are privileged and unprivileged operations that use
> the same _interfaces_ I think they should share a syscall.

Hmm, if we think that bitmap one and extensible one are using different
interfaces, should we also consider that different modes in extensible
one are using different interfaces (whose entry per page is variable in
length)?
It seems to me just a problem about how differently we use the user buffer,
rather than about different interfaces.

Thanks,
Naoya Horiguchi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2)
  2014-07-07 22:34         ` Dave Hansen
@ 2014-07-08 15:43           ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-08 15:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Mon, Jul 07, 2014 at 03:34:06PM -0700, Dave Hansen wrote:
> On 07/07/2014 01:59 PM, Naoya Horiguchi wrote:
> > On Mon, Jul 07, 2014 at 12:08:12PM -0700, Dave Hansen wrote:
> >> On 07/07/2014 11:00 AM, Naoya Horiguchi wrote:
> >>> +.SH RETURN VALUE
> >>> +On success,
> >>> +.BR fincore ()
> >>> +returns 0.
> >>> +On error, \-1 is returned, and
> >>> +.I errno
> >>> +is set appropriately.
> >>
> >> Is this accurate?  From reading the syscall itself, it looked like it
> >> did this:
> >>
> >>> + * Return value is the number of pages whose data is stored in fc->buffer.
> >>> + */
> >>> +static long do_fincore(struct fincore_control *fc, int nr_pages)
> >>
> >> and:
> >>
> >>> +SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
> >> ...
> >>> +	while (fc.nr_pages > 0) {
> >>> +		memset(fc.buffer, 0, fc.buffer_size);
> >>> +		ret = do_fincore(&fc, min(step, fc.nr_pages));
> >>> +		/* Reached the end of the file */
> >>> +		if (ret == 0)
> >>> +			break;
> >>> +		if (ret < 0)
> >>> +			break;
> >> ...
> >>> +	}
> >> ...
> >>> +	return ret;
> >>> +}
> >>
> >> Which seems that for a given loop of do_fincore(), you might end up
> >> returning the result of that *single* iteration of do_fincore() instead
> >> of the aggregate of the entire syscall.
> >>
> >> So, it can return <0 on failure, 0 on success, or also an essentially
> >> random >0 number on success too.
> > 
> > We don't break this while loop if do_fincore() returned a positive value
> > unless copy_to_user() fails. And in that case ret is set to -EFAULT.
> > So I think sys_fincore() never returns a positive value.
> 
> OK, that makes sense as I'm reading it again.
> 
> >> Why not just use the return value for something useful instead of
> >> hacking in the extras->nr_entries stuff?
> > 
> > Hmm, I got the opposite complaint previously, where we shouldn't
> > interpret the return value differently depending on the flag.
> > And I'd like to keep the extra argument for future extensibility.
> > For example, if we want to collect pages only with a specific
> > set of page flags, this extra argument will be necessary.
> 
> Couldn't it simply be the number of elements that it wrote in to the
> buffer, or even the number of bytes?

Yes, returning the number of elements looks clearer to me.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2)
@ 2014-07-08 15:43           ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-08 15:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Mon, Jul 07, 2014 at 03:34:06PM -0700, Dave Hansen wrote:
> On 07/07/2014 01:59 PM, Naoya Horiguchi wrote:
> > On Mon, Jul 07, 2014 at 12:08:12PM -0700, Dave Hansen wrote:
> >> On 07/07/2014 11:00 AM, Naoya Horiguchi wrote:
> >>> +.SH RETURN VALUE
> >>> +On success,
> >>> +.BR fincore ()
> >>> +returns 0.
> >>> +On error, \-1 is returned, and
> >>> +.I errno
> >>> +is set appropriately.
> >>
> >> Is this accurate?  From reading the syscall itself, it looked like it
> >> did this:
> >>
> >>> + * Return value is the number of pages whose data is stored in fc->buffer.
> >>> + */
> >>> +static long do_fincore(struct fincore_control *fc, int nr_pages)
> >>
> >> and:
> >>
> >>> +SYSCALL_DEFINE6(fincore, int, fd, loff_t, start, long, nr_pages,
> >> ...
> >>> +	while (fc.nr_pages > 0) {
> >>> +		memset(fc.buffer, 0, fc.buffer_size);
> >>> +		ret = do_fincore(&fc, min(step, fc.nr_pages));
> >>> +		/* Reached the end of the file */
> >>> +		if (ret == 0)
> >>> +			break;
> >>> +		if (ret < 0)
> >>> +			break;
> >> ...
> >>> +	}
> >> ...
> >>> +	return ret;
> >>> +}
> >>
> >> Which seems that for a given loop of do_fincore(), you might end up
> >> returning the result of that *single* iteration of do_fincore() instead
> >> of the aggregate of the entire syscall.
> >>
> >> So, it can return <0 on failure, 0 on success, or also an essentially
> >> random >0 number on success too.
> > 
> > We don't break this while loop if do_fincore() returned a positive value
> > unless copy_to_user() fails. And in that case ret is set to -EFAULT.
> > So I think sys_fincore() never returns a positive value.
> 
> OK, that makes sense as I'm reading it again.
> 
> >> Why not just use the return value for something useful instead of
> >> hacking in the extras->nr_entries stuff?
> > 
> > Hmm, I got the opposite complaint previously, where we shouldn't
> > interpret the return value differently depending on the flag.
> > And I'd like to keep the extra argument for future extensibility.
> > For example, if we want to collect pages only with a specific
> > set of page flags, this extra argument will be necessary.
> 
> Couldn't it simply be the number of elements that it wrote in to the
> buffer, or even the number of bytes?

Yes, returning the number of elements looks clearer to me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
  2014-07-07 19:01     ` Dave Hansen
@ 2014-07-08 19:03       ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-08 19:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On Mon, Jul 07, 2014 at 12:01:41PM -0700, Dave Hansen wrote:
> > +/*
> > + * You can control how the buffer in userspace is filled with this mode
> > + * parameters:
> 
> I agree that we don't have any good mechanisms for looking at the page
> cache from userspace.  I've hacked some things up using mincore() and
> they weren't pretty, so I welcome _something_ like this.
> 
> But, is this trying to do too many things at once?  Do we have solid use
> cases spelled out for each of these modes?  Have we thought out how they
> will be used in practice?
> 
> The biggest question for me, though, is whether we want to start
> designing these per-page interfaces to consider different page sizes, or
> whether we're going to just continue to pretend that the entire world is
> 4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
> silly, for instance.

I didn't answer this question, sorry.

In my option, hugetlbfs pages should be handled as one hugepage (not as
many 4kB pages) to avoid lots of meaningless data transfer, as you pointed
out. And the current patch already works like that.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-08 19:03       ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-08 19:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On Mon, Jul 07, 2014 at 12:01:41PM -0700, Dave Hansen wrote:
> > +/*
> > + * You can control how the buffer in userspace is filled with this mode
> > + * parameters:
> 
> I agree that we don't have any good mechanisms for looking at the page
> cache from userspace.  I've hacked some things up using mincore() and
> they weren't pretty, so I welcome _something_ like this.
> 
> But, is this trying to do too many things at once?  Do we have solid use
> cases spelled out for each of these modes?  Have we thought out how they
> will be used in practice?
> 
> The biggest question for me, though, is whether we want to start
> designing these per-page interfaces to consider different page sizes, or
> whether we're going to just continue to pretend that the entire world is
> 4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
> silly, for instance.

I didn't answer this question, sorry.

In my option, hugetlbfs pages should be handled as one hugepage (not as
many 4kB pages) to avoid lots of meaningless data transfer, as you pointed
out. And the current patch already works like that.

Thanks,
Naoya Horiguchi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
  2014-07-08 19:03       ` Naoya Horiguchi
@ 2014-07-08 19:42         ` Dave Hansen
  -1 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-08 19:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On 07/08/2014 12:03 PM, Naoya Horiguchi wrote:
>> > The biggest question for me, though, is whether we want to start
>> > designing these per-page interfaces to consider different page sizes, or
>> > whether we're going to just continue to pretend that the entire world is
>> > 4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
>> > silly, for instance.
> I didn't answer this question, sorry.
> 
> In my option, hugetlbfs pages should be handled as one hugepage (not as
> many 4kB pages) to avoid lots of meaningless data transfer, as you pointed
> out. And the current patch already works like that.

Just reading the code, I don't see any way that pc_shift gets passed
down in to the do_fincore() loop.  I don't see it getting reflected in
to 'nr' or 'nr_pages' in there, and I can't see how:

	jump = iter.index - fc->pgstart - nr;

can possibly be right since iter.index is being kept against the offset
in the userspace buffer (4k pages) and 'nr' and fc->pgstart are
essentially done in the huge page size.

If you had a 2-page 1GB-hpage_size() hugetlbfs file, you would only have
two pages in the radix tree, and only two iterations of
radix_tree_for_each_slot().  It would only set the first two bytes of a
256k BMAP buffer since only two pages were encountered in the radix tree.

Or am I reading your code wrong again?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-08 19:42         ` Dave Hansen
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-08 19:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On 07/08/2014 12:03 PM, Naoya Horiguchi wrote:
>> > The biggest question for me, though, is whether we want to start
>> > designing these per-page interfaces to consider different page sizes, or
>> > whether we're going to just continue to pretend that the entire world is
>> > 4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
>> > silly, for instance.
> I didn't answer this question, sorry.
> 
> In my option, hugetlbfs pages should be handled as one hugepage (not as
> many 4kB pages) to avoid lots of meaningless data transfer, as you pointed
> out. And the current patch already works like that.

Just reading the code, I don't see any way that pc_shift gets passed
down in to the do_fincore() loop.  I don't see it getting reflected in
to 'nr' or 'nr_pages' in there, and I can't see how:

	jump = iter.index - fc->pgstart - nr;

can possibly be right since iter.index is being kept against the offset
in the userspace buffer (4k pages) and 'nr' and fc->pgstart are
essentially done in the huge page size.

If you had a 2-page 1GB-hpage_size() hugetlbfs file, you would only have
two pages in the radix tree, and only two iterations of
radix_tree_for_each_slot().  It would only set the first two bytes of a
256k BMAP buffer since only two pages were encountered in the radix tree.

Or am I reading your code wrong again?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
  2014-07-08 19:42         ` Dave Hansen
@ 2014-07-08 20:41           ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-08 20:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On Tue, Jul 08, 2014 at 12:42:58PM -0700, Dave Hansen wrote:
> On 07/08/2014 12:03 PM, Naoya Horiguchi wrote:
> >> > The biggest question for me, though, is whether we want to start
> >> > designing these per-page interfaces to consider different page sizes, or
> >> > whether we're going to just continue to pretend that the entire world is
> >> > 4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
> >> > silly, for instance.
> > I didn't answer this question, sorry.
> > 
> > In my option, hugetlbfs pages should be handled as one hugepage (not as
> > many 4kB pages) to avoid lots of meaningless data transfer, as you pointed
> > out. And the current patch already works like that.
> 
> Just reading the code, I don't see any way that pc_shift gets passed
> down in to the do_fincore() loop.

No need to pass it down because operations over page cache tree use
page index internally to identify the in-file position and doesn't care
about page size. In 2MB hugetlbfs file, for example, index 1 means
byte offset 2MB (not offset 4kB.) So radix_tree_for_each_slot() runs
iter.index like 0 -> 1 -> 2 ... (instead of 0 -> 512 -> 1024 ...)

>  I don't see it getting reflected in
> to 'nr' or 'nr_pages' in there, and I can't see how:
> 
> 	jump = iter.index - fc->pgstart - nr;
> 
> can possibly be right since iter.index is being kept against the offset
> in the userspace buffer (4k pages) and 'nr' and fc->pgstart are
> essentially done in the huge page size.

... so all of iter.index, fc->pgstart, and nr is the same unit,
index (in hugepage size.) 
This is a pure index calculation, and do_fincore() is exactly the same
between 4kB pages and hugetlbfs pages.

> If you had a 2-page 1GB-hpage_size() hugetlbfs file, you would only have
> two pages in the radix tree, and only two iterations of
> radix_tree_for_each_slot().

Correct.

>  It would only set the first two bytes of a
> 256k BMAP buffer since only two pages were encountered in the radix tree.

Hmm, this example shows me a problem, thanks.

If the user knows the fd is for 1GB hugetlbfs file, it just prepares
the 2 bytes buffer, so no problem.
But if the user doesn't know whether the fd is from hugetlbfs file,
the user must prepare the large buffer, though only first few bytes
are used. And the more problematic is that the user could interpret
the data in buffer differently:
  1. only the first two 4kB-pages are loaded in the 2GB range,
  2. two 1GB-pages are loaded.
So for such callers, fincore() must notify the relevant page size
in some way on return.
Returning it via fincore_extra is my first thought but I'm not sure
if it's elegant enough.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-08 20:41           ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-08 20:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi, Kees Cook

On Tue, Jul 08, 2014 at 12:42:58PM -0700, Dave Hansen wrote:
> On 07/08/2014 12:03 PM, Naoya Horiguchi wrote:
> >> > The biggest question for me, though, is whether we want to start
> >> > designing these per-page interfaces to consider different page sizes, or
> >> > whether we're going to just continue to pretend that the entire world is
> >> > 4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
> >> > silly, for instance.
> > I didn't answer this question, sorry.
> > 
> > In my option, hugetlbfs pages should be handled as one hugepage (not as
> > many 4kB pages) to avoid lots of meaningless data transfer, as you pointed
> > out. And the current patch already works like that.
> 
> Just reading the code, I don't see any way that pc_shift gets passed
> down in to the do_fincore() loop.

No need to pass it down because operations over page cache tree use
page index internally to identify the in-file position and doesn't care
about page size. In 2MB hugetlbfs file, for example, index 1 means
byte offset 2MB (not offset 4kB.) So radix_tree_for_each_slot() runs
iter.index like 0 -> 1 -> 2 ... (instead of 0 -> 512 -> 1024 ...)

>  I don't see it getting reflected in
> to 'nr' or 'nr_pages' in there, and I can't see how:
> 
> 	jump = iter.index - fc->pgstart - nr;
> 
> can possibly be right since iter.index is being kept against the offset
> in the userspace buffer (4k pages) and 'nr' and fc->pgstart are
> essentially done in the huge page size.

... so all of iter.index, fc->pgstart, and nr is the same unit,
index (in hugepage size.) 
This is a pure index calculation, and do_fincore() is exactly the same
between 4kB pages and hugetlbfs pages.

> If you had a 2-page 1GB-hpage_size() hugetlbfs file, you would only have
> two pages in the radix tree, and only two iterations of
> radix_tree_for_each_slot().

Correct.

>  It would only set the first two bytes of a
> 256k BMAP buffer since only two pages were encountered in the radix tree.

Hmm, this example shows me a problem, thanks.

If the user knows the fd is for 1GB hugetlbfs file, it just prepares
the 2 bytes buffer, so no problem.
But if the user doesn't know whether the fd is from hugetlbfs file,
the user must prepare the large buffer, though only first few bytes
are used. And the more problematic is that the user could interpret
the data in buffer differently:
  1. only the first two 4kB-pages are loaded in the 2GB range,
  2. two 1GB-pages are loaded.
So for such callers, fincore() must notify the relevant page size
in some way on return.
Returning it via fincore_extra is my first thought but I'm not sure
if it's elegant enough.

Thanks,
Naoya Horiguchi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
  2014-07-08 20:41           ` Naoya Horiguchi
@ 2014-07-08 22:32             ` Dave Hansen
  -1 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-08 22:32 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On 07/08/2014 01:41 PM, Naoya Horiguchi wrote:
>> >  It would only set the first two bytes of a
>> > 256k BMAP buffer since only two pages were encountered in the radix tree.
> Hmm, this example shows me a problem, thanks.
> 
> If the user knows the fd is for 1GB hugetlbfs file, it just prepares
> the 2 bytes buffer, so no problem.
> But if the user doesn't know whether the fd is from hugetlbfs file,
> the user must prepare the large buffer, though only first few bytes
> are used. And the more problematic is that the user could interpret
> the data in buffer differently:
>   1. only the first two 4kB-pages are loaded in the 2GB range,
>   2. two 1GB-pages are loaded.
> So for such callers, fincore() must notify the relevant page size
> in some way on return.
> Returning it via fincore_extra is my first thought but I'm not sure
> if it's elegant enough.

That does limit the interface to being used on a single page size per
call, which doesn't sound too bad since we don't mix page sizes in a
single file.  But, you mentioned using this interface along with
/proc/$pid/mem.  How would this deal with a process which had two sizes
of pages mapped?

Another option would be to have userspace pass in its desired
granularity.  Such an interface could be used to find holes in a file
fairly easily.  But, introduces a whole new set of issues, like what
BMAP means if only a part of the granule is in-core, and do you need a
new option to differentiate BMAP_AND vs. BMAP_OR operations.

I honestly think we need to take a step back and enumerate what you're
trying to do here before going any further.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-08 22:32             ` Dave Hansen
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2014-07-08 22:32 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On 07/08/2014 01:41 PM, Naoya Horiguchi wrote:
>> >  It would only set the first two bytes of a
>> > 256k BMAP buffer since only two pages were encountered in the radix tree.
> Hmm, this example shows me a problem, thanks.
> 
> If the user knows the fd is for 1GB hugetlbfs file, it just prepares
> the 2 bytes buffer, so no problem.
> But if the user doesn't know whether the fd is from hugetlbfs file,
> the user must prepare the large buffer, though only first few bytes
> are used. And the more problematic is that the user could interpret
> the data in buffer differently:
>   1. only the first two 4kB-pages are loaded in the 2GB range,
>   2. two 1GB-pages are loaded.
> So for such callers, fincore() must notify the relevant page size
> in some way on return.
> Returning it via fincore_extra is my first thought but I'm not sure
> if it's elegant enough.

That does limit the interface to being used on a single page size per
call, which doesn't sound too bad since we don't mix page sizes in a
single file.  But, you mentioned using this interface along with
/proc/$pid/mem.  How would this deal with a process which had two sizes
of pages mapped?

Another option would be to have userspace pass in its desired
granularity.  Such an interface could be used to find holes in a file
fairly easily.  But, introduces a whole new set of issues, like what
BMAP means if only a part of the granule is in-core, and do you need a
new option to differentiate BMAP_AND vs. BMAP_OR operations.

I honestly think we need to take a step back and enumerate what you're
trying to do here before going any further.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 0/3] mm: introduce fincore() v3
  2014-07-08 13:27     ` Naoya Horiguchi
@ 2014-07-09  8:51       ` Christoph Hellwig
  -1 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2014-07-09  8:51 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Christoph Hellwig, Andrew Morton, Konstantin Khlebnikov,
	Wu Fengguang, Arnaldo Carvalho de Melo, Borislav Petkov,
	Kirill A. Shutemov, Johannes Weiner, Rusty Russell, David Miller,
	Andres Freund, linux-kernel, linux-mm, Dave Hansen, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Tue, Jul 08, 2014 at 09:27:55AM -0400, Naoya Horiguchi wrote:
> On Tue, Jul 08, 2014 at 05:16:18AM -0700, Christoph Hellwig wrote:
> > Still a hard NAK for exposing page flags in a syscall ABI.  These are
> > way to volatile to go into an application interface.
> 
> Is there any specific reason that exporting via syscall ABI is more
> volatile than exporting via procfs as /proc/kpageflags alreadly does?

An optional proc debug output is very different from an actual system
call.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 0/3] mm: introduce fincore() v3
@ 2014-07-09  8:51       ` Christoph Hellwig
  0 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2014-07-09  8:51 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Christoph Hellwig, Andrew Morton, Konstantin Khlebnikov,
	Wu Fengguang, Arnaldo Carvalho de Melo, Borislav Petkov,
	Kirill A. Shutemov, Johannes Weiner, Rusty Russell, David Miller,
	Andres Freund, linux-kernel, linux-mm, Dave Hansen, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Tue, Jul 08, 2014 at 09:27:55AM -0400, Naoya Horiguchi wrote:
> On Tue, Jul 08, 2014 at 05:16:18AM -0700, Christoph Hellwig wrote:
> > Still a hard NAK for exposing page flags in a syscall ABI.  These are
> > way to volatile to go into an application interface.
> 
> Is there any specific reason that exporting via syscall ABI is more
> volatile than exporting via procfs as /proc/kpageflags alreadly does?

An optional proc debug output is very different from an actual system
call.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
  2014-07-08 22:32             ` Dave Hansen
@ 2014-07-11 16:53               ` Naoya Horiguchi
  -1 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-11 16:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Tue, Jul 08, 2014 at 03:32:30PM -0700, Dave Hansen wrote:
> On 07/08/2014 01:41 PM, Naoya Horiguchi wrote:
> >> >  It would only set the first two bytes of a
> >> > 256k BMAP buffer since only two pages were encountered in the radix tree.
> > Hmm, this example shows me a problem, thanks.
> > 
> > If the user knows the fd is for 1GB hugetlbfs file, it just prepares
> > the 2 bytes buffer, so no problem.
> > But if the user doesn't know whether the fd is from hugetlbfs file,
> > the user must prepare the large buffer, though only first few bytes
> > are used. And the more problematic is that the user could interpret
> > the data in buffer differently:
> >   1. only the first two 4kB-pages are loaded in the 2GB range,
> >   2. two 1GB-pages are loaded.
> > So for such callers, fincore() must notify the relevant page size
> > in some way on return.
> > Returning it via fincore_extra is my first thought but I'm not sure
> > if it's elegant enough.
> 
> That does limit the interface to being used on a single page size per
> call, which doesn't sound too bad since we don't mix page sizes in a
> single file.  But, you mentioned using this interface along with
> /proc/$pid/mem.  How would this deal with a process which had two sizes
> of pages mapped?

Hmm, we should handle everything (including hugetlbfs) in 4kB page in
BMAP mode, because the position of the data in user buffer has the meaning.
And maybe it should be the case basically for in extensible modes, but
only in FINCORE_PGOFF mode (where no data of holes is passed to userspace,
and per-page entry contains offset information, so the in-buffer position
doesn't mean anything,) we can skip tail pages in the natural manner.

In this approach, fincore(FINCORE_PGOFF) returns not only pgoff, but also
page order (encoded in highest bits in pgoff field?). The callers must be
prepared to handle different page sizes.

So if users want to avoid lots of tail data for hugetlbfs pages,
using FINCORE_PGOFF mode is recommended for them.

That allows us to handle regular files, hugetlbfs files and /proc/$pid/mem
consistently.

> Another option would be to have userspace pass in its desired
> granularity.  Such an interface could be used to find holes in a file
> fairly easily.  But, introduces a whole new set of issues, like what
> BMAP means if only a part of the granule is in-core, and do you need a
> new option to differentiate BMAP_AND vs. BMAP_OR operations.

I don't see exactly what you mention here, but I agree that it makes
more complexity and might not be easy to keep code maintenability.

> I honestly think we need to take a step back and enumerate what you're
> trying to do here before going any further.

OK, I try it. (Please correct/add if you find something I should)

What: typical usecases is like below:
 1. mincore()'s variant for page cache. Exporting residency information
    (PageUpdate) is required.
 2. Helping IO control from userspace. Exporting relevant page flags
    (PageDirty and PageWriteback, or others if needed) and NUMA node
    information is required.
 3. Error page check. This is an essential part of "error reporting" patchset
    I'm developing now, where I make error information sticky on page cache
    tree to let userspace take proper actions (the final goal is to avoid
    consuming corrupted data.) Exporting PageError is required.
 4. (debugging) Page offset to PFN translation. This is for debugging,
    so should be available only for privileged users.

How: attempt to do these better:
 - extensiblility. As we experience now on mincore(), Page residency
   information is not enough (for example to predict necessity of fsync.)
   We want to avoid adding new syscalls per future requirements of new
   information.
 - efficiency. For handling large (sparse) files and/or hugepages, BMAP
   type interface is not optimal, so "table" type interface is useful
   to avoid meaningless data transfer. 


Considering the objection for exporting bare page flags from Christoph,
I'm thinking that we had better export some "translated" page flags.
For example, we now only export residency info, so let's define it
as FINCORE_RESIDENCY bit, and let's define FINCORE_NEEDSYNC, which
makes kernel to export a bit of PageDirty && !PageWriteback.
Maybe we can combine them if we like, and if we do, the user buffer will
be filled with 2 bit entries per page on return.
This is extensible and the memory footprint is minimum.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/3] mm: introduce fincore()
@ 2014-07-11 16:53               ` Naoya Horiguchi
  0 siblings, 0 replies; 45+ messages in thread
From: Naoya Horiguchi @ 2014-07-11 16:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Konstantin Khlebnikov, Wu Fengguang,
	Arnaldo Carvalho de Melo, Borislav Petkov, Kirill A. Shutemov,
	Johannes Weiner, Rusty Russell, David Miller, Andres Freund,
	linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	Michael Kerrisk, Linux API, Naoya Horiguchi

On Tue, Jul 08, 2014 at 03:32:30PM -0700, Dave Hansen wrote:
> On 07/08/2014 01:41 PM, Naoya Horiguchi wrote:
> >> >  It would only set the first two bytes of a
> >> > 256k BMAP buffer since only two pages were encountered in the radix tree.
> > Hmm, this example shows me a problem, thanks.
> > 
> > If the user knows the fd is for 1GB hugetlbfs file, it just prepares
> > the 2 bytes buffer, so no problem.
> > But if the user doesn't know whether the fd is from hugetlbfs file,
> > the user must prepare the large buffer, though only first few bytes
> > are used. And the more problematic is that the user could interpret
> > the data in buffer differently:
> >   1. only the first two 4kB-pages are loaded in the 2GB range,
> >   2. two 1GB-pages are loaded.
> > So for such callers, fincore() must notify the relevant page size
> > in some way on return.
> > Returning it via fincore_extra is my first thought but I'm not sure
> > if it's elegant enough.
> 
> That does limit the interface to being used on a single page size per
> call, which doesn't sound too bad since we don't mix page sizes in a
> single file.  But, you mentioned using this interface along with
> /proc/$pid/mem.  How would this deal with a process which had two sizes
> of pages mapped?

Hmm, we should handle everything (including hugetlbfs) in 4kB page in
BMAP mode, because the position of the data in user buffer has the meaning.
And maybe it should be the case basically for in extensible modes, but
only in FINCORE_PGOFF mode (where no data of holes is passed to userspace,
and per-page entry contains offset information, so the in-buffer position
doesn't mean anything,) we can skip tail pages in the natural manner.

In this approach, fincore(FINCORE_PGOFF) returns not only pgoff, but also
page order (encoded in highest bits in pgoff field?). The callers must be
prepared to handle different page sizes.

So if users want to avoid lots of tail data for hugetlbfs pages,
using FINCORE_PGOFF mode is recommended for them.

That allows us to handle regular files, hugetlbfs files and /proc/$pid/mem
consistently.

> Another option would be to have userspace pass in its desired
> granularity.  Such an interface could be used to find holes in a file
> fairly easily.  But, introduces a whole new set of issues, like what
> BMAP means if only a part of the granule is in-core, and do you need a
> new option to differentiate BMAP_AND vs. BMAP_OR operations.

I don't see exactly what you mention here, but I agree that it makes
more complexity and might not be easy to keep code maintenability.

> I honestly think we need to take a step back and enumerate what you're
> trying to do here before going any further.

OK, I try it. (Please correct/add if you find something I should)

What: typical usecases is like below:
 1. mincore()'s variant for page cache. Exporting residency information
    (PageUpdate) is required.
 2. Helping IO control from userspace. Exporting relevant page flags
    (PageDirty and PageWriteback, or others if needed) and NUMA node
    information is required.
 3. Error page check. This is an essential part of "error reporting" patchset
    I'm developing now, where I make error information sticky on page cache
    tree to let userspace take proper actions (the final goal is to avoid
    consuming corrupted data.) Exporting PageError is required.
 4. (debugging) Page offset to PFN translation. This is for debugging,
    so should be available only for privileged users.

How: attempt to do these better:
 - extensiblility. As we experience now on mincore(), Page residency
   information is not enough (for example to predict necessity of fsync.)
   We want to avoid adding new syscalls per future requirements of new
   information.
 - efficiency. For handling large (sparse) files and/or hugepages, BMAP
   type interface is not optimal, so "table" type interface is useful
   to avoid meaningless data transfer. 


Considering the objection for exporting bare page flags from Christoph,
I'm thinking that we had better export some "translated" page flags.
For example, we now only export residency info, so let's define it
as FINCORE_RESIDENCY bit, and let's define FINCORE_NEEDSYNC, which
makes kernel to export a bit of PageDirty && !PageWriteback.
Maybe we can combine them if we like, and if we do, the user buffer will
be filled with 2 bit entries per page on return.
This is extensible and the memory footprint is minimum.

Thanks,
Naoya Horiguchi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2014-07-11 16:55 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-07 18:00 [PATCH v3 0/3] mm: introduce fincore() v3 Naoya Horiguchi
2014-07-07 18:00 ` Naoya Horiguchi
2014-07-07 18:00 ` [PATCH v3 1/3] mm: introduce fincore() Naoya Horiguchi
2014-07-07 18:00   ` Naoya Horiguchi
2014-07-07 19:01   ` Dave Hansen
2014-07-07 19:01     ` Dave Hansen
2014-07-07 20:21     ` Naoya Horiguchi
2014-07-07 20:21       ` Naoya Horiguchi
2014-07-07 20:43       ` Dave Hansen
2014-07-07 20:43         ` Dave Hansen
2014-07-07 21:48         ` Naoya Horiguchi
2014-07-07 21:48           ` Naoya Horiguchi
2014-07-07 22:44           ` Dave Hansen
2014-07-07 22:44             ` Dave Hansen
2014-07-08 15:35             ` Naoya Horiguchi
2014-07-08 15:35               ` Naoya Horiguchi
2014-07-08 19:03     ` Naoya Horiguchi
2014-07-08 19:03       ` Naoya Horiguchi
2014-07-08 19:42       ` Dave Hansen
2014-07-08 19:42         ` Dave Hansen
2014-07-08 20:41         ` Naoya Horiguchi
2014-07-08 20:41           ` Naoya Horiguchi
2014-07-08 22:32           ` Dave Hansen
2014-07-08 22:32             ` Dave Hansen
2014-07-11 16:53             ` Naoya Horiguchi
2014-07-11 16:53               ` Naoya Horiguchi
2014-07-07 18:00 ` [PATCH v3 2/3] selftests/fincore: add test code for fincore() Naoya Horiguchi
2014-07-07 18:00   ` Naoya Horiguchi
2014-07-07 18:00 ` [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2) Naoya Horiguchi
2014-07-07 18:00   ` Naoya Horiguchi
2014-07-07 19:08   ` Dave Hansen
2014-07-07 19:08     ` Dave Hansen
2014-07-07 19:08     ` Dave Hansen
2014-07-07 20:59     ` Naoya Horiguchi
2014-07-07 20:59       ` Naoya Horiguchi
2014-07-07 22:34       ` Dave Hansen
2014-07-07 22:34         ` Dave Hansen
2014-07-08 15:43         ` Naoya Horiguchi
2014-07-08 15:43           ` Naoya Horiguchi
2014-07-08 12:16 ` [PATCH v3 0/3] mm: introduce fincore() v3 Christoph Hellwig
2014-07-08 12:16   ` Christoph Hellwig
2014-07-08 13:27   ` Naoya Horiguchi
2014-07-08 13:27     ` Naoya Horiguchi
2014-07-09  8:51     ` Christoph Hellwig
2014-07-09  8:51       ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.