linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
@ 2019-07-26 15:23 Joel Fernandes (Google)
  2019-07-26 15:23 ` [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Joel Fernandes (Google) @ 2019-07-26 15:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Alexey Dobriyan, Andrew Morton, Brendan Gregg, Christian Hansen,
	dancol, fmayer, joaodias, joelaf, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, minchan, namhyung, Roman Gushchin,
	Stephen Rothwell, surenb, tkjos, Vladimir Davydov,
	Vlastimil Babka, wvw

The page_idle tracking feature currently requires looking up the pagemap
for a process followed by interacting with /sys/kernel/mm/page_idle.
On Android devices, looking up the PFN from pagemap is not supported for
unprivileged processes: it requires CAP_SYS_ADMIN and otherwise reads
back 0 for the PFN.

This patch adds support to directly interact with page_idle tracking at
the PID level by introducing a /proc/<pid>/page_idle file.  It follows
the exact same semantics as the global /sys/kernel/mm/page_idle, but now
looking up PFN through pagemap is not needed since the interface uses
virtual frame numbers, and at the same time also does not require
SYS_ADMIN.

In Android, we are using this for the heap profiler (heapprofd), which
profiles and pinpoints code paths that allocate memory and leave it
idle for long periods of time. This method solves the security issue
of userspace learning the PFN, and is also shown to yield better
results than the pagemap lookup, the theory being that the window in
which the address space can change is reduced by eliminating the
intermediate pagemap lookup stage. In virtual address indexing, the
process's mmap_sem is held for the duration of the access.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---
v2->v3:
Removed a kfree that was no longer needed once the GFP_ATOMIC
allocations were dropped.

v1->v2:
Mark swap ptes as idle (Minchan)
Avoid need for GFP_ATOMIC (Andrew)
Get rid of idle_page_list lock by moving list to stack

Internal review -> v1:
Fixes from Suren.
Corrections to change log, docs (Florian, Sandeep)

 fs/proc/base.c            |   3 +
 fs/proc/internal.h        |   1 +
 fs/proc/task_mmu.c        |  57 +++++++
 include/linux/page_idle.h |   4 +
 mm/page_idle.c            | 340 +++++++++++++++++++++++++++++++++-----
 5 files changed, 360 insertions(+), 45 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 77eb628ecc7f..a58dd74606e9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
+#endif
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cd0c8d5ce9a1..bc9371880c63 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_page_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4d2b860dbc3f..11ccc53da38e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
 	.open		= pagemap_open,
 	.release	= pagemap_release,
 };
+
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	int ret;
+	struct task_struct *tsk = get_proc_task(file_inode(file));
+
+	if (!tsk)
+		return -EINVAL;
+	ret = page_idle_proc_read(file, buf, count, ppos, tsk);
+	put_task_struct(tsk);
+	return ret;
+}
+
+static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	int ret;
+	struct task_struct *tsk = get_proc_task(file_inode(file));
+
+	if (!tsk)
+		return -EINVAL;
+	ret = page_idle_proc_write(file, (char __user *)buf, count, ppos, tsk);
+	put_task_struct(tsk);
+	return ret;
+}
+
+static int proc_page_idle_open(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm;
+
+	mm = proc_mem_open(inode, PTRACE_MODE_READ);
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+	file->private_data = mm;
+	return 0;
+}
+
+static int proc_page_idle_release(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm = file->private_data;
+
+	if (mm)
+		mmdrop(mm);
+	return 0;
+}
+
+const struct file_operations proc_page_idle_operations = {
+	.llseek		= mem_lseek, /* borrow this */
+	.read		= proc_page_idle_read,
+	.write		= proc_page_idle_write,
+	.open		= proc_page_idle_open,
+	.release	= proc_page_idle_release,
+};
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f1bc2640d85e 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
 }
 #endif /* CONFIG_64BIT */
 
+ssize_t page_idle_proc_write(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
+ssize_t page_idle_proc_read(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
 #else /* !CONFIG_IDLE_PAGE_TRACKING */
 
 static inline bool page_is_young(struct page *page)
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..86244f7f1faa 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -5,12 +5,15 @@
 #include <linux/sysfs.h>
 #include <linux/kobject.h>
 #include <linux/mm.h>
-#include <linux/mmzone.h>
-#include <linux/pagemap.h>
-#include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
+#include <linux/mmzone.h>
 #include <linux/page_ext.h>
 #include <linux/page_idle.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/sched/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #define BITMAP_CHUNK_SIZE	sizeof(u64)
 #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
@@ -25,18 +28,13 @@
  * page tracking. With such an indicator of user pages we can skip isolated
  * pages, but since there are not usually many of them, it will hardly affect
  * the overall result.
- *
- * This function tries to get a user memory page by pfn as described above.
  */
-static struct page *page_idle_get_page(unsigned long pfn)
+static struct page *page_idle_get_page(struct page *page_in)
 {
 	struct page *page;
 	pg_data_t *pgdat;
 
-	if (!pfn_valid(pfn))
-		return NULL;
-
-	page = pfn_to_page(pfn);
+	page = page_in;
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
@@ -51,6 +49,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
 	return page;
 }
 
+/*
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *page_idle_get_page_pfn(unsigned long pfn)
+{
+
+	if (!pfn_valid(pfn))
+		return NULL;
+
+	return page_idle_get_page(pfn_to_page(pfn));
+}
+
 static bool page_idle_clear_pte_refs_one(struct page *page,
 					struct vm_area_struct *vma,
 					unsigned long addr, void *arg)
@@ -118,6 +128,47 @@ static void page_idle_clear_pte_refs(struct page *page)
 		unlock_page(page);
 }
 
+/* Helper to get the start and end frame given a pos and count */
+static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long max_frame;
+
+	/* If an mm is not given, assume we want physical frames */
+	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
+
+	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
+		return -EINVAL;
+
+	*start = pos * BITS_PER_BYTE;
+	if (*start >= max_frame)
+		return -ENXIO;
+
+	*end = *start + count * BITS_PER_BYTE;
+	if (*end > max_frame)
+		*end = max_frame;
+	return 0;
+}
+
+static bool page_really_idle(struct page *page)
+{
+	if (!page)
+		return false;
+
+	if (page_is_idle(page)) {
+		/*
+		 * The page might have been referenced via a
+		 * pte, in which case it is not idle. Clear
+		 * refs and recheck.
+		 */
+		page_idle_clear_pte_refs(page);
+		if (page_is_idle(page))
+			return true;
+	}
+
+	return false;
+}
+
 static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 				     struct bin_attribute *attr, char *buf,
 				     loff_t pos, size_t count)
@@ -125,35 +176,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 	u64 *out = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
+	int bit, ret;
 
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return 0;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret == -ENXIO)
+		return 0;  /* Reads beyond max_pfn do nothing */
+	else if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if (!bit)
 			*out = 0ULL;
-		page = page_idle_get_page(pfn);
-		if (page) {
-			if (page_is_idle(page)) {
-				/*
-				 * The page might have been referenced via a
-				 * pte, in which case it is not idle. Clear
-				 * refs and recheck.
-				 */
-				page_idle_clear_pte_refs(page);
-				if (page_is_idle(page))
-					*out |= 1ULL << bit;
-			}
+		page = page_idle_get_page_pfn(pfn);
+		if (page && page_really_idle(page)) {
+			*out |= 1ULL << bit;
 			put_page(page);
 		}
 		if (bit == BITMAP_CHUNK_BITS - 1)
@@ -170,23 +207,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
 	const u64 *in = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
+	int bit, ret;
 
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return -ENXIO;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if ((*in >> bit) & 1) {
-			page = page_idle_get_page(pfn);
+			page = page_idle_get_page_pfn(pfn);
 			if (page) {
 				page_idle_clear_pte_refs(page);
 				set_page_idle(page);
@@ -224,6 +254,226 @@ struct page_ext_operations page_idle_ops = {
 };
 #endif
 
+/*  page_idle tracking for /proc/<pid>/page_idle */
+
+struct page_node {
+	struct page *page;
+	unsigned long addr;
+	struct list_head list;
+};
+
+struct page_idle_proc_priv {
+	unsigned long start_addr;
+	char *buffer;
+	int write;
+
+	/* Pre-allocate and provide nodes to add_page_idle_list() */
+	struct page_node *page_nodes;
+	int cur_page_node;
+	struct list_head *idle_page_list;
+};
+
+/*
+ * Add a page to the idle page list. page can be NULL if pte is
+ * from a swapped page.
+ */
+static void add_page_idle_list(struct page *page,
+			       unsigned long addr, struct mm_walk *walk)
+{
+	struct page *page_get = NULL;
+	struct page_node *pn;
+	int bit;
+	unsigned long frames;
+	struct page_idle_proc_priv *priv = walk->private;
+	u64 *chunk = (u64 *)priv->buffer;
+
+	if (priv->write) {
+		/* Find whether this page was asked to be marked */
+		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
+		bit = frames % BITMAP_CHUNK_BITS;
+		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
+		if (((*chunk >> bit) & 1) == 0)
+			return;
+	}
+
+	if (page) {
+		page_get = page_idle_get_page(page);
+		if (!page_get)
+			return;
+	}
+
+	pn = &(priv->page_nodes[priv->cur_page_node++]);
+	pn->page = page_get;
+	pn->addr = addr;
+	list_add(&pn->list, priv->idle_page_list);
+}
+
+static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	pte_t *pte;
+	spinlock_t *ptl;
+	struct page *page;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (pmd_present(*pmd)) {
+			page = follow_trans_huge_pmd(vma, addr, pmd,
+						     FOLL_DUMP|FOLL_WRITE);
+			if (!IS_ERR_OR_NULL(page))
+				add_page_idle_list(page, addr, walk);
+		}
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		/*
+		 * We add swapped pages to the idle_page_list so that we can
+		 * report to userspace that they are idle.
+		 */
+		if (is_swap_pte(*pte)) {
+			add_page_idle_list(NULL, addr, walk);
+			continue;
+		}
+
+		if (!pte_present(*pte))
+			continue;
+
+		page = vm_normal_page(vma, addr, *pte);
+		if (page)
+			add_page_idle_list(page, addr, walk);
+	}
+
+	pte_unmap_unlock(pte - 1, ptl);
+	return 0;
+}
+
+ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
+			       size_t count, loff_t *pos,
+			       struct task_struct *tsk, int write)
+{
+	int ret;
+	char *buffer;
+	u64 *out;
+	unsigned long start_addr, end_addr, start_frame, end_frame;
+	struct mm_struct *mm = file->private_data;
+	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
+	struct page_node *cur;
+	struct page_idle_proc_priv priv;
+	bool walk_error = false;
+	LIST_HEAD(idle_page_list);
+
+	if (!mm || !mmget_not_zero(mm))
+		return -EINVAL;
+
+	if (count > PAGE_SIZE)
+		count = PAGE_SIZE;
+
+	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buffer) {
+		ret = -ENOMEM;
+		goto out_mmput;
+	}
+	out = (u64 *)buffer;
+
+	if (write && copy_from_user(buffer, ubuff, count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
+	if (ret)
+		goto out;
+
+	start_addr = (start_frame << PAGE_SHIFT);
+	end_addr = (end_frame << PAGE_SHIFT);
+	priv.buffer = buffer;
+	priv.start_addr = start_addr;
+	priv.write = write;
+
+	priv.idle_page_list = &idle_page_list;
+	priv.cur_page_node = 0;
+	priv.page_nodes = kzalloc(sizeof(struct page_node) *
+				  (end_frame - start_frame), GFP_KERNEL);
+	if (!priv.page_nodes) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	walk.private = &priv;
+	walk.mm = mm;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * idle_page_list is needed because walk_page_vma() holds ptlock which
+	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
+	 * pages first, and then call page_idle_clear_pte_refs().
+	 */
+	ret = walk_page_range(start_addr, end_addr, &walk);
+	if (ret)
+		walk_error = true;
+
+	list_for_each_entry(cur, &idle_page_list, list) {
+		int bit, index;
+		unsigned long off;
+		struct page *page = cur->page;
+
+		if (unlikely(walk_error))
+			goto remove_page;
+
+		if (write) {
+			if (page) {
+				page_idle_clear_pte_refs(page);
+				set_page_idle(page);
+			}
+		} else {
+			if (!page || page_really_idle(page)) {
+				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
+				bit = off % BITMAP_CHUNK_BITS;
+				index = off / BITMAP_CHUNK_BITS;
+				out[index] |= 1ULL << bit;
+			}
+		}
+remove_page:
+		if (page)
+			put_page(page);
+	}
+
+	if (!write && !walk_error)
+		ret = copy_to_user(ubuff, buffer, count);
+
+	up_read(&mm->mmap_sem);
+	kfree(priv.page_nodes);
+out:
+	kfree(buffer);
+out_mmput:
+	mmput(mm);
+	if (!ret)
+		ret = count;
+	return ret;
+
+}
+
+ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
+			    size_t count, loff_t *pos, struct task_struct *tsk)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
+}
+
+ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
+			     size_t count, loff_t *pos, struct task_struct *tsk)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
+}
+
 static int __init page_idle_init(void)
 {
 	int err;
-- 
2.22.0.709.g102302147b-goog

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing
  2019-07-26 15:23 [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes (Google)
@ 2019-07-26 15:23 ` Joel Fernandes (Google)
  2019-07-26 20:17   ` sspatil
  2019-07-31  6:44   ` Mike Rapoport
  2019-07-30 13:06 ` [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes
  2019-07-31  8:53 ` Minchan Kim
  2 siblings, 2 replies; 10+ messages in thread
From: Joel Fernandes (Google) @ 2019-07-26 15:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Alexey Dobriyan, Andrew Morton, Brendan Gregg, Christian Hansen,
	dancol, fmayer, joaodias, joelaf, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, minchan, namhyung, Roman Gushchin,
	Stephen Rothwell, surenb, tkjos, Vladimir Davydov,
	Vlastimil Babka, wvw

This patch updates the documentation with the new page_idle tracking
feature which uses virtual address indexing.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/mm/idle_page_tracking.rst     | 43 ++++++++++++++++---
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
index df9394fb39c2..1eeac78c94a7 100644
--- a/Documentation/admin-guide/mm/idle_page_tracking.rst
+++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
@@ -19,10 +19,14 @@ It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
 
 User API
 ========
+There are 2 ways to access the idle page tracking API. One uses physical
+address indexing, another uses a simpler virtual address indexing scheme.
 
-The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
-Currently, it consists of the only read-write file,
-``/sys/kernel/mm/page_idle/bitmap``.
+Physical address indexing
+-------------------------
+The idle page tracking API for physical address indexing using page frame
+numbers (PFN) is located at ``/sys/kernel/mm/page_idle``.  Currently, it
+consists of the only read-write file, ``/sys/kernel/mm/page_idle/bitmap``.
 
 The file implements a bitmap where each bit corresponds to a memory page. The
 bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
@@ -74,6 +78,31 @@ See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
 information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
 ``/proc/kpagecgroup``.
 
+Virtual address indexing
+------------------------
+The idle page tracking API for virtual address indexing using virtual page
+frame numbers (VFN) is located at ``/proc/<pid>/page_idle``. It is a bitmap
+that follows the same semantics as ``/sys/kernel/mm/page_idle/bitmap``
+except that it uses virtual instead of physical frame numbers.
+
+This idle page tracking API does not need deal with PFN so it does not require
+prior lookups of ``pagemap`` in order to find if page is idle or not. This is
+an advantage on some systems where looking up PFN is considered a security
+issue.  Also in some cases, this interface could be slightly more reliable to
+use than physical address indexing, since in physical address indexing, address
+space changes can occur between reading the ``pagemap`` and reading the
+``bitmap``, while in virtual address indexing, the process's ``mmap_sem`` is
+held for the duration of the access.
+
+To estimate the amount of pages that are not used by a workload one should:
+
+ 1. Mark all the workload's pages as idle by setting corresponding bits in
+    ``/proc/<pid>/page_idle``.
+
+ 2. Wait until the workload accesses its working set.
+
+ 3. Read ``/proc/<pid>/page_idle`` and count the number of bits set.
+
 .. _impl_details:
 
 Implementation Details
@@ -99,10 +128,10 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
 exceeding the dirty memory limit, it is not marked referenced.
 
 The idle memory tracking feature adds a new page flag, the Idle flag. This flag
-is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
-:ref:`User API <user_api>`
-section), and cleared automatically whenever a page is referenced as defined
-above.
+is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` for physical
+addressing or by writing to ``/proc/<pid>/page_idle`` for virtual
+addressing (see the :ref:`User API <user_api>` section), and cleared
+automatically whenever a page is referenced as defined above.
 
 When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
 mapped to, otherwise we will not be able to detect accesses to the page coming
-- 
2.22.0.709.g102302147b-goog



* Re: [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing
  2019-07-26 15:23 ` [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
@ 2019-07-26 20:17   ` sspatil
  2019-07-26 20:26     ` Joel Fernandes
  2019-07-31  6:44   ` Mike Rapoport
  1 sibling, 1 reply; 10+ messages in thread
From: sspatil @ 2019-07-26 20:17 UTC (permalink / raw)
  To: joel, linux-kernel, adobriyan, akpm, bgregg, chansen3, dancol,
	fmayer, joaodias, joelaf, corbet, keescook, kernel-team,
	linux-api, linux-doc, linux-fsdevel, linux-mm, mhocko, rppt,
	minchan, namhyung, guro, sfr, surenb, tkjos, vdavydov.dev,
	vbabka, wvw, sspatil+mutt
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
	Christian Hansen, dancol, fmayer, joaodias, joelaf,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, minchan,
	namhyung, Roman Gushchin, Stephen Rothwell, surenb, tkjos,
	Vladimir Davydov, Vlastimil Babka, wvw

Thanks Joel, just a couple of nits for the doc inline below. Other than that,

Reviewed-by: Sandeep Patil <sspatil@google.com>

I'll plan on making changes to Android to use this instead of the pagemap +
page_idle. I think it will also be considerably faster.

On Fri, Jul 26, 2019 at 11:23:19AM -0400, Joel Fernandes (Google) wrote:
> This patch updates the documentation with the new page_idle tracking
> feature which uses virtual address indexing.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  .../admin-guide/mm/idle_page_tracking.rst     | 43 ++++++++++++++++---
>  1 file changed, 36 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
> index df9394fb39c2..1eeac78c94a7 100644
> --- a/Documentation/admin-guide/mm/idle_page_tracking.rst
> +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
> @@ -19,10 +19,14 @@ It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
>  
>  User API
>  ========
> +There are 2 ways to access the idle page tracking API. One uses physical
> +address indexing, another uses a simpler virtual address indexing scheme.
>  
> -The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
> -Currently, it consists of the only read-write file,
> -``/sys/kernel/mm/page_idle/bitmap``.
> +Physical address indexing
> +-------------------------
> +The idle page tracking API for physical address indexing using page frame
> +numbers (PFN) is located at ``/sys/kernel/mm/page_idle``.  Currently, it
> +consists of the only read-write file, ``/sys/kernel/mm/page_idle/bitmap``.
>  
>  The file implements a bitmap where each bit corresponds to a memory page. The
>  bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
> @@ -74,6 +78,31 @@ See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
>  information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
>  ``/proc/kpagecgroup``.
>  
> +Virtual address indexing
> +------------------------
> +The idle page tracking API for virtual address indexing using virtual page
> +frame numbers (VFN) is located at ``/proc/<pid>/page_idle``. It is a bitmap
> +that follows the same semantics as ``/sys/kernel/mm/page_idle/bitmap``
> +except that it uses virtual instead of physical frame numbers.
> +
> +This idle page tracking API does not need deal with PFN so it does not require

s/need//

> +prior lookups of ``pagemap`` in order to find if page is idle or not. This is

s/in order to find if page is idle or not//

> +an advantage on some systems where looking up PFN is considered a security
> +issue.  Also in some cases, this interface could be slightly more reliable to
> +use than physical address indexing, since in physical address indexing, address
> +space changes can occur between reading the ``pagemap`` and reading the
> +``bitmap``, while in virtual address indexing, the process's ``mmap_sem`` is
> +held for the duration of the access.
> +
> +To estimate the amount of pages that are not used by a workload one should:
> +
> + 1. Mark all the workload's pages as idle by setting corresponding bits in
> +    ``/proc/<pid>/page_idle``.
> +
> + 2. Wait until the workload accesses its working set.
> +
> + 3. Read ``/proc/<pid>/page_idle`` and count the number of bits set.
> +
>  .. _impl_details:
>  
>  Implementation Details
> @@ -99,10 +128,10 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
>  exceeding the dirty memory limit, it is not marked referenced.
>  
>  The idle memory tracking feature adds a new page flag, the Idle flag. This flag
> -is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
> -:ref:`User API <user_api>`
> -section), and cleared automatically whenever a page is referenced as defined
> -above.
> +is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` for physical
> +addressing or by writing to ``/proc/<pid>/page_idle`` for virtual
> +addressing (see the :ref:`User API <user_api>` section), and cleared
> +automatically whenever a page is referenced as defined above.
>  
>  When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
>  mapped to, otherwise we will not be able to detect accesses to the page coming
> -- 
> 2.22.0.709.g102302147b-goog
> 


* Re: [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing
  2019-07-26 20:17   ` sspatil
@ 2019-07-26 20:26     ` Joel Fernandes
  0 siblings, 0 replies; 10+ messages in thread
From: Joel Fernandes @ 2019-07-26 20:26 UTC (permalink / raw)
  To: sspatil
  Cc: linux-kernel, adobriyan, akpm, bgregg, chansen3, dancol, fmayer,
	joaodias, corbet, keescook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, mhocko, rppt, minchan, namhyung, guro,
	sfr, surenb, tkjos, vdavydov.dev, vbabka, wvw, sspatil+mutt

On Fri, Jul 26, 2019 at 01:17:10PM -0700, sspatil@google.com wrote:
> Thanks Joel, just a couple of nits for the doc inline below. Other than that,
> 
> Reviewed-by: Sandeep Patil <sspatil@google.com>

Thanks!

> I'll plan on making changes to Android to use this instead of the pagemap +
> page_idle. I think it will also be considerably faster.

Cool, glad to know.

> On Fri, Jul 26, 2019 at 11:23:19AM -0400, Joel Fernandes (Google) wrote:
> > This patch updates the documentation with the new page_idle tracking
> > feature which uses virtual address indexing.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  .../admin-guide/mm/idle_page_tracking.rst     | 43 ++++++++++++++++---
> >  1 file changed, 36 insertions(+), 7 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
> > index df9394fb39c2..1eeac78c94a7 100644
> > --- a/Documentation/admin-guide/mm/idle_page_tracking.rst
> > +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
> > @@ -19,10 +19,14 @@ It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
> >  
> >  User API
> >  ========
> > +There are 2 ways to access the idle page tracking API. One uses physical
> > +address indexing, another uses a simpler virtual address indexing scheme.
> >  
> > -The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
> > -Currently, it consists of the only read-write file,
> > -``/sys/kernel/mm/page_idle/bitmap``.
> > +Physical address indexing
> > +-------------------------
> > +The idle page tracking API for physical address indexing using page frame
> > +numbers (PFN) is located at ``/sys/kernel/mm/page_idle``.  Currently, it
> > +consists of the only read-write file, ``/sys/kernel/mm/page_idle/bitmap``.
> >  
> >  The file implements a bitmap where each bit corresponds to a memory page. The
> >  bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
> > @@ -74,6 +78,31 @@ See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
> >  information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
> >  ``/proc/kpagecgroup``.
> >  
> > +Virtual address indexing
> > +------------------------
> > +The idle page tracking API for virtual address indexing using virtual page
> > +frame numbers (VFN) is located at ``/proc/<pid>/page_idle``. It is a bitmap
> > +that follows the same semantics as ``/sys/kernel/mm/page_idle/bitmap``
> > +except that it uses virtual instead of physical frame numbers.
> > +
> > +This idle page tracking API does not need deal with PFN so it does not require
> 
> s/need//
> 
> > +prior lookups of ``pagemap`` in order to find if page is idle or not. This is
> 
> s/in order to find if page is idle or not//

Fixed both, thank you! Will send out update soon.

thanks,

 - Joel



* Re: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
  2019-07-26 15:23 [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes (Google)
  2019-07-26 15:23 ` [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
@ 2019-07-30 13:06 ` Joel Fernandes
  2019-07-31  8:53 ` Minchan Kim
  2 siblings, 0 replies; 10+ messages in thread
From: Joel Fernandes @ 2019-07-30 13:06 UTC (permalink / raw)
  To: LKML
  Cc: Alexey Dobriyan, Andrew Morton, Brendan Gregg, Christian Hansen,
	Daniel Colascione, Florian Mayer, John Dias, Joel Fernandes,
	Jonathan Corbet, Kees Cook, kernel-team, Linux API,
	open list:DOCUMENTATION, Linux FS Devel, linux-mm, Michal Hocko,
	Mike Rapoport, Minchan Kim, Namhyung Kim, Roman Gushchin,
	Stephen Rothwell, Suren Baghdasaryan, Todd Kjos,
	Vladimir Davydov, Vlastimil Babka, Wei Wang

On Fri, Jul 26, 2019 at 11:23 AM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
>
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
>
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
> ---
> v2->v3:
> Fixed a bug where I was doing a kfree that is not needed due to not
> needing to do GFP_ATOMIC allocations.
>
> v1->v2:
> Mark swap ptes as idle (Minchan)
> Avoid need for GFP_ATOMIC (Andrew)
> Get rid of idle_page_list lock by moving list to stack

I believe all suggestions have been addressed.  Do these look good now?

thanks,

 - Joel



> Internal review -> v1:
> Fixes from Suren.
> Corrections to change log, docs (Florian, Sandeep)
>
>  fs/proc/base.c            |   3 +
>  fs/proc/internal.h        |   1 +
>  fs/proc/task_mmu.c        |  57 +++++++
>  include/linux/page_idle.h |   4 +
>  mm/page_idle.c            | 340 +++++++++++++++++++++++++++++++++-----
>  5 files changed, 360 insertions(+), 45 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 77eb628ecc7f..a58dd74606e9 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
>         REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
>         REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),


* Re: [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing
  2019-07-26 15:23 ` [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
  2019-07-26 20:17   ` sspatil
@ 2019-07-31  6:44   ` Mike Rapoport
  1 sibling, 0 replies; 10+ messages in thread
From: Mike Rapoport @ 2019-07-31  6:44 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
	Christian Hansen, dancol, fmayer, joaodias, joelaf,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, minchan, namhyung,
	Roman Gushchin, Stephen Rothwell, surenb, tkjos,
	Vladimir Davydov, Vlastimil Babka, wvw

On Fri, Jul 26, 2019 at 11:23:19AM -0400, Joel Fernandes (Google) wrote:
> This patch updates the documentation with the new page_idle tracking
> feature which uses virtual address indexing.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

One nit below, otherwise

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  .../admin-guide/mm/idle_page_tracking.rst     | 43 ++++++++++++++++---
>  1 file changed, 36 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
> index df9394fb39c2..1eeac78c94a7 100644
> --- a/Documentation/admin-guide/mm/idle_page_tracking.rst
> +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
> @@ -19,10 +19,14 @@ It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
>  
>  User API
>  ========
> +There are 2 ways to access the idle page tracking API. One uses physical
> +address indexing, another uses a simpler virtual address indexing scheme.
>  
> -The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
> -Currently, it consists of the only read-write file,
> -``/sys/kernel/mm/page_idle/bitmap``.
> +Physical address indexing
> +-------------------------
> +The idle page tracking API for physical address indexing using page frame
> +numbers (PFN) is located at ``/sys/kernel/mm/page_idle``.  Currently, it
> +consists of the only read-write file, ``/sys/kernel/mm/page_idle/bitmap``.
>  
>  The file implements a bitmap where each bit corresponds to a memory page. The
>  bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
> @@ -74,6 +78,31 @@ See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
>  information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
>  ``/proc/kpagecgroup``.
>  
> +Virtual address indexing
> +------------------------
> +The idle page tracking API for virtual address indexing using virtual page
> +frame numbers (VFN) is located at ``/proc/<pid>/page_idle``. It is a bitmap
> +that follows the same semantics as ``/sys/kernel/mm/page_idle/bitmap``
> +except that it uses virtual instead of physical frame numbers.

Can you please make it more explicit that VFNs are in the <pid>'s address
space?

> +
> +This idle page tracking API does not need deal with PFN so it does not require
> +prior lookups of ``pagemap`` in order to find if page is idle or not. This is
> +an advantage on some systems where looking up PFN is considered a security
> +issue.  Also in some cases, this interface could be slightly more reliable to
> +use than physical address indexing, since in physical address indexing, address
> +space changes can occur between reading the ``pagemap`` and reading the
> +``bitmap``, while in virtual address indexing, the process's ``mmap_sem`` is
> +held for the duration of the access.
> +
> +To estimate the amount of pages that are not used by a workload one should:
> +
> + 1. Mark all the workload's pages as idle by setting corresponding bits in
> +    ``/proc/<pid>/page_idle``.
> +
> + 2. Wait until the workload accesses its working set.
> +
> + 3. Read ``/proc/<pid>/page_idle`` and count the number of bits set.
> +
>  .. _impl_details:
>  
>  Implementation Details
> @@ -99,10 +128,10 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
>  exceeding the dirty memory limit, it is not marked referenced.
>  
>  The idle memory tracking feature adds a new page flag, the Idle flag. This flag
> -is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
> -:ref:`User API <user_api>`
> -section), and cleared automatically whenever a page is referenced as defined
> -above.
> +is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` for physical
> +addressing or by writing to ``/proc/<pid>/page_idle`` for virtual
> +addressing (see the :ref:`User API <user_api>` section), and cleared
> +automatically whenever a page is referenced as defined above.
>  
>  When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
>  mapped to, otherwise we will not be able to detect accesses to the page coming
> -- 
> 2.22.0.709.g102302147b-goog
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
  2019-07-26 15:23 [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes (Google)
  2019-07-26 15:23 ` [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
  2019-07-30 13:06 ` [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes
@ 2019-07-31  8:53 ` Minchan Kim
  2019-07-31 17:19   ` Joel Fernandes
  2 siblings, 1 reply; 10+ messages in thread
From: Minchan Kim @ 2019-07-31  8:53 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
	Christian Hansen, dancol, fmayer, joaodias, joelaf,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, namhyung,
	Roman Gushchin, Stephen Rothwell, surenb, tkjos,
	Vladimir Davydov, Vlastimil Babka, wvw

Hi Joel,

On Fri, Jul 26, 2019 at 11:23:18AM -0400, Joel Fernandes (Google) wrote:
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> 
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
> 
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> ---
> v2->v3:
> Fixed a bug where I was doing a kfree that is not needed due to not
> needing to do GFP_ATOMIC allocations.
> 
> v1->v2:
> Mark swap ptes as idle (Minchan)
> Avoid need for GFP_ATOMIC (Andrew)
> Get rid of idle_page_list lock by moving list to stack
> 
> Internal review -> v1:
> Fixes from Suren.
> Corrections to change log, docs (Florian, Sandeep)
> 
>  fs/proc/base.c            |   3 +
>  fs/proc/internal.h        |   1 +
>  fs/proc/task_mmu.c        |  57 +++++++
>  include/linux/page_idle.h |   4 +
>  mm/page_idle.c            | 340 +++++++++++++++++++++++++++++++++-----
>  5 files changed, 360 insertions(+), 45 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 77eb628ecc7f..a58dd74606e9 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
>  	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
>  	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
>  	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
> +#endif
>  #endif
>  #ifdef CONFIG_SECURITY
>  	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index cd0c8d5ce9a1..bc9371880c63 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
>  extern const struct file_operations proc_pid_smaps_rollup_operations;
>  extern const struct file_operations proc_clear_refs_operations;
>  extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_page_idle_operations;
>  
>  extern unsigned long task_vsize(struct mm_struct *);
>  extern unsigned long task_statm(struct mm_struct *,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4d2b860dbc3f..11ccc53da38e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
>  	.open		= pagemap_open,
>  	.release	= pagemap_release,
>  };
> +
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
> +				   size_t count, loff_t *ppos)
> +{
> +	int ret;
> +	struct task_struct *tsk = get_proc_task(file_inode(file));
> +
> +	if (!tsk)
> +		return -EINVAL;
> +	ret = page_idle_proc_read(file, buf, count, ppos, tsk);
> +	put_task_struct(tsk);

Why do you need task_struct here? You already got the task in open
and got the mm there, so you could pass the mm here instead of the task.

> +	return ret;
> +}
> +
> +static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
> +				 size_t count, loff_t *ppos)
> +{
> +	int ret;
> +	struct task_struct *tsk = get_proc_task(file_inode(file));
> +
> +	if (!tsk)
> +		return -EINVAL;
> +	ret = page_idle_proc_write(file, (char __user *)buf, count, ppos, tsk);
> +	put_task_struct(tsk);
> +	return ret;
> +}
> +
> +static int proc_page_idle_open(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm;
> +
> +	mm = proc_mem_open(inode, PTRACE_MODE_READ);
> +	if (IS_ERR(mm))
> +		return PTR_ERR(mm);
> +	file->private_data = mm;
> +	return 0;
> +}
> +
> +static int proc_page_idle_release(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm = file->private_data;
> +
> +	if (mm)
> +		mmdrop(mm);
> +	return 0;
> +}
> +
> +const struct file_operations proc_page_idle_operations = {
> +	.llseek		= mem_lseek, /* borrow this */
> +	.read		= proc_page_idle_read,
> +	.write		= proc_page_idle_write,
> +	.open		= proc_page_idle_open,
> +	.release	= proc_page_idle_release,
> +};
> +#endif /* CONFIG_IDLE_PAGE_TRACKING */
> +
>  #endif /* CONFIG_PROC_PAGE_MONITOR */
>  
>  #ifdef CONFIG_NUMA
> diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> index 1e894d34bdce..f1bc2640d85e 100644
> --- a/include/linux/page_idle.h
> +++ b/include/linux/page_idle.h
> @@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
>  }
>  #endif /* CONFIG_64BIT */
>  
> +ssize_t page_idle_proc_write(struct file *file,
> +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> +ssize_t page_idle_proc_read(struct file *file,
> +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
>  #else /* !CONFIG_IDLE_PAGE_TRACKING */
>  
>  static inline bool page_is_young(struct page *page)
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index 295512465065..86244f7f1faa 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -5,12 +5,15 @@
>  #include <linux/sysfs.h>
>  #include <linux/kobject.h>
>  #include <linux/mm.h>
> -#include <linux/mmzone.h>
> -#include <linux/pagemap.h>
> -#include <linux/rmap.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/mmzone.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_idle.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/sched/mm.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  
>  #define BITMAP_CHUNK_SIZE	sizeof(u64)
>  #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> @@ -25,18 +28,13 @@
>   * page tracking. With such an indicator of user pages we can skip isolated
>   * pages, but since there are not usually many of them, it will hardly affect
>   * the overall result.
> - *
> - * This function tries to get a user memory page by pfn as described above.
>   */
> -static struct page *page_idle_get_page(unsigned long pfn)
> +static struct page *page_idle_get_page(struct page *page_in)

The function name looks weird after you changed the argument.
Maybe "bool check_valid_page(struct page *page)"?

>  {
>  	struct page *page;
>  	pg_data_t *pgdat;
>  
> -	if (!pfn_valid(pfn))
> -		return NULL;
> -
> -	page = pfn_to_page(pfn);
> +	page = page_in;
>  	if (!page || !PageLRU(page) ||
>  	    !get_page_unless_zero(page))
>  		return NULL;
> @@ -51,6 +49,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
>  	return page;
>  }
>  
> +/*
> + * This function tries to get a user memory page by pfn as described above.
> + */
> +static struct page *page_idle_get_page_pfn(unsigned long pfn)

So we could use page_idle_get_page name here.

> +{
> +
> +	if (!pfn_valid(pfn))
> +		return NULL;


    page = pfn_to_page(pfn);

    return check_valid_page(page) ? page : NULL;
> +
> +	return page_idle_get_page(pfn_to_page(pfn));
> +}
> +
>  static bool page_idle_clear_pte_refs_one(struct page *page,
>  					struct vm_area_struct *vma,
>  					unsigned long addr, void *arg)
> @@ -118,6 +128,47 @@ static void page_idle_clear_pte_refs(struct page *page)
>  		unlock_page(page);
>  }
>  
> +/* Helper to get the start and end frame given a pos and count */
> +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long max_frame;
> +
> +	/* If an mm is not given, assume we want physical frames */
> +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> +
> +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> +		return -EINVAL;
> +
> +	*start = pos * BITS_PER_BYTE;
> +	if (*start >= max_frame)
> +		return -ENXIO;
> +
> +	*end = *start + count * BITS_PER_BYTE;
> +	if (*end > max_frame)
> +		*end = max_frame;
> +	return 0;
> +}
> +
> +static bool page_really_idle(struct page *page)

Just minor:
Instead of creating a new API, could we combine this with page_is_idle
by introducing a further argument, pte_check?

    bool page_is_idle(struct page *page, bool pte_check);

> +{
> +	if (!page)
> +		return false;
> +
> +	if (page_is_idle(page)) {
> +		/*
> +		 * The page might have been referenced via a
> +		 * pte, in which case it is not idle. Clear
> +		 * refs and recheck.
> +		 */
> +		page_idle_clear_pte_refs(page);
> +		if (page_is_idle(page))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
>  				     struct bin_attribute *attr, char *buf,
>  				     loff_t pos, size_t count)
> @@ -125,35 +176,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
>  	u64 *out = (u64 *)buf;
>  	struct page *page;
>  	unsigned long pfn, end_pfn;
> -	int bit;
> +	int bit, ret;
>  
> -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> -		return -EINVAL;
> -
> -	pfn = pos * BITS_PER_BYTE;
> -	if (pfn >= max_pfn)
> -		return 0;
> -
> -	end_pfn = pfn + count * BITS_PER_BYTE;
> -	if (end_pfn > max_pfn)
> -		end_pfn = max_pfn;
> +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> +	if (ret == -ENXIO)
> +		return 0;  /* Reads beyond max_pfn do nothing */
> +	else if (ret)
> +		return ret;
>  
>  	for (; pfn < end_pfn; pfn++) {
>  		bit = pfn % BITMAP_CHUNK_BITS;
>  		if (!bit)
>  			*out = 0ULL;
> -		page = page_idle_get_page(pfn);
> -		if (page) {
> -			if (page_is_idle(page)) {
> -				/*
> -				 * The page might have been referenced via a
> -				 * pte, in which case it is not idle. Clear
> -				 * refs and recheck.
> -				 */
> -				page_idle_clear_pte_refs(page);
> -				if (page_is_idle(page))
> -					*out |= 1ULL << bit;
> -			}
> +		page = page_idle_get_page_pfn(pfn);
> +		if (page && page_really_idle(page)) {
> +			*out |= 1ULL << bit;
>  			put_page(page);
>  		}
>  		if (bit == BITMAP_CHUNK_BITS - 1)
> @@ -170,23 +207,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
>  	const u64 *in = (u64 *)buf;
>  	struct page *page;
>  	unsigned long pfn, end_pfn;
> -	int bit;
> +	int bit, ret;
>  
> -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> -		return -EINVAL;
> -
> -	pfn = pos * BITS_PER_BYTE;
> -	if (pfn >= max_pfn)
> -		return -ENXIO;
> -
> -	end_pfn = pfn + count * BITS_PER_BYTE;
> -	if (end_pfn > max_pfn)
> -		end_pfn = max_pfn;
> +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> +	if (ret)
> +		return ret;
>  
>  	for (; pfn < end_pfn; pfn++) {
>  		bit = pfn % BITMAP_CHUNK_BITS;
>  		if ((*in >> bit) & 1) {
> -			page = page_idle_get_page(pfn);
> +			page = page_idle_get_page_pfn(pfn);
>  			if (page) {
>  				page_idle_clear_pte_refs(page);
>  				set_page_idle(page);
> @@ -224,6 +254,226 @@ struct page_ext_operations page_idle_ops = {
>  };
>  #endif
>  
> +/*  page_idle tracking for /proc/<pid>/page_idle */
> +
> +struct page_node {
> +	struct page *page;
> +	unsigned long addr;
> +	struct list_head list;
> +};
> +
> +struct page_idle_proc_priv {
> +	unsigned long start_addr;
> +	char *buffer;
> +	int write;
> +
> +	/* Pre-allocate and provide nodes to add_page_idle_list() */
> +	struct page_node *page_nodes;
> +	int cur_page_node;
> +	struct list_head *idle_page_list;
> +};
> +
> +/*
> + * Add a page to the idle page list. page can be NULL if pte is
> + * from a swapped page.
> + */
> +static void add_page_idle_list(struct page *page,
> +			       unsigned long addr, struct mm_walk *walk)
> +{
> +	struct page *page_get = NULL;
> +	struct page_node *pn;
> +	int bit;
> +	unsigned long frames;
> +	struct page_idle_proc_priv *priv = walk->private;
> +	u64 *chunk = (u64 *)priv->buffer;
> +
> +	if (priv->write) {
> +		/* Find whether this page was asked to be marked */
> +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> +		bit = frames % BITMAP_CHUNK_BITS;
> +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> +		if (((*chunk >> bit) & 1) == 0)
> +			return;
> +	}
> +
> +	if (page) {
> +		page_get = page_idle_get_page(page);
> +		if (!page_get)
> +			return;
> +	}
> +
> +	pn = &(priv->page_nodes[priv->cur_page_node++]);
> +	pn->page = page_get;
> +	pn->addr = addr;
> +	list_add(&pn->list, priv->idle_page_list);
> +}
> +
> +static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
> +				    unsigned long end,
> +				    struct mm_walk *walk)
> +{
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +	struct page *page;
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		if (pmd_present(*pmd)) {
> +			page = follow_trans_huge_pmd(vma, addr, pmd,
> +						     FOLL_DUMP|FOLL_WRITE);
> +			if (!IS_ERR_OR_NULL(page))
> +				add_page_idle_list(page, addr, walk);
> +		}
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		/*
> +		 * We add swapped pages to the idle_page_list so that they
> +		 * can be reported to userspace as idle.
> +		 */
> +		if (is_swap_pte(*pte)) {

I suggested "let's consider every swapped-out page as IDLE", but
let's think about this case:

1. mark the process's heap as IDLE
2. the process touches its working set
3. the process's heap pages are swapped out by a memory spike or madvise
4. the heap profiler inspects the process's IDLE pages and is surprised
to find the entire heap idle.

It's a good scenario for other purposes, because the non-idle pages
(IOW, the working set) could be readahead when the app restarts.

Maybe squeeze the idle bit into the swap pte and check that instead.

> +			add_page_idle_list(NULL, addr, walk);
> +			continue;
> +		}
> +
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, *pte);
> +		if (page)
> +			add_page_idle_list(page, addr, walk);
> +	}
> +
> +	pte_unmap_unlock(pte - 1, ptl);
> +	return 0;
> +}
> +
> +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> +			       size_t count, loff_t *pos,
> +			       struct task_struct *tsk, int write)
> +{
> +	int ret;
> +	char *buffer;
> +	u64 *out;
> +	unsigned long start_addr, end_addr, start_frame, end_frame;
> +	struct mm_struct *mm = file->private_data;
> +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> +	struct page_node *cur;
> +	struct page_idle_proc_priv priv;
> +	bool walk_error = false;
> +	LIST_HEAD(idle_page_list);
> +
> +	if (!mm || !mmget_not_zero(mm))
> +		return -EINVAL;
> +
> +	if (count > PAGE_SIZE)
> +		count = PAGE_SIZE;
> +
> +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!buffer) {
> +		ret = -ENOMEM;
> +		goto out_mmput;
> +	}
> +	out = (u64 *)buffer;
> +
> +	if (write && copy_from_user(buffer, ubuff, count)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> +	if (ret)
> +		goto out;
> +
> +	start_addr = (start_frame << PAGE_SHIFT);
> +	end_addr = (end_frame << PAGE_SHIFT);
> +	priv.buffer = buffer;
> +	priv.start_addr = start_addr;
> +	priv.write = write;
> +
> +	priv.idle_page_list = &idle_page_list;
> +	priv.cur_page_node = 0;
> +	priv.page_nodes = kzalloc(sizeof(struct page_node) *
> +				  (end_frame - start_frame), GFP_KERNEL);
> +	if (!priv.page_nodes) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	walk.private = &priv;
> +	walk.mm = mm;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	/*
> +	 * idle_page_list is needed because walk_page_vma() holds ptlock which
> +	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
> +	 * pages first, and then call page_idle_clear_pte_refs().
> +	 */

Thanks for the comment; I was curious why you wanted to have
idle_page_list, and the reason is here.

How about making this /proc/<pid>/page_idle per-process granularity,
unlike the system-level /sys/xxx/page_idle? What I mean is: do not
check the rmap for references from arbitrary processes, but only
check accesses from the target process. That would be more appropriate
for a /proc/<pid>/ interface, good for per-process tracking, and
faster as well.

> +	ret = walk_page_range(start_addr, end_addr, &walk);
> +	if (ret)
> +		walk_error = true;
> +
> +	list_for_each_entry(cur, &idle_page_list, list) {
> +		int bit, index;
> +		unsigned long off;
> +		struct page *page = cur->page;
> +
> +		if (unlikely(walk_error))
> +			goto remove_page;
> +
> +		if (write) {
> +			if (page) {
> +				page_idle_clear_pte_refs(page);
> +				set_page_idle(page);
> +			}
> +		} else {
> +			if (!page || page_really_idle(page)) {
> +				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
> +				bit = off % BITMAP_CHUNK_BITS;
> +				index = off / BITMAP_CHUNK_BITS;
> +				out[index] |= 1ULL << bit;
> +			}
> +		}
> +remove_page:
> +		if (page)
> +			put_page(page);
> +	}
> +
> +	if (!write && !walk_error)
> +		ret = copy_to_user(ubuff, buffer, count);
> +
> +	up_read(&mm->mmap_sem);
> +	kfree(priv.page_nodes);
> +out:
> +	kfree(buffer);
> +out_mmput:
> +	mmput(mm);
> +	if (!ret)
> +		ret = count;
> +	return ret;
> +
> +}
> +
> +ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
> +			    size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
> +}
> +
> +ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
> +			     size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
> +}
> +
>  static int __init page_idle_init(void)
>  {
>  	int err;
> -- 
> 2.22.0.709.g102302147b-goog


* Re: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
  2019-07-31  8:53 ` Minchan Kim
@ 2019-07-31 17:19   ` Joel Fernandes
  2019-08-05  7:55     ` Minchan Kim
  0 siblings, 1 reply; 10+ messages in thread
From: Joel Fernandes @ 2019-07-31 17:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
	Christian Hansen, dancol, fmayer, joaodias, Jonathan Corbet,
	Kees Cook, kernel-team, linux-api, linux-doc, linux-fsdevel,
	linux-mm, Michal Hocko, Mike Rapoport, namhyung, Roman Gushchin,
	Stephen Rothwell, surenb, tkjos, Vladimir Davydov,
	Vlastimil Babka, wvw

On Wed, Jul 31, 2019 at 05:53:35PM +0900, Minchan Kim wrote:
> Hi Joel,
> 
> On Fri, Jul 26, 2019 at 11:23:18AM -0400, Joel Fernandes (Google) wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > Looking up PFN from pagemap in Android devices is not supported by
> > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> > 
[snip]
> > index 77eb628ecc7f..a58dd74606e9 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
> >  	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
> >  	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
> >  	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
> > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > +	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
> > +#endif
> >  #endif
> >  #ifdef CONFIG_SECURITY
> >  	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index cd0c8d5ce9a1..bc9371880c63 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
> >  extern const struct file_operations proc_pid_smaps_rollup_operations;
> >  extern const struct file_operations proc_clear_refs_operations;
> >  extern const struct file_operations proc_pagemap_operations;
> > +extern const struct file_operations proc_page_idle_operations;
> >  
> >  extern unsigned long task_vsize(struct mm_struct *);
> >  extern unsigned long task_statm(struct mm_struct *,
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 4d2b860dbc3f..11ccc53da38e 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
> >  	.open		= pagemap_open,
> >  	.release	= pagemap_release,
> >  };
> > +
> > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > +static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
> > +				   size_t count, loff_t *ppos)
> > +{
> > +	int ret;
> > +	struct task_struct *tsk = get_proc_task(file_inode(file));
> > +
> > +	if (!tsk)
> > +		return -EINVAL;
> > +	ret = page_idle_proc_read(file, buf, count, ppos, tsk);
> > +	put_task_struct(tsk);
> 
> Why do you need task_struct here? You already got the task in open
> and got the mm there, so you could pass the mm here instead of the task.

Good point, will just use mm.

[snip]
> > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > index 1e894d34bdce..f1bc2640d85e 100644
> > --- a/include/linux/page_idle.h
> > +++ b/include/linux/page_idle.h
> > @@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
> >  }
> >  #endif /* CONFIG_64BIT */
> >  
> > +ssize_t page_idle_proc_write(struct file *file,
> > +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> > +ssize_t page_idle_proc_read(struct file *file,
> > +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> >  #else /* !CONFIG_IDLE_PAGE_TRACKING */
> >  
> >  static inline bool page_is_young(struct page *page)
> > diff --git a/mm/page_idle.c b/mm/page_idle.c
> > index 295512465065..86244f7f1faa 100644
> > --- a/mm/page_idle.c
> > +++ b/mm/page_idle.c
> > @@ -5,12 +5,15 @@
> >  #include <linux/sysfs.h>
> >  #include <linux/kobject.h>
> >  #include <linux/mm.h>
> > -#include <linux/mmzone.h>
> > -#include <linux/pagemap.h>
> > -#include <linux/rmap.h>
> >  #include <linux/mmu_notifier.h>
> > +#include <linux/mmzone.h>
> >  #include <linux/page_ext.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/rmap.h>
> > +#include <linux/sched/mm.h>
> > +#include <linux/swap.h>
> > +#include <linux/swapops.h>
> >  
> >  #define BITMAP_CHUNK_SIZE	sizeof(u64)
> >  #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> > @@ -25,18 +28,13 @@
> >   * page tracking. With such an indicator of user pages we can skip isolated
> >   * pages, but since there are not usually many of them, it will hardly affect
> >   * the overall result.
> > - *
> > - * This function tries to get a user memory page by pfn as described above.
> >   */
> > -static struct page *page_idle_get_page(unsigned long pfn)
> > +static struct page *page_idle_get_page(struct page *page_in)
> 
> The function name looks weird after you changed the argument.
> Maybe "bool check_valid_page(struct page *page)"?


I don't think so; this function also does a get_page_unless_zero() on the page.

> >  {
> >  	struct page *page;
> >  	pg_data_t *pgdat;
> >  
> > -	if (!pfn_valid(pfn))
> > -		return NULL;
> > -
> > -	page = pfn_to_page(pfn);
> > +	page = page_in;
> >  	if (!page || !PageLRU(page) ||
> >  	    !get_page_unless_zero(page))
> >  		return NULL;
> > @@ -51,6 +49,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
> >  	return page;
> >  }
> >  
> > +/*
> > + * This function tries to get a user memory page by pfn as described above.
> > + */
> > +static struct page *page_idle_get_page_pfn(unsigned long pfn)
> 
> So we could use page_idle_get_page name here.


Based on the above comment, I prefer to keep the same name. Do you agree?


> > +	return page_idle_get_page(pfn_to_page(pfn));
> > +}
> > +
> >  static bool page_idle_clear_pte_refs_one(struct page *page,
> >  					struct vm_area_struct *vma,
> >  					unsigned long addr, void *arg)
> > @@ -118,6 +128,47 @@ static void page_idle_clear_pte_refs(struct page *page)
> >  		unlock_page(page);
> >  }
> >  
> > +/* Helper to get the start and end frame given a pos and count */
> > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > +				unsigned long *start, unsigned long *end)
> > +{
> > +	unsigned long max_frame;
> > +
> > +	/* If an mm is not given, assume we want physical frames */
> > +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > +
> > +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > +		return -EINVAL;
> > +
> > +	*start = pos * BITS_PER_BYTE;
> > +	if (*start >= max_frame)
> > +		return -ENXIO;
> > +
> > +	*end = *start + count * BITS_PER_BYTE;
> > +	if (*end > max_frame)
> > +		*end = max_frame;
> > +	return 0;
> > +}
> > +
> > +static bool page_really_idle(struct page *page)
> 
> Just minor:
> Instead of creating a new API, could we combine it with page_is_idle by
> introducing a further argument, pte_check?


I cannot see where in the code pte_check would be false when this is called.
I could rename the function to page_idle_check_ptes() if that's OK with you.

[snip]
> +
> > +static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
> > +				    unsigned long end,
> > +				    struct mm_walk *walk)
> > +{
> > +	struct vm_area_struct *vma = walk->vma;
> > +	pte_t *pte;
> > +	spinlock_t *ptl;
> > +	struct page *page;
> > +
> > +	ptl = pmd_trans_huge_lock(pmd, vma);
> > +	if (ptl) {
> > +		if (pmd_present(*pmd)) {
> > +			page = follow_trans_huge_pmd(vma, addr, pmd,
> > +						     FOLL_DUMP|FOLL_WRITE);
> > +			if (!IS_ERR_OR_NULL(page))
> > +				add_page_idle_list(page, addr, walk);
> > +		}
> > +		spin_unlock(ptl);
> > +		return 0;
> > +	}
> > +
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +
> > +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> > +		/*
> > +		 * We add swapped pages to the idle_page_list so that we can
> > +		 * report to userspace that they are idle.
> > +		 */
> > +		if (is_swap_pte(*pte)) {
> 
> I suggested "let's consider every swapped-out page as IDLE", but
> let's think about this case:
> 
> 1. mark the heap of the process as IDLE
> 2. the process touches its working set
> 3. the process's heap pages are swapped out by a memory spike or madvise
> 4. the heap profiler inspects the process's IDLE pages and is surprised
> that all of the heap is idle.
> 
> It's a good scenario for other purposes, because the non-idle pages (IOW,
> the working set) could be readahead when the app restarts.
> 
> Maybe squeeze the idle bit into the swap pte to check it.


Ok, I will look more into this. Konstantin had similar ideas here too.


> > +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> > +			       size_t count, loff_t *pos,
> > +			       struct task_struct *tsk, int write)
> > +{
> > +	int ret;
> > +	char *buffer;
> > +	u64 *out;
> > +	unsigned long start_addr, end_addr, start_frame, end_frame;
> > +	struct mm_struct *mm = file->private_data;
> > +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> > +	struct page_node *cur;
> > +	struct page_idle_proc_priv priv;
> > +	bool walk_error = false;
> > +	LIST_HEAD(idle_page_list);
> > +
> > +	if (!mm || !mmget_not_zero(mm))
> > +		return -EINVAL;
> > +
> > +	if (count > PAGE_SIZE)
> > +		count = PAGE_SIZE;
> > +
> > +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > +	if (!buffer) {
> > +		ret = -ENOMEM;
> > +		goto out_mmput;
> > +	}
> > +	out = (u64 *)buffer;
> > +
> > +	if (write && copy_from_user(buffer, ubuff, count)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> > +	if (ret)
> > +		goto out;
> > +
> > +	start_addr = (start_frame << PAGE_SHIFT);
> > +	end_addr = (end_frame << PAGE_SHIFT);
> > +	priv.buffer = buffer;
> > +	priv.start_addr = start_addr;
> > +	priv.write = write;
> > +
> > +	priv.idle_page_list = &idle_page_list;
> > +	priv.cur_page_node = 0;
> > +	priv.page_nodes = kzalloc(sizeof(struct page_node) *
> > +				  (end_frame - start_frame), GFP_KERNEL);
> > +	if (!priv.page_nodes) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	walk.private = &priv;
> > +	walk.mm = mm;
> > +
> > +	down_read(&mm->mmap_sem);
> > +
> > +	/*
> > +	 * idle_page_list is needed because walk_page_vma() holds ptlock which
> > +	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
> > +	 * pages first, and then call page_idle_clear_pte_refs().
> > +	 */
> 
> Thanks for the comment, I was curious why you want to have
> idle_page_list and the reason is here.
> 
> How about making this /proc/<pid>/page_idle per-process granularity,
> unlike the system-level /sys/xxx/page_idle? What I meant is not to check
> the rmap for references from random processes, but to check only
> accesses from the target process. It would be more proper as a /proc/
> <pid>/ interface, and good for per-process tracking as well as
> fast.


I prefer not to do this for the following reasons:
(1) It makes a feature lost, now accesses to shared pages will not be
accounted properly. 

(2) It makes it inconsistent with other idle page tracking mechanism. I
prefer if post per-process. At the heart of it, the tracking is always at the
physical page level -- I feel that is how it should be. Other drawback, is
also we have to document this subtlety.

Another reason is that performance is already pretty good with this
rmap-based mechanism: I did idle tracking on a 512MB range in about 15ms for
a read and 15ms for a write.

In the future if it is an issue, we can consider it.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
  2019-07-31 17:19   ` Joel Fernandes
@ 2019-08-05  7:55     ` Minchan Kim
  2019-08-05 13:10       ` Joel Fernandes
  0 siblings, 1 reply; 10+ messages in thread
From: Minchan Kim @ 2019-08-05  7:55 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
	Christian Hansen, dancol, fmayer, joaodias, Jonathan Corbet,
	Kees Cook, kernel-team, linux-api, linux-doc, linux-fsdevel,
	linux-mm, Michal Hocko, Mike Rapoport, namhyung, Roman Gushchin,
	Stephen Rothwell, surenb, tkjos, Vladimir Davydov,
	Vlastimil Babka, wvw

Hi Joel,

On Wed, Jul 31, 2019 at 01:19:37PM -0400, Joel Fernandes wrote:
> > > -static struct page *page_idle_get_page(unsigned long pfn)
> > > +static struct page *page_idle_get_page(struct page *page_in)
> > 
> > Looks weird function name after you changed the argument.
> > Maybe "bool check_valid_page(struct page *page)"?
> 
> 
> I don't think so, this function does a get_page_unless_zero() on the page as well.
> 
> > >  {
> > >  	struct page *page;
> > >  	pg_data_t *pgdat;
> > >  
> > > -	if (!pfn_valid(pfn))
> > > -		return NULL;
> > > -
> > > -	page = pfn_to_page(pfn);
> > > +	page = page_in;
> > >  	if (!page || !PageLRU(page) ||
> > >  	    !get_page_unless_zero(page))
> > >  		return NULL;
> > > @@ -51,6 +49,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
> > >  	return page;
> > >  }
> > >  
> > > +/*
> > > + * This function tries to get a user memory page by pfn as described above.
> > > + */
> > > +static struct page *page_idle_get_page_pfn(unsigned long pfn)
> > 
> > So we could use page_idle_get_page name here.
> 
> 
> Based on the above comment, I prefer to keep the same name. Do you agree?

Yes, I agree. Just please add a comment about refcount in the description
on page_idle_get_page.

> 
> 
> > > +	return page_idle_get_page(pfn_to_page(pfn));
> > > +}
> > > +
> > >  static bool page_idle_clear_pte_refs_one(struct page *page,
> > >  					struct vm_area_struct *vma,
> > >  					unsigned long addr, void *arg)
> > > @@ -118,6 +128,47 @@ static void page_idle_clear_pte_refs(struct page *page)
> > >  		unlock_page(page);
> > >  }
> > >  
> > > +/* Helper to get the start and end frame given a pos and count */
> > > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > > +				unsigned long *start, unsigned long *end)
> > > +{
> > > +	unsigned long max_frame;
> > > +
> > > +	/* If an mm is not given, assume we want physical frames */
> > > +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > > +
> > > +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > > +		return -EINVAL;
> > > +
> > > +	*start = pos * BITS_PER_BYTE;
> > > +	if (*start >= max_frame)
> > > +		return -ENXIO;
> > > +
> > > +	*end = *start + count * BITS_PER_BYTE;
> > > +	if (*end > max_frame)
> > > +		*end = max_frame;
> > > +	return 0;
> > > +}
> > > +
> > > +static bool page_really_idle(struct page *page)
> > 
> > Just minor:
> > Instead of creating a new API, could we combine it with page_is_idle by
> > introducing a further argument, pte_check?
> 
> 
> I cannot see where in the code pte_check would be false when this is called.
> I could rename the function to page_idle_check_ptes() if that's OK with you.

What I don't like is the _*really*_ part of the function name.

I see several page_is_idle calls in huge_memory.c, migration.c, swap.c.
They could check only the page flag, so they could pass "false" for pte_check.

< snip >
 
> > > +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> > > +			       size_t count, loff_t *pos,
> > > +			       struct task_struct *tsk, int write)
> > > +{
> > > +	int ret;
> > > +	char *buffer;
> > > +	u64 *out;
> > > +	unsigned long start_addr, end_addr, start_frame, end_frame;
> > > +	struct mm_struct *mm = file->private_data;
> > > +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> > > +	struct page_node *cur;
> > > +	struct page_idle_proc_priv priv;
> > > +	bool walk_error = false;
> > > +	LIST_HEAD(idle_page_list);
> > > +
> > > +	if (!mm || !mmget_not_zero(mm))
> > > +		return -EINVAL;
> > > +
> > > +	if (count > PAGE_SIZE)
> > > +		count = PAGE_SIZE;
> > > +
> > > +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > > +	if (!buffer) {
> > > +		ret = -ENOMEM;
> > > +		goto out_mmput;
> > > +	}
> > > +	out = (u64 *)buffer;
> > > +
> > > +	if (write && copy_from_user(buffer, ubuff, count)) {
> > > +		ret = -EFAULT;
> > > +		goto out;
> > > +	}
> > > +
> > > +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> > > +	if (ret)
> > > +		goto out;
> > > +
> > > +	start_addr = (start_frame << PAGE_SHIFT);
> > > +	end_addr = (end_frame << PAGE_SHIFT);
> > > +	priv.buffer = buffer;
> > > +	priv.start_addr = start_addr;
> > > +	priv.write = write;
> > > +
> > > +	priv.idle_page_list = &idle_page_list;
> > > +	priv.cur_page_node = 0;
> > > +	priv.page_nodes = kzalloc(sizeof(struct page_node) *
> > > +				  (end_frame - start_frame), GFP_KERNEL);
> > > +	if (!priv.page_nodes) {
> > > +		ret = -ENOMEM;
> > > +		goto out;
> > > +	}
> > > +
> > > +	walk.private = &priv;
> > > +	walk.mm = mm;
> > > +
> > > +	down_read(&mm->mmap_sem);
> > > +
> > > +	/*
> > > +	 * idle_page_list is needed because walk_page_vma() holds ptlock which
> > > +	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
> > > +	 * pages first, and then call page_idle_clear_pte_refs().
> > > +	 */
> > 
> > Thanks for the comment, I was curious why you want to have
> > idle_page_list and the reason is here.
> > 
> > How about making this /proc/<pid>/page_idle per-process granularity,
> > unlike the system-level /sys/xxx/page_idle? What I meant is not to check
> > the rmap for references from random processes, but to check only
> > accesses from the target process. It would be more proper as a /proc/
> > <pid>/ interface, and good for per-process tracking as well as
> > fast.
> 
> 
> I prefer not to do this for the following reasons:
> (1) It makes a feature lost, now accesses to shared pages will not be
> accounted properly. 

Do you really want to check a global attribute through a per-process
interface?
That is already doable with the existing idle page tracking feature, and
that is one of the reasons page idle tracking was born (e.g., it even
covers non-mapped page cache), unlike clear_refs.
Once we create a new per-process interface, checking only process-granularity
access sounds more reasonable to me.

With that, we could catch the idle pages of only the target process, even
if the page was touched by several other processes.
If the user wants a global-level view of accesses, they could use the
existing interface (if there is a concern, e.g. security, with using the
existing idle page tracking, let's discuss how to make the existing
feature more useful as a separate topic).

IOW, my point is that we already have a global access check (1. from the
PTEs of several processes, 2. from the page flag for non-mapped pages) in
the existing idle page tracking interface, and now that we are about to
create a new per-process interface, I wanted it to provide a feature that
the existing interface cannot cover.

> 
> (2) It makes it inconsistent with other idle page tracking mechanism. I

That's my point: create a different kind of idle page tracking that we
couldn't do with the existing interface.

> prefer if post per-process. At the heart of it, the tracking is always at the

What does "post per-process" mean?

> physical page level -- I feel that is how it should be. Other drawback, is
> also we have to document this subtlety.

Sorry, could you elaborate a bit?


* Re: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
  2019-08-05  7:55     ` Minchan Kim
@ 2019-08-05 13:10       ` Joel Fernandes
  0 siblings, 0 replies; 10+ messages in thread
From: Joel Fernandes @ 2019-08-05 13:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
	Christian Hansen, dancol, fmayer, joaodias, Jonathan Corbet,
	Kees Cook, kernel-team, linux-api, linux-doc, linux-fsdevel,
	linux-mm, Michal Hocko, Mike Rapoport, namhyung, Roman Gushchin,
	Stephen Rothwell, surenb, tkjos, Vladimir Davydov,
	Vlastimil Babka, wvw

On Mon, Aug 05, 2019 at 04:55:47PM +0900, Minchan Kim wrote:
> Hi Joel,

Hi Minchan,

> On Wed, Jul 31, 2019 at 01:19:37PM -0400, Joel Fernandes wrote:
> > > > -static struct page *page_idle_get_page(unsigned long pfn)
> > > > +static struct page *page_idle_get_page(struct page *page_in)
> > > 
> > > Looks weird function name after you changed the argument.
> > > Maybe "bool check_valid_page(struct page *page)"?
> > 
> > 
> > I don't think so, this function does a get_page_unless_zero() on the page as well.
> > 
> > > >  {
> > > >  	struct page *page;
> > > >  	pg_data_t *pgdat;
> > > >  
> > > > -	if (!pfn_valid(pfn))
> > > > -		return NULL;
> > > > -
> > > > -	page = pfn_to_page(pfn);
> > > > +	page = page_in;
> > > >  	if (!page || !PageLRU(page) ||
> > > >  	    !get_page_unless_zero(page))
> > > >  		return NULL;
> > > > @@ -51,6 +49,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
> > > >  	return page;
> > > >  }
> > > >  
> > > > +/*
> > > > + * This function tries to get a user memory page by pfn as described above.
> > > > + */
> > > > +static struct page *page_idle_get_page_pfn(unsigned long pfn)
> > > 
> > > So we could use page_idle_get_page name here.
> > 
> > 
> > Based on the above comment, I prefer to keep the same name. Do you agree?
> 
> Yes, I agree. Just please add a comment about refcount in the description
> on page_idle_get_page.

Ok.


> > > > +	return page_idle_get_page(pfn_to_page(pfn));
> > > > +}
> > > > +
> > > >  static bool page_idle_clear_pte_refs_one(struct page *page,
> > > >  					struct vm_area_struct *vma,
> > > >  					unsigned long addr, void *arg)
> > > > @@ -118,6 +128,47 @@ static void page_idle_clear_pte_refs(struct page *page)
> > > >  		unlock_page(page);
> > > >  }
> > > >  
> > > > +/* Helper to get the start and end frame given a pos and count */
> > > > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > > > +				unsigned long *start, unsigned long *end)
> > > > +{
> > > > +	unsigned long max_frame;
> > > > +
> > > > +	/* If an mm is not given, assume we want physical frames */
> > > > +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > > > +
> > > > +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > > > +		return -EINVAL;
> > > > +
> > > > +	*start = pos * BITS_PER_BYTE;
> > > > +	if (*start >= max_frame)
> > > > +		return -ENXIO;
> > > > +
> > > > +	*end = *start + count * BITS_PER_BYTE;
> > > > +	if (*end > max_frame)
> > > > +		*end = max_frame;
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static bool page_really_idle(struct page *page)
> > > 
> > > Just minor:
> > > Instead of creating a new API, could we combine it with page_is_idle by
> > > introducing a further argument, pte_check?
> > 
> > 
> > I cannot see where in the code pte_check would be false when this is called.
> > I could rename the function to page_idle_check_ptes() if that's OK with you.
> 
> What I don't like is the _*really*_ part of the function name.
> 
> I see several page_is_idle calls in huge_memory.c, migration.c, swap.c.
> They could check only the page flag, so they could pass "false" for pte_check.

I will rename it to page_idle_check_ptes(). If you want a pte_check
argument, that can come in a later patch if/when there are other users for
it in other files. Hope that's reasonable.


> > > > +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> > > > +			       size_t count, loff_t *pos,
> > > > +			       struct task_struct *tsk, int write)
> > > > +{
> > > > +	int ret;
> > > > +	char *buffer;
> > > > +	u64 *out;
> > > > +	unsigned long start_addr, end_addr, start_frame, end_frame;
> > > > +	struct mm_struct *mm = file->private_data;
> > > > +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> > > > +	struct page_node *cur;
> > > > +	struct page_idle_proc_priv priv;
> > > > +	bool walk_error = false;
> > > > +	LIST_HEAD(idle_page_list);
> > > > +
> > > > +	if (!mm || !mmget_not_zero(mm))
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (count > PAGE_SIZE)
> > > > +		count = PAGE_SIZE;
> > > > +
> > > > +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > > > +	if (!buffer) {
> > > > +		ret = -ENOMEM;
> > > > +		goto out_mmput;
> > > > +	}
> > > > +	out = (u64 *)buffer;
> > > > +
> > > > +	if (write && copy_from_user(buffer, ubuff, count)) {
> > > > +		ret = -EFAULT;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> > > > +	if (ret)
> > > > +		goto out;
> > > > +
> > > > +	start_addr = (start_frame << PAGE_SHIFT);
> > > > +	end_addr = (end_frame << PAGE_SHIFT);
> > > > +	priv.buffer = buffer;
> > > > +	priv.start_addr = start_addr;
> > > > +	priv.write = write;
> > > > +
> > > > +	priv.idle_page_list = &idle_page_list;
> > > > +	priv.cur_page_node = 0;
> > > > +	priv.page_nodes = kzalloc(sizeof(struct page_node) *
> > > > +				  (end_frame - start_frame), GFP_KERNEL);
> > > > +	if (!priv.page_nodes) {
> > > > +		ret = -ENOMEM;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	walk.private = &priv;
> > > > +	walk.mm = mm;
> > > > +
> > > > +	down_read(&mm->mmap_sem);
> > > > +
> > > > +	/*
> > > > +	 * idle_page_list is needed because walk_page_vma() holds ptlock which
> > > > +	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
> > > > +	 * pages first, and then call page_idle_clear_pte_refs().
> > > > +	 */
> > > 
> > > Thanks for the comment, I was curious why you want to have
> > > idle_page_list and the reason is here.
> > > 
> > > How about making this /proc/<pid>/page_idle per-process granularity,
> > > unlike the system-level /sys/xxx/page_idle? What I meant is not to check
> > > the rmap for references from random processes, but to check only
> > > accesses from the target process. It would be more proper as a /proc/
> > > <pid>/ interface, and good for per-process tracking as well as
> > > fast.
> > 
> > 
> > I prefer not to do this for the following reasons:
> > (1) It makes a feature lost, now accesses to shared pages will not be
> > accounted properly. 
> 
> Do you really want to check a global attribute through a per-process
> interface?

Pages are inherently not per-process; they are global. A page does not
necessarily belong to a single process, and even an anonymous page can be
shared. At the end of the day, we are operating on pages.

I think you are confusing the per-process file interface with the core
mechanism. The core mechanism always operates on physical PAGES.


> That is already doable with the existing idle page tracking feature, and
> that is one of the reasons page idle tracking was born (e.g., it even
> covers non-mapped page cache), unlike clear_refs.

I think you are misunderstanding the patch: it does not try to change the
core mechanism; that is out of scope. At its core, page idle tracking looks
at the PTEs of all processes. We are just using the VFN (virtual frame
number) indexing to skip the separate pagemap lookup -- that's it.


> Once we create a new per-process interface, checking only process-
> granularity access sounds more reasonable to me.

It sounds reasonable, but there is no reason not to do the full and proper
page tracking for now, including shared pages. Otherwise this interface
becomes inconsistent with the existing mechanism and can confuse users
about what to expect (especially for shared pages).


> With that, we could catch the idle pages of only the target process, even
> if the page was touched by several other processes.
> If the user wants a global-level view of accesses, they could use the
> existing interface (if there is a concern, e.g. security, with using the
> existing idle page tracking, let's discuss how to make the existing
> feature more useful as a separate topic).
> 
> IOW, my point is that we already have a global access check (1. from the
> PTEs of several processes, 2. from the page flag for non-mapped pages) in
> the existing idle page tracking interface, and now that we are about to
> create a new per-process interface, I wanted it to provide a feature that
> the existing interface cannot cover.

Yes, it sounds like you want to create a different feature. That can be a
separate follow-up patch; it is out of scope for this one.


> > (2) It makes it inconsistent with other idle page tracking mechanism. I
> 
> That's my point: create a different kind of idle page tracking that we
> couldn't do with the existing interface.

Yes, sure. But that can be a different patch, and we can weigh its benefits
at that time. I don't want to introduce a new page tracking mechanism; I am
just trying to reuse the existing one.


> > prefer if post per-process. At the heart of it, the tracking is always at the
> 
> What does "post per-process" mean?

Sorry, that was a typo; I meant "the core mechanism should not be a
per-process one, but a global one". We are just changing the interface in
this patch; we are not changing the existing core mechanism. That gives us
all the benefits of the existing code, such as non-interference with the
page reclaim code, without introducing any new bugs. By the way, I also
fixed a bug in the existing code!


> > physical page level -- I feel that is how it should be. Other drawback, is
> > also we have to document this subtlety.
> 
> Sorry, could you elaborate a bit?

I meant that with a new mechanism like the one you are proposing, we would
have to document that shared pages are no longer tracked properly. That is
a subtle difference and would have to be documented appropriately in the
'internals' section of the idle page tracking document.

thanks,

 - Joel



end of thread, other threads:[~2019-08-05 13:11 UTC | newest]

Thread overview: 10+ messages
-- links below jump to the message on this page --
2019-07-26 15:23 [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes (Google)
2019-07-26 15:23 ` [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
2019-07-26 20:17   ` sspatil
2019-07-26 20:26     ` Joel Fernandes
2019-07-31  6:44   ` Mike Rapoport
2019-07-30 13:06 ` [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes
2019-07-31  8:53 ` Minchan Kim
2019-07-31 17:19   ` Joel Fernandes
2019-08-05  7:55     ` Minchan Kim
2019-08-05 13:10       ` Joel Fernandes
