linux-kernel.vger.kernel.org archive mirror
* [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
@ 2019-07-22 21:32 Joel Fernandes (Google)
  2019-07-22 21:32 ` [PATCH v1 2/2] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2019-07-22 21:32 UTC
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	vdavydov.dev, Brendan Gregg, kernel-team, Alexey Dobriyan,
	Al Viro, Andrew Morton, carmenjackson, Christian Hansen,
	Colin Ian King, dancol, David Howells, fmayer, joaodias, joelaf,
	Jonathan Corbet, Kees Cook, Kirill Tkhai, Konstantin Khlebnikov,
	linux-doc, linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport,
	minchan, minchan, namhyung, sspatil, surenb, Thomas Gleixner,
	timmurray, tkjos, Vlastimil Babka, wvw

The page_idle tracking feature currently requires looking up the pagemap
for a process and then interacting with /sys/kernel/mm/page_idle.
This is cumbersome and error-prone: if the page changes between
accessing the per-PID pagemap and the global page_idle bitmap, the
information is no longer accurate. Moreover, looking up the PFN from
pagemap on Android devices is not supported for unprivileged processes:
it requires SYS_ADMIN, and the PFN is otherwise reported as 0.

This patch adds support to directly interact with page_idle tracking at
the PID level by introducing a /proc/<pid>/page_idle file. This
eliminates the need for userspace to calculate the mapping of the page.
It follows the exact same semantics as the global
/sys/kernel/mm/page_idle, but it is easier to use for use cases
where looking up the PFN is not needed, and it does not require SYS_ADMIN.
It ended up simplifying userspace code, solves the security issue
mentioned above, and works quite well. SELinux does not need to be turned
off since no pagemap lookup is needed.

In Android, we are using this for the heap profiler (heapprofd), which
profiles and pinpoints code paths that allocate memory and leave it
idle for long periods of time.

Documentation material:
The idle page tracking API for virtual address indexing using virtual page
frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
except that it uses virtual instead of physical frame numbers.

This idle page tracking API can be simpler to use than physical address
indexing, since the pagemap for a process does not need to be looked up
to mark or read a page's idle bit. It is also more accurate than
physical address indexing since in physical address indexing, address
space changes can occur between reading the pagemap and reading the
bitmap. In virtual address indexing, the process's mmap_sem is held for
the duration of the access.
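
As a rough illustration of the virtual indexing described above, here is
a minimal userspace sketch in C. It assumes the layout matches
/sys/kernel/mm/page_idle/bitmap: an array of 8-byte chunks, each covering
64 pages, with the page at frame i in bit i % 64 of chunk i / 64. The
helper names and the page_shift parameter are hypothetical and not part
of this patch.

```c
#include <stdint.h>

/* Each bitmap chunk is one 8-byte integer covering 64 page frames. */
#define CHUNK_BYTES 8u
#define CHUNK_BITS  (CHUNK_BYTES * 8u)

/*
 * Byte offset, within the page_idle file, of the chunk that holds the
 * idle bit for the page containing vaddr. Hypothetical helper.
 */
static inline uint64_t idle_chunk_offset(uint64_t vaddr, unsigned int page_shift)
{
	uint64_t vfn = vaddr >> page_shift;	/* virtual frame number */

	return (vfn / CHUNK_BITS) * CHUNK_BYTES;
}

/* Bit position of that page inside its 64-bit chunk. Hypothetical helper. */
static inline unsigned int idle_chunk_bit(uint64_t vaddr, unsigned int page_shift)
{
	return (unsigned int)((vaddr >> page_shift) % CHUNK_BITS);
}
```

A caller would seek to idle_chunk_offset() and read or write whole 8-byte
chunks there, since offsets and lengths that are not multiples of 8 are
rejected with -EINVAL by this patch.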

Cc: vdavydov.dev@gmail.com
Cc: Brendan Gregg <bgregg@netflix.com>
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---
Internal review -> v1:
Fixes from Suren.
Corrections to change log, docs (Florian, Sandeep)

 fs/proc/base.c            |   3 +
 fs/proc/internal.h        |   1 +
 fs/proc/task_mmu.c        |  57 +++++++
 include/linux/page_idle.h |   4 +
 mm/page_idle.c            | 305 +++++++++++++++++++++++++++++++++-----
 5 files changed, 330 insertions(+), 40 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 77eb628ecc7f..a58dd74606e9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
+#endif
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cd0c8d5ce9a1..bc9371880c63 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_page_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4d2b860dbc3f..11ccc53da38e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
 	.open		= pagemap_open,
 	.release	= pagemap_release,
 };
+
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	int ret;
+	struct task_struct *tsk = get_proc_task(file_inode(file));
+
+	if (!tsk)
+		return -EINVAL;
+	ret = page_idle_proc_read(file, buf, count, ppos, tsk);
+	put_task_struct(tsk);
+	return ret;
+}
+
+static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	int ret;
+	struct task_struct *tsk = get_proc_task(file_inode(file));
+
+	if (!tsk)
+		return -EINVAL;
+	ret = page_idle_proc_write(file, (char __user *)buf, count, ppos, tsk);
+	put_task_struct(tsk);
+	return ret;
+}
+
+static int proc_page_idle_open(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm;
+
+	mm = proc_mem_open(inode, PTRACE_MODE_READ);
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+	file->private_data = mm;
+	return 0;
+}
+
+static int proc_page_idle_release(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm = file->private_data;
+
+	if (mm)
+		mmdrop(mm);
+	return 0;
+}
+
+const struct file_operations proc_page_idle_operations = {
+	.llseek		= mem_lseek, /* borrow this */
+	.read		= proc_page_idle_read,
+	.write		= proc_page_idle_write,
+	.open		= proc_page_idle_open,
+	.release	= proc_page_idle_release,
+};
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f1bc2640d85e 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
 }
 #endif /* CONFIG_64BIT */
 
+ssize_t page_idle_proc_write(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
+ssize_t page_idle_proc_read(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
 #else /* !CONFIG_IDLE_PAGE_TRACKING */
 
 static inline bool page_is_young(struct page *page)
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..874a60c41fef 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -11,6 +11,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/page_ext.h>
 #include <linux/page_idle.h>
+#include <linux/sched/mm.h>
 
 #define BITMAP_CHUNK_SIZE	sizeof(u64)
 #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
@@ -28,15 +29,12 @@
  *
  * This function tries to get a user memory page by pfn as described above.
  */
-static struct page *page_idle_get_page(unsigned long pfn)
+static struct page *page_idle_get_page(struct page *page_in)
 {
 	struct page *page;
 	pg_data_t *pgdat;
 
-	if (!pfn_valid(pfn))
-		return NULL;
-
-	page = pfn_to_page(pfn);
+	page = page_in;
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
@@ -51,6 +49,15 @@ static struct page *page_idle_get_page(unsigned long pfn)
 	return page;
 }
 
+static struct page *page_idle_get_page_pfn(unsigned long pfn)
+{
+
+	if (!pfn_valid(pfn))
+		return NULL;
+
+	return page_idle_get_page(pfn_to_page(pfn));
+}
+
 static bool page_idle_clear_pte_refs_one(struct page *page,
 					struct vm_area_struct *vma,
 					unsigned long addr, void *arg)
@@ -118,6 +125,47 @@ static void page_idle_clear_pte_refs(struct page *page)
 		unlock_page(page);
 }
 
+/* Helper to get the start and end frame given a pos and count */
+static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long max_frame;
+
+	/* If an mm is not given, assume we want physical frames */
+	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
+
+	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
+		return -EINVAL;
+
+	*start = pos * BITS_PER_BYTE;
+	if (*start >= max_frame)
+		return -ENXIO;
+
+	*end = *start + count * BITS_PER_BYTE;
+	if (*end > max_frame)
+		*end = max_frame;
+	return 0;
+}
+
+static bool page_really_idle(struct page *page)
+{
+	if (!page)
+		return false;
+
+	if (page_is_idle(page)) {
+		/*
+		 * The page might have been referenced via a
+		 * pte, in which case it is not idle. Clear
+		 * refs and recheck.
+		 */
+		page_idle_clear_pte_refs(page);
+		if (page_is_idle(page))
+			return true;
+	}
+
+	return false;
+}
+
 static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 				     struct bin_attribute *attr, char *buf,
 				     loff_t pos, size_t count)
@@ -125,35 +173,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 	u64 *out = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
-
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return 0;
+	int bit, ret;
 
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret == -ENXIO)
+		return 0;  /* Reads beyond max_pfn do nothing */
+	else if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if (!bit)
 			*out = 0ULL;
-		page = page_idle_get_page(pfn);
-		if (page) {
-			if (page_is_idle(page)) {
-				/*
-				 * The page might have been referenced via a
-				 * pte, in which case it is not idle. Clear
-				 * refs and recheck.
-				 */
-				page_idle_clear_pte_refs(page);
-				if (page_is_idle(page))
-					*out |= 1ULL << bit;
-			}
+		page = page_idle_get_page_pfn(pfn);
+		if (page && page_really_idle(page)) {
+			*out |= 1ULL << bit;
 			put_page(page);
 		}
 		if (bit == BITMAP_CHUNK_BITS - 1)
@@ -170,23 +204,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
 	const u64 *in = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
+	int bit, ret;
 
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return -ENXIO;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if ((*in >> bit) & 1) {
-			page = page_idle_get_page(pfn);
+			page = page_idle_get_page_pfn(pfn);
 			if (page) {
 				page_idle_clear_pte_refs(page);
 				set_page_idle(page);
@@ -224,10 +251,208 @@ struct page_ext_operations page_idle_ops = {
 };
 #endif
 
+/*  page_idle tracking for /proc/<pid>/page_idle */
+
+static DEFINE_SPINLOCK(idle_page_list_lock);
+struct list_head idle_page_list;
+
+struct page_node {
+	struct page *page;
+	unsigned long addr;
+	struct list_head list;
+};
+
+struct page_idle_proc_priv {
+	unsigned long start_addr;
+	char *buffer;
+	int write;
+};
+
+static void add_page_idle_list(struct page *page,
+			       unsigned long addr, struct mm_walk *walk)
+{
+	struct page *page_get;
+	struct page_node *pn;
+	int bit;
+	unsigned long frames;
+	struct page_idle_proc_priv *priv = walk->private;
+	u64 *chunk = (u64 *)priv->buffer;
+
+	if (priv->write) {
+		/* Find whether this page was asked to be marked */
+		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
+		bit = frames % BITMAP_CHUNK_BITS;
+		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
+		if (((*chunk >> bit) & 1) == 0)
+			return;
+	}
+
+	page_get = page_idle_get_page(page);
+	if (!page_get)
+		return;
+
+	pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
+	if (!pn)
+		return;
+
+	pn->page = page_get;
+	pn->addr = addr;
+	list_add(&pn->list, &idle_page_list);
+}
+
+static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	pte_t *pte;
+	spinlock_t *ptl;
+	struct page *page;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (pmd_present(*pmd)) {
+			page = follow_trans_huge_pmd(vma, addr, pmd,
+						     FOLL_DUMP|FOLL_WRITE);
+			if (!IS_ERR_OR_NULL(page))
+				add_page_idle_list(page, addr, walk);
+		}
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		if (!pte_present(*pte))
+			continue;
+
+		page = vm_normal_page(vma, addr, *pte);
+		if (page)
+			add_page_idle_list(page, addr, walk);
+	}
+
+	pte_unmap_unlock(pte - 1, ptl);
+	return 0;
+}
+
+ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
+			       size_t count, loff_t *pos,
+			       struct task_struct *tsk, int write)
+{
+	int ret;
+	char *buffer;
+	u64 *out;
+	unsigned long start_addr, end_addr, start_frame, end_frame;
+	struct mm_struct *mm = file->private_data;
+	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
+	struct page_node *cur, *next;
+	struct page_idle_proc_priv priv;
+	bool walk_error = false;
+
+	if (!mm || !mmget_not_zero(mm))
+		return -EINVAL;
+
+	if (count > PAGE_SIZE)
+		count = PAGE_SIZE;
+
+	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buffer) {
+		ret = -ENOMEM;
+		goto out_mmput;
+	}
+	out = (u64 *)buffer;
+
+	if (write && copy_from_user(buffer, ubuff, count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
+	if (ret)
+		goto out;
+
+	start_addr = (start_frame << PAGE_SHIFT);
+	end_addr = (end_frame << PAGE_SHIFT);
+	priv.buffer = buffer;
+	priv.start_addr = start_addr;
+	priv.write = write;
+	walk.private = &priv;
+	walk.mm = mm;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * Protects the idle_page_list which is needed because
+	 * walk_page_vma() holds ptlock which deadlocks with
+	 * page_idle_clear_pte_refs(). So we have to collect all
+	 * pages first, and then call page_idle_clear_pte_refs().
+	 */
+	spin_lock(&idle_page_list_lock);
+	ret = walk_page_range(start_addr, end_addr, &walk);
+	if (ret)
+		walk_error = true;
+
+	list_for_each_entry_safe(cur, next, &idle_page_list, list) {
+		int bit, index;
+		unsigned long off;
+		struct page *page = cur->page;
+
+		if (unlikely(walk_error))
+			goto remove_page;
+
+		if (write) {
+			page_idle_clear_pte_refs(page);
+			set_page_idle(page);
+		} else {
+			if (page_really_idle(page)) {
+				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
+				bit = off % BITMAP_CHUNK_BITS;
+				index = off / BITMAP_CHUNK_BITS;
+				out[index] |= 1ULL << bit;
+			}
+		}
+remove_page:
+		put_page(page);
+		list_del(&cur->list);
+		kfree(cur);
+	}
+	spin_unlock(&idle_page_list_lock);
+
+	if (!write && !walk_error)
+		ret = copy_to_user(ubuff, buffer, count);
+
+	up_read(&mm->mmap_sem);
+out:
+	kfree(buffer);
+out_mmput:
+	mmput(mm);
+	if (!ret)
+		ret = count;
+	return ret;
+
+}
+
+ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
+			    size_t count, loff_t *pos, struct task_struct *tsk)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
+}
+
+ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
+			     size_t count, loff_t *pos, struct task_struct *tsk)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
+}
+
 static int __init page_idle_init(void)
 {
 	int err;
 
+	INIT_LIST_HEAD(&idle_page_list);
+
 	err = sysfs_create_group(mm_kobj, &page_idle_attr_group);
 	if (err) {
 		pr_err("page_idle: register sysfs failed\n");
-- 
2.22.0.657.g960e92d24f-goog


* [PATCH v1 2/2] doc: Update documentation for page_idle virtual address indexing
  2019-07-22 21:32 [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing Joel Fernandes (Google)
@ 2019-07-22 21:32 ` Joel Fernandes (Google)
  2019-07-22 22:06 ` [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing Andrew Morton
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2019-07-22 21:32 UTC
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Alexey Dobriyan, Al Viro, Andrew Morton, Brendan Gregg,
	carmenjackson, Christian Hansen, Colin Ian King, dancol,
	David Howells, fmayer, joaodias, joelaf, Jonathan Corbet,
	Kees Cook, kernel-team, Kirill Tkhai, Konstantin Khlebnikov,
	linux-doc, linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport,
	minchan, minchan, namhyung, sspatil, surenb, Thomas Gleixner,
	timmurray, tkjos, vdavydov.dev, Vlastimil Babka, wvw

This patch updates the documentation with the new page_idle tracking
feature which uses virtual address indexing.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/mm/idle_page_tracking.rst     | 41 +++++++++++++++----
 1 file changed, 34 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
index df9394fb39c2..70d3bf6f1f8c 100644
--- a/Documentation/admin-guide/mm/idle_page_tracking.rst
+++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
@@ -19,10 +19,14 @@ It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
 
 User API
 ========
+There are 2 ways to access the idle page tracking API. One uses physical
+address indexing, another uses a simpler virtual address indexing scheme.
 
-The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
-Currently, it consists of the only read-write file,
-``/sys/kernel/mm/page_idle/bitmap``.
+Physical address indexing
+-------------------------
+The idle page tracking API for physical address indexing using page frame
+numbers (PFN) is located at ``/sys/kernel/mm/page_idle``.  Currently, it
+consists of the only read-write file, ``/sys/kernel/mm/page_idle/bitmap``.
 
 The file implements a bitmap where each bit corresponds to a memory page. The
 bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
@@ -74,6 +78,29 @@ See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
 information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
 ``/proc/kpagecgroup``.
 
+Virtual address indexing
+------------------------
+The idle page tracking API for virtual address indexing using virtual page
+frame numbers (VFN) is located at ``/proc/<pid>/page_idle``. It is a bitmap
+that follows the same semantics as ``/sys/kernel/mm/page_idle/bitmap``
+except that it uses virtual instead of physical frame numbers.
+
+This idle page tracking API can be simpler to use than physical address
+indexing, since the ``pagemap`` for a process does not need to be looked up to
+mark or read a page's idle bit. It is also more accurate than physical address
+indexing since in physical address indexing, address space changes can occur
+between reading the ``pagemap`` and reading the ``bitmap``. In virtual address
+indexing, the process's ``mmap_sem`` is held for the duration of the access.
+
+To estimate the number of pages that are not used by a workload, one should:
+
+ 1. Mark all the workload's pages as idle by setting corresponding bits in
+    ``/proc/<pid>/page_idle``.
+
+ 2. Wait until the workload accesses its working set.
+
+ 3. Read ``/proc/<pid>/page_idle`` and count the number of bits set.
+
 .. _impl_details:
 
 Implementation Details
@@ -99,10 +126,10 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
 exceeding the dirty memory limit, it is not marked referenced.
 
 The idle memory tracking feature adds a new page flag, the Idle flag. This flag
-is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
-:ref:`User API <user_api>`
-section), and cleared automatically whenever a page is referenced as defined
-above.
+is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` for physical
+addressing or by writing to ``/proc/<pid>/page_idle`` for virtual
+addressing (see the :ref:`User API <user_api>` section), and cleared
+automatically whenever a page is referenced as defined above.
 
 When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
 mapped to, otherwise we will not be able to detect accesses to the page coming
-- 
2.22.0.657.g960e92d24f-goog
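
The three steps in the documentation hunk above (mark idle, wait, read
back and count) can be sketched in userspace C roughly as follows. This
is only an illustration under stated assumptions: the helper names are
hypothetical, and the file I/O is left to the caller because
/proc/<pid>/page_idle exists only with patch 1/2 applied.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Step 1 helper: fill a buffer so that every page in the covered range
 * is requested to be marked idle when the buffer is written out.
 */
static inline void mark_all_idle(uint64_t *chunks, size_t nchunks)
{
	for (size_t i = 0; i < nchunks; i++)
		chunks[i] = UINT64_MAX;
}

/*
 * Step 3 helper: count the bits still set in a bitmap read back from
 * the page_idle file, i.e. the pages that were not accessed meanwhile.
 */
static inline size_t count_idle_pages(const uint64_t *chunks, size_t nchunks)
{
	size_t idle = 0;

	for (size_t i = 0; i < nchunks; i++) {
		uint64_t c = chunks[i];

		while (c) {
			c &= c - 1;	/* clear the lowest set bit */
			idle++;
		}
	}
	return idle;
}
```

The buffer would be written to and later read from the file at an 8-byte
aligned offset. Note that patch 1/2 caps each read or write at PAGE_SIZE
bytes, so a large range has to be processed in several calls.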



* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-22 21:32 [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing Joel Fernandes (Google)
  2019-07-22 21:32 ` [PATCH v1 2/2] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
@ 2019-07-22 22:06 ` Andrew Morton
  2019-07-23 14:43   ` Joel Fernandes
  2019-07-24 19:33   ` Joel Fernandes
  2019-07-23  6:05 ` Michal Hocko
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2019-07-22 22:06 UTC
  To: Joel Fernandes (Google)
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, carmenjackson, Christian Hansen,
	Colin Ian King, dancol, David Howells, fmayer, joaodias, joelaf,
	Jonathan Corbet, Kees Cook, Kirill Tkhai, Konstantin Khlebnikov,
	linux-doc, linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport,
	minchan, minchan, namhyung, sspatil, surenb, Thomas Gleixner,
	timmurray, tkjos, Vlastimil Babka, wvw

On Mon, 22 Jul 2019 17:32:04 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:

> The page_idle tracking feature currently requires looking up the pagemap
> for a process and then interacting with /sys/kernel/mm/page_idle.
> This is cumbersome and error-prone: if the page changes between
> accessing the per-PID pagemap and the global page_idle bitmap, the
> information is no longer accurate.

Well, it's never going to be "accurate" - something could change one
nanosecond after userspace has read the data...

Presumably with this approach the data will be "more" accurate.  How
big a problem has this inaccuracy proven to be in real-world usage?

> Moreover, looking up the PFN from pagemap on Android devices is not
> supported for unprivileged processes: it requires SYS_ADMIN, and the
> PFN is otherwise reported as 0.
> 
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file. This
> eliminates the need for userspace to calculate the mapping of the page.
> It follows the exact same semantics as the global
> /sys/kernel/mm/page_idle, but it is easier to use for use cases
> where looking up the PFN is not needed, and it does not require SYS_ADMIN.
> It ended up simplifying userspace code, solves the security issue
> mentioned above, and works quite well. SELinux does not need to be turned
> off since no pagemap lookup is needed.
> 
> In Android, we are using this for the heap profiler (heapprofd), which
> profiles and pinpoints code paths that allocate memory and leave it
> idle for long periods of time.
> 
> Documentation material:
> The idle page tracking API for virtual address indexing using virtual page
> frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> except that it uses virtual instead of physical frame numbers.
> 
> This idle page tracking API can be simpler to use than physical address
> indexing, since the pagemap for a process does not need to be looked up
> to mark or read a page's idle bit. It is also more accurate than
> physical address indexing since in physical address indexing, address
> space changes can occur between reading the pagemap and reading the
> bitmap. In virtual address indexing, the process's mmap_sem is held for
> the duration of the access.
> 
> ...
>
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -11,6 +11,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_idle.h>
> +#include <linux/sched/mm.h>
>  
>  #define BITMAP_CHUNK_SIZE	sizeof(u64)
>  #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> @@ -28,15 +29,12 @@
>   *
>   * This function tries to get a user memory page by pfn as described above.
>   */

Above comment needs updating or moving?

> -static struct page *page_idle_get_page(unsigned long pfn)
> +static struct page *page_idle_get_page(struct page *page_in)
>  {
>  	struct page *page;
>  	pg_data_t *pgdat;
>  
> -	if (!pfn_valid(pfn))
> -		return NULL;
> -
> -	page = pfn_to_page(pfn);
> +	page = page_in;
>  	if (!page || !PageLRU(page) ||
>  	    !get_page_unless_zero(page))
>  		return NULL;
>
> ...
>
> +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long max_frame;
> +
> +	/* If an mm is not given, assume we want physical frames */
> +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> +
> +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> +		return -EINVAL;
> +
> +	*start = pos * BITS_PER_BYTE;
> +	if (*start >= max_frame)
> +		return -ENXIO;

ENXIO is said to mean "The system tried to use the device represented by a
file you specified, and it couldn't find the device.  This can mean that
the device file was installed incorrectly, or that the physical device
is missing or not correctly attached to the computer."

This doesn't seem appropriate in this usage and is hence possibly
misleading.  Someone whose application fails with ENXIO will be
scratching their heads.
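
For reference, the range check under discussion can be modeled in
userspace roughly as below. This is a sketch that mirrors the quoted
helper, not the kernel code itself: get_frames and CHUNK_BYTES are
hypothetical names, and max_frame is passed in directly instead of being
derived from max_pfn or mm->task_size.

```c
#include <errno.h>
#include <stdint.h>

#define CHUNK_BYTES 8u	/* sizeof(u64), like BITMAP_CHUNK_SIZE in the patch */

/*
 * Userspace model of the quoted range check. Returns 0 and fills
 * *start and *end (frame numbers, end exclusive) on success.
 */
static inline int get_frames(uint64_t pos, uint64_t count, uint64_t max_frame,
			     uint64_t *start, uint64_t *end)
{
	/* Offsets and lengths must cover whole 8-byte chunks. */
	if (pos % CHUNK_BYTES || count % CHUNK_BYTES)
		return -EINVAL;

	/* One bit per frame, so byte offset * 8 is the first frame. */
	*start = pos * 8;
	if (*start >= max_frame)
		return -ENXIO;	/* the return value questioned above */

	/* Clamp the end of the range to the last valid frame. */
	*end = *start + count * 8;
	if (*end > max_frame)
		*end = max_frame;
	return 0;
}
```

In this model, an aligned access that starts at or beyond max_frame is
the only case that yields -ENXIO.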

> +	*end = *start + count * BITS_PER_BYTE;
> +	if (*end > max_frame)
> +		*end = max_frame;
> +	return 0;
> +}
> +
>
> ...
>
> +static void add_page_idle_list(struct page *page,
> +			       unsigned long addr, struct mm_walk *walk)
> +{
> +	struct page *page_get;
> +	struct page_node *pn;
> +	int bit;
> +	unsigned long frames;
> +	struct page_idle_proc_priv *priv = walk->private;
> +	u64 *chunk = (u64 *)priv->buffer;
> +
> +	if (priv->write) {
> +		/* Find whether this page was asked to be marked */
> +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> +		bit = frames % BITMAP_CHUNK_BITS;
> +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> +		if (((*chunk >> bit) & 1) == 0)
> +			return;
> +	}
> +
> +	page_get = page_idle_get_page(page);
> +	if (!page_get)
> +		return;
> +
> +	pn = kmalloc(sizeof(*pn), GFP_ATOMIC);

I'm not liking this GFP_ATOMIC.  If I'm reading the code correctly,
userspace can ask for an arbitrarily large number of GFP_ATOMIC
allocations by doing a large read.  This can potentially exhaust page
reserves which things like networking Rx interrupts need and can make
this whole feature less reliable.

> +	if (!pn)
> +		return;
> +
> +	pn->page = page_get;
> +	pn->addr = addr;
> +	list_add(&pn->list, &idle_page_list);
> +}
> +
> +static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
> +				    unsigned long end,
> +				    struct mm_walk *walk)
> +{
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +	struct page *page;
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		if (pmd_present(*pmd)) {
> +			page = follow_trans_huge_pmd(vma, addr, pmd,
> +						     FOLL_DUMP|FOLL_WRITE);
> +			if (!IS_ERR_OR_NULL(page))
> +				add_page_idle_list(page, addr, walk);
> +		}
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, *pte);
> +		if (page)
> +			add_page_idle_list(page, addr, walk);
> +	}
> +
> +	pte_unmap_unlock(pte - 1, ptl);
> +	return 0;
> +}
> +
> +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> +			       size_t count, loff_t *pos,
> +			       struct task_struct *tsk, int write)
> +{
> +	int ret;
> +	char *buffer;
> +	u64 *out;
> +	unsigned long start_addr, end_addr, start_frame, end_frame;
> +	struct mm_struct *mm = file->private_data;
> +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> +	struct page_node *cur, *next;
> +	struct page_idle_proc_priv priv;
> +	bool walk_error = false;
> +
> +	if (!mm || !mmget_not_zero(mm))
> +		return -EINVAL;
> +
> +	if (count > PAGE_SIZE)
> +		count = PAGE_SIZE;
> +
> +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!buffer) {
> +		ret = -ENOMEM;
> +		goto out_mmput;
> +	}
> +	out = (u64 *)buffer;
> +
> +	if (write && copy_from_user(buffer, ubuff, count)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> +	if (ret)
> +		goto out;
> +
> +	start_addr = (start_frame << PAGE_SHIFT);
> +	end_addr = (end_frame << PAGE_SHIFT);
> +	priv.buffer = buffer;
> +	priv.start_addr = start_addr;
> +	priv.write = write;
> +	walk.private = &priv;
> +	walk.mm = mm;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	/*
> +	 * Protects the idle_page_list which is needed because
> +	 * walk_page_vma() holds ptlock which deadlocks with
> +	 * page_idle_clear_pte_refs(). So we have to collect all
> +	 * pages first, and then call page_idle_clear_pte_refs().
> +	 */
> +	spin_lock(&idle_page_list_lock);
> +	ret = walk_page_range(start_addr, end_addr, &walk);
> +	if (ret)
> +		walk_error = true;
> +
> +	list_for_each_entry_safe(cur, next, &idle_page_list, list) {
> +		int bit, index;
> +		unsigned long off;
> +		struct page *page = cur->page;
> +
> +		if (unlikely(walk_error))
> +			goto remove_page;
> +
> +		if (write) {
> +			page_idle_clear_pte_refs(page);
> +			set_page_idle(page);
> +		} else {
> +			if (page_really_idle(page)) {
> +				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
> +				bit = off % BITMAP_CHUNK_BITS;
> +				index = off / BITMAP_CHUNK_BITS;
> +				out[index] |= 1ULL << bit;
> +			}
> +		}
> +remove_page:
> +		put_page(page);
> +		list_del(&cur->list);
> +		kfree(cur);
> +	}
> +	spin_unlock(&idle_page_list_lock);
> +
> +	if (!write && !walk_error)
> +		ret = copy_to_user(ubuff, buffer, count);
> +
> +	up_read(&mm->mmap_sem);
> +out:
> +	kfree(buffer);
> +out_mmput:
> +	mmput(mm);
> +	if (!ret)
> +		ret = count;
> +	return ret;
> +
> +}
> +
> +ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
> +			    size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
> +}
> +
> +ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
> +			     size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
> +}
> +
>  static int __init page_idle_init(void)
>  {
>  	int err;
>  
> +	INIT_LIST_HEAD(&idle_page_list);
> +
>  	err = sysfs_create_group(mm_kobj, &page_idle_attr_group);
>  	if (err) {
>  		pr_err("page_idle: register sysfs failed\n");
> -- 
>
> ...
>



* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-22 21:32 [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing Joel Fernandes (Google)
  2019-07-22 21:32 ` [PATCH v1 2/2] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
  2019-07-22 22:06 ` [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing Andrew Morton
@ 2019-07-23  6:05 ` Michal Hocko
  2019-07-23 14:34   ` Joel Fernandes
  2019-07-23  6:13 ` Minchan Kim
  2019-07-23  8:43 ` Konstantin Khlebnikov
  4 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2019-07-23  6:05 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, joelaf, Jonathan Corbet, Kees Cook, Kirill Tkhai,
	Konstantin Khlebnikov, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, minchan, namhyung, sspatil, surenb,
	Thomas Gleixner, timmurray, tkjos, Vlastimil Babka, wvw,
	linux-api

[Cc linux-api - please always CC this list when introducing a
 user-visible API]

On Mon 22-07-19 17:32:04, Joel Fernandes (Google) wrote:
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> This is quite cumbersome and can be error-prone too. If between
> accessing the per-PID pagemap and the global page_idle bitmap, if
> something changes with the page then the information is not accurate.
> More over looking up PFN from pagemap in Android devices is not
> supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> the PFN.
> 
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file. This
> eliminates the need for userspace to calculate the mapping of the page.
> It follows the exact same semantics as the global
> /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> where looking up PFN is not needed and also does not require SYS_ADMIN.
> It ended up simplifying userspace code, solving the security issue
> mentioned and works quite well. SELinux does not need to be turned off
> since no pagemap look up is needed.
> 
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time.
> 
> Documentation material:
> The idle page tracking API for virtual address indexing using virtual page
> frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> except that it uses virtual instead of physical frame numbers.
> 
> This idle page tracking API can be simpler to use than physical address
> indexing, since the pagemap for a process does not need to be looked up
> to mark or read a page's idle bit. It is also more accurate than
> physical address indexing since in physical address indexing, address
> space changes can occur between reading the pagemap and reading the
> bitmap. In virtual address indexing, the process's mmap_sem is held for
> the duration of the access.

I didn't get to read the actual code but the overall idea makes sense to
me. I can see this being useful for userspace memory management (along
with remote MADV_PAGEOUT, MADV_COLD).

Normally I would object that the cumbersome nature of the existing
interface can be hidden in userspace, but I do agree that rowhammer has
made this one close to unusable for anything but a privileged process.

I do not think you can make any argument about accuracy because
the information will never be accurate. Sure, the race window is smaller
in principle, but you can hardly say by how much, or whether it is
smaller at all.

Thanks.
-- 
Michal Hocko
SUSE Labs
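
For readers following the point about the existing interface: below is a
minimal sketch of the arithmetic behind the two-step lookup (pagemap
first, then the global bitmap). It assumes the pagemap entry layout from
the kernel's pagemap documentation (bit 63 = present, bits 0-54 = PFN)
and 64-bit bitmap chunks. The helper names are hypothetical, and treating
a zero PFN as "hidden from an unprivileged reader" is an assumption made
for illustration. It is not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PM_ENTRY_BYTES 8u                          /* one u64 per virtual page */
#define PM_PFN_MASK    ((UINT64_C(1) << 55) - 1)   /* bits 0-54: PFN if present */
#define PM_PRESENT     (UINT64_C(1) << 63)         /* bit 63: page present in RAM */

/* Byte offset of the /proc/<pid>/pagemap entry describing vaddr. */
static uint64_t pagemap_offset(uint64_t vaddr, unsigned int page_shift)
{
	assert(page_shift < 64);
	return (vaddr >> page_shift) * PM_ENTRY_BYTES;
}

/*
 * Extract the PFN from a pagemap entry. Returns false when the page is
 * not present, or when the PFN field reads as 0 (assumed here to mean
 * the kernel hid it from a reader without CAP_SYS_ADMIN).
 */
static bool pagemap_entry_pfn(uint64_t entry, uint64_t *pfn)
{
	if (!(entry & PM_PRESENT))
		return false;
	*pfn = entry & PM_PFN_MASK;
	return *pfn != 0;
}

/*
 * Byte offset into /sys/kernel/mm/page_idle/bitmap for a PFN, with the
 * bit position inside that 64-bit chunk returned through *bit.
 */
static uint64_t idle_bitmap_offset(uint64_t pfn, unsigned int *bit)
{
	*bit = (unsigned int)(pfn % 64);
	return (pfn / 64) * 8;
}
```

A caller would pread() 8 bytes at pagemap_offset(), pass them through
pagemap_entry_pfn(), and only then know where to look in the global
bitmap. The second step is the one that fails when the PFN reads as 0.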

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-22 21:32 [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2019-07-23  6:05 ` Michal Hocko
@ 2019-07-23  6:13 ` Minchan Kim
  2019-07-23 14:20   ` Joel Fernandes
  2019-07-23  8:43 ` Konstantin Khlebnikov
  4 siblings, 1 reply; 18+ messages in thread
From: Minchan Kim @ 2019-07-23  6:13 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, joelaf, Jonathan Corbet, Kees Cook, Kirill Tkhai,
	Konstantin Khlebnikov, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, namhyung, sspatil, surenb,
	Thomas Gleixner, timmurray, tkjos, Vlastimil Babka, wvw

Hi Joel,

On Mon, Jul 22, 2019 at 05:32:04PM -0400, Joel Fernandes (Google) wrote:
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> This is quite cumbersome and can be error-prone too. If between

cumbersome: That's a fair tradeoff between idle page tracking and
clear_refs, because idle page tracking can check a page even when it
is not mapped.

error-prone: What's the error?

> accessing the per-PID pagemap and the global page_idle bitmap, if
> something changes with the page then the information is not accurate.

Is this timing issue what you mean by error?
Why do you need to be accurate? IOW, accuracy is always good, but what
scale of accuracy do you need?

> More over looking up PFN from pagemap in Android devices is not
> supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> the PFN.
> 
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file. This
> eliminates the need for userspace to calculate the mapping of the page.
> It follows the exact same semantics as the global
> /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> where looking up PFN is not needed and also does not require SYS_ADMIN.

Ah, so the primary goal is to provide a convenience interface, and it
would help accuracy, too. IOW, accuracy is not your main goal?

> It ended up simplifying userspace code, solving the security issue
> mentioned and works quite well. SELinux does not need to be turned off
> since no pagemap look up is needed.

I'm not sure how painful it is to check this via pagemap for your goal,
but I'm not sure it's a good idea to create a new ABI just for
convenience. I think that's what we have libraries for.

> 
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time.

So the goal is to detect idle pages with idle memory tracking?
It can't work well, because such idle pages could eventually be swapped
out and lose all the flags in the page descriptor, which are the working
mechanism of idle page tracking. It should have been named "workingset
page tracking", not "idle page tracking".

> 
> Documentation material:
> The idle page tracking API for virtual address indexing using virtual page
> frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> except that it uses virtual instead of physical frame numbers.
> 
> This idle page tracking API can be simpler to use than physical address
> indexing, since the pagemap for a process does not need to be looked up
> to mark or read a page's idle bit. It is also more accurate than
> physical address indexing since in physical address indexing, address
> space changes can occur between reading the pagemap and reading the
> bitmap. In virtual address indexing, the process's mmap_sem is held for
> the duration of the access.
> 
> Cc: vdavydov.dev@gmail.com
> Cc: Brendan Gregg <bgregg@netflix.com>
> Cc: kernel-team@android.com
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> ---
> Internal review -> v1:
> Fixes from Suren.
> Corrections to change log, docs (Florian, Sandeep)
> 
>  fs/proc/base.c            |   3 +
>  fs/proc/internal.h        |   1 +
>  fs/proc/task_mmu.c        |  57 +++++++
>  include/linux/page_idle.h |   4 +
>  mm/page_idle.c            | 305 +++++++++++++++++++++++++++++++++-----
>  5 files changed, 330 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 77eb628ecc7f..a58dd74606e9 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
>  	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
>  	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
>  	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
> +#endif
>  #endif
>  #ifdef CONFIG_SECURITY
>  	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index cd0c8d5ce9a1..bc9371880c63 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
>  extern const struct file_operations proc_pid_smaps_rollup_operations;
>  extern const struct file_operations proc_clear_refs_operations;
>  extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_page_idle_operations;
>  
>  extern unsigned long task_vsize(struct mm_struct *);
>  extern unsigned long task_statm(struct mm_struct *,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4d2b860dbc3f..11ccc53da38e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
>  	.open		= pagemap_open,
>  	.release	= pagemap_release,
>  };
> +
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
> +				   size_t count, loff_t *ppos)
> +{
> +	int ret;
> +	struct task_struct *tsk = get_proc_task(file_inode(file));
> +
> +	if (!tsk)
> +		return -EINVAL;
> +	ret = page_idle_proc_read(file, buf, count, ppos, tsk);
> +	put_task_struct(tsk);
> +	return ret;
> +}
> +
> +static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
> +				 size_t count, loff_t *ppos)
> +{
> +	int ret;
> +	struct task_struct *tsk = get_proc_task(file_inode(file));
> +
> +	if (!tsk)
> +		return -EINVAL;
> +	ret = page_idle_proc_write(file, (char __user *)buf, count, ppos, tsk);
> +	put_task_struct(tsk);
> +	return ret;
> +}
> +
> +static int proc_page_idle_open(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm;
> +
> +	mm = proc_mem_open(inode, PTRACE_MODE_READ);
> +	if (IS_ERR(mm))
> +		return PTR_ERR(mm);
> +	file->private_data = mm;
> +	return 0;
> +}
> +
> +static int proc_page_idle_release(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm = file->private_data;
> +
> +	if (mm)
> +		mmdrop(mm);
> +	return 0;
> +}
> +
> +const struct file_operations proc_page_idle_operations = {
> +	.llseek		= mem_lseek, /* borrow this */
> +	.read		= proc_page_idle_read,
> +	.write		= proc_page_idle_write,
> +	.open		= proc_page_idle_open,
> +	.release	= proc_page_idle_release,
> +};
> +#endif /* CONFIG_IDLE_PAGE_TRACKING */
> +
>  #endif /* CONFIG_PROC_PAGE_MONITOR */
>  
>  #ifdef CONFIG_NUMA
> diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> index 1e894d34bdce..f1bc2640d85e 100644
> --- a/include/linux/page_idle.h
> +++ b/include/linux/page_idle.h
> @@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
>  }
>  #endif /* CONFIG_64BIT */
>  
> +ssize_t page_idle_proc_write(struct file *file,
> +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> +ssize_t page_idle_proc_read(struct file *file,
> +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
>  #else /* !CONFIG_IDLE_PAGE_TRACKING */
>  
>  static inline bool page_is_young(struct page *page)
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index 295512465065..874a60c41fef 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -11,6 +11,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_idle.h>
> +#include <linux/sched/mm.h>
>  
>  #define BITMAP_CHUNK_SIZE	sizeof(u64)
>  #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> @@ -28,15 +29,12 @@
>   *
>   * This function tries to get a user memory page by pfn as described above.
>   */
> -static struct page *page_idle_get_page(unsigned long pfn)
> +static struct page *page_idle_get_page(struct page *page_in)
>  {
>  	struct page *page;
>  	pg_data_t *pgdat;
>  
> -	if (!pfn_valid(pfn))
> -		return NULL;
> -
> -	page = pfn_to_page(pfn);
> +	page = page_in;
>  	if (!page || !PageLRU(page) ||
>  	    !get_page_unless_zero(page))
>  		return NULL;
> @@ -51,6 +49,15 @@ static struct page *page_idle_get_page(unsigned long pfn)
>  	return page;
>  }
>  
> +static struct page *page_idle_get_page_pfn(unsigned long pfn)
> +{
> +
> +	if (!pfn_valid(pfn))
> +		return NULL;
> +
> +	return page_idle_get_page(pfn_to_page(pfn));
> +}
> +
>  static bool page_idle_clear_pte_refs_one(struct page *page,
>  					struct vm_area_struct *vma,
>  					unsigned long addr, void *arg)
> @@ -118,6 +125,47 @@ static void page_idle_clear_pte_refs(struct page *page)
>  		unlock_page(page);
>  }
>  
> +/* Helper to get the start and end frame given a pos and count */
> +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long max_frame;
> +
> +	/* If an mm is not given, assume we want physical frames */
> +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> +
> +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> +		return -EINVAL;
> +
> +	*start = pos * BITS_PER_BYTE;
> +	if (*start >= max_frame)
> +		return -ENXIO;
> +
> +	*end = *start + count * BITS_PER_BYTE;
> +	if (*end > max_frame)
> +		*end = max_frame;
> +	return 0;
> +}
> +
> +static bool page_really_idle(struct page *page)
> +{
> +	if (!page)
> +		return false;
> +
> +	if (page_is_idle(page)) {
> +		/*
> +		 * The page might have been referenced via a
> +		 * pte, in which case it is not idle. Clear
> +		 * refs and recheck.
> +		 */
> +		page_idle_clear_pte_refs(page);
> +		if (page_is_idle(page))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
>  				     struct bin_attribute *attr, char *buf,
>  				     loff_t pos, size_t count)
> @@ -125,35 +173,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
>  	u64 *out = (u64 *)buf;
>  	struct page *page;
>  	unsigned long pfn, end_pfn;
> -	int bit;
> -
> -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> -		return -EINVAL;
> -
> -	pfn = pos * BITS_PER_BYTE;
> -	if (pfn >= max_pfn)
> -		return 0;
> +	int bit, ret;
>  
> -	end_pfn = pfn + count * BITS_PER_BYTE;
> -	if (end_pfn > max_pfn)
> -		end_pfn = max_pfn;
> +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> +	if (ret == -ENXIO)
> +		return 0;  /* Reads beyond max_pfn do nothing */
> +	else if (ret)
> +		return ret;
>  
>  	for (; pfn < end_pfn; pfn++) {
>  		bit = pfn % BITMAP_CHUNK_BITS;
>  		if (!bit)
>  			*out = 0ULL;
> -		page = page_idle_get_page(pfn);
> -		if (page) {
> -			if (page_is_idle(page)) {
> -				/*
> -				 * The page might have been referenced via a
> -				 * pte, in which case it is not idle. Clear
> -				 * refs and recheck.
> -				 */
> -				page_idle_clear_pte_refs(page);
> -				if (page_is_idle(page))
> -					*out |= 1ULL << bit;
> -			}
> +		page = page_idle_get_page_pfn(pfn);
> +		if (page && page_really_idle(page)) {
> +			*out |= 1ULL << bit;
>  			put_page(page);
>  		}
>  		if (bit == BITMAP_CHUNK_BITS - 1)
> @@ -170,23 +204,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
>  	const u64 *in = (u64 *)buf;
>  	struct page *page;
>  	unsigned long pfn, end_pfn;
> -	int bit;
> +	int bit, ret;
>  
> -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> -		return -EINVAL;
> -
> -	pfn = pos * BITS_PER_BYTE;
> -	if (pfn >= max_pfn)
> -		return -ENXIO;
> -
> -	end_pfn = pfn + count * BITS_PER_BYTE;
> -	if (end_pfn > max_pfn)
> -		end_pfn = max_pfn;
> +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> +	if (ret)
> +		return ret;
>  
>  	for (; pfn < end_pfn; pfn++) {
>  		bit = pfn % BITMAP_CHUNK_BITS;
>  		if ((*in >> bit) & 1) {
> -			page = page_idle_get_page(pfn);
> +			page = page_idle_get_page_pfn(pfn);
>  			if (page) {
>  				page_idle_clear_pte_refs(page);
>  				set_page_idle(page);
> @@ -224,10 +251,208 @@ struct page_ext_operations page_idle_ops = {
>  };
>  #endif
>  
> +/*  page_idle tracking for /proc/<pid>/page_idle */
> +
> +static DEFINE_SPINLOCK(idle_page_list_lock);
> +struct list_head idle_page_list;
> +
> +struct page_node {
> +	struct page *page;
> +	unsigned long addr;
> +	struct list_head list;
> +};
> +
> +struct page_idle_proc_priv {
> +	unsigned long start_addr;
> +	char *buffer;
> +	int write;
> +};
> +
> +static void add_page_idle_list(struct page *page,
> +			       unsigned long addr, struct mm_walk *walk)
> +{
> +	struct page *page_get;
> +	struct page_node *pn;
> +	int bit;
> +	unsigned long frames;
> +	struct page_idle_proc_priv *priv = walk->private;
> +	u64 *chunk = (u64 *)priv->buffer;
> +
> +	if (priv->write) {
> +		/* Find whether this page was asked to be marked */
> +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> +		bit = frames % BITMAP_CHUNK_BITS;
> +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> +		if (((*chunk >> bit) & 1) == 0)
> +			return;
> +	}
> +
> +	page_get = page_idle_get_page(page);
> +	if (!page_get)
> +		return;
> +
> +	pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
> +	if (!pn)
> +		return;
> +
> +	pn->page = page_get;
> +	pn->addr = addr;
> +	list_add(&pn->list, &idle_page_list);
> +}
> +
> +static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
> +				    unsigned long end,
> +				    struct mm_walk *walk)
> +{
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +	struct page *page;
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		if (pmd_present(*pmd)) {
> +			page = follow_trans_huge_pmd(vma, addr, pmd,
> +						     FOLL_DUMP|FOLL_WRITE);
> +			if (!IS_ERR_OR_NULL(page))
> +				add_page_idle_list(page, addr, walk);
> +		}
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, *pte);
> +		if (page)
> +			add_page_idle_list(page, addr, walk);
> +	}
> +
> +	pte_unmap_unlock(pte - 1, ptl);
> +	return 0;
> +}
> +
> +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> +			       size_t count, loff_t *pos,
> +			       struct task_struct *tsk, int write)
> +{
> +	int ret;
> +	char *buffer;
> +	u64 *out;
> +	unsigned long start_addr, end_addr, start_frame, end_frame;
> +	struct mm_struct *mm = file->private_data;
> +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> +	struct page_node *cur, *next;
> +	struct page_idle_proc_priv priv;
> +	bool walk_error = false;
> +
> +	if (!mm || !mmget_not_zero(mm))
> +		return -EINVAL;
> +
> +	if (count > PAGE_SIZE)
> +		count = PAGE_SIZE;
> +
> +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!buffer) {
> +		ret = -ENOMEM;
> +		goto out_mmput;
> +	}
> +	out = (u64 *)buffer;
> +
> +	if (write && copy_from_user(buffer, ubuff, count)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> +	if (ret)
> +		goto out;
> +
> +	start_addr = (start_frame << PAGE_SHIFT);
> +	end_addr = (end_frame << PAGE_SHIFT);
> +	priv.buffer = buffer;
> +	priv.start_addr = start_addr;
> +	priv.write = write;
> +	walk.private = &priv;
> +	walk.mm = mm;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	/*
> +	 * Protects the idle_page_list which is needed because
> +	 * walk_page_vma() holds ptlock which deadlocks with
> +	 * page_idle_clear_pte_refs(). So we have to collect all
> +	 * pages first, and then call page_idle_clear_pte_refs().
> +	 */
> +	spin_lock(&idle_page_list_lock);
> +	ret = walk_page_range(start_addr, end_addr, &walk);
> +	if (ret)
> +		walk_error = true;
> +
> +	list_for_each_entry_safe(cur, next, &idle_page_list, list) {
> +		int bit, index;
> +		unsigned long off;
> +		struct page *page = cur->page;
> +
> +		if (unlikely(walk_error))
> +			goto remove_page;
> +
> +		if (write) {
> +			page_idle_clear_pte_refs(page);
> +			set_page_idle(page);
> +		} else {
> +			if (page_really_idle(page)) {
> +				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
> +				bit = off % BITMAP_CHUNK_BITS;
> +				index = off / BITMAP_CHUNK_BITS;
> +				out[index] |= 1ULL << bit;
> +			}
> +		}
> +remove_page:
> +		put_page(page);
> +		list_del(&cur->list);
> +		kfree(cur);
> +	}
> +	spin_unlock(&idle_page_list_lock);
> +
> +	if (!write && !walk_error)
> +		ret = copy_to_user(ubuff, buffer, count);
> +
> +	up_read(&mm->mmap_sem);
> +out:
> +	kfree(buffer);
> +out_mmput:
> +	mmput(mm);
> +	if (!ret)
> +		ret = count;
> +	return ret;
> +
> +}
> +
> +ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
> +			    size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
> +}
> +
> +ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
> +			     size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
> +}
> +
>  static int __init page_idle_init(void)
>  {
>  	int err;
>  
> +	INIT_LIST_HEAD(&idle_page_list);
> +
>  	err = sysfs_create_group(mm_kobj, &page_idle_attr_group);
>  	if (err) {
>  		pr_err("page_idle: register sysfs failed\n");
> -- 
> 2.22.0.657.g960e92d24f-goog
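
As a reading aid for the indexing in the quoted page_idle_get_frames()
and the read loop (pos * BITS_PER_BYTE gives the first frame, and pos
and count must be multiples of 8 bytes), here is a minimal userspace
sketch of where the bit for a given virtual address would live in
/proc/<pid>/page_idle. The struct and helper names are hypothetical;
only the arithmetic is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_BYTES 8u                  /* BITMAP_CHUNK_SIZE: sizeof(u64) */
#define CHUNK_BITS  (CHUNK_BYTES * 8u)  /* BITMAP_CHUNK_BITS: 64 frames per chunk */

struct idle_pos {
	uint64_t file_off;  /* 8-byte-aligned offset to pread()/pwrite() at */
	unsigned int bit;   /* bit inside the 64-bit chunk at that offset */
};

/* Hypothetical helper: locate the idle bit for one virtual address. */
static struct idle_pos vaddr_to_idle_pos(uint64_t vaddr, unsigned int page_shift)
{
	uint64_t vfn;
	struct idle_pos p;

	assert(page_shift < 64);
	vfn = vaddr >> page_shift;                    /* virtual frame number */
	p.file_off = (vfn / CHUNK_BITS) * CHUNK_BYTES;
	p.bit = (unsigned int)(vfn % CHUNK_BITS);
	return p;
}
```

With 4 KiB pages (page_shift = 12), address 0x40000 is virtual frame 64,
so its bit is bit 0 of the chunk at byte offset 8. A tool would write a
chunk with that bit set to mark the page idle and later read the same
chunk back to see whether the bit survived.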

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-22 21:32 [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2019-07-23  6:13 ` Minchan Kim
@ 2019-07-23  8:43 ` Konstantin Khlebnikov
  2019-07-23 10:10   ` Konstantin Khlebnikov
  4 siblings, 1 reply; 18+ messages in thread
From: Konstantin Khlebnikov @ 2019-07-23  8:43 UTC (permalink / raw)
  To: Joel Fernandes (Google), linux-kernel
  Cc: vdavydov.dev, Brendan Gregg, kernel-team, Alexey Dobriyan,
	Al Viro, Andrew Morton, carmenjackson, Christian Hansen,
	Colin Ian King, dancol, David Howells, fmayer, joaodias, joelaf,
	Jonathan Corbet, Kees Cook, Kirill Tkhai, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, minchan,
	minchan, namhyung, sspatil, surenb, Thomas Gleixner, timmurray,
	tkjos, Vlastimil Babka, wvw

On 23.07.2019 0:32, Joel Fernandes (Google) wrote:
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> This is quite cumbersome and can be error-prone too. If between
> accessing the per-PID pagemap and the global page_idle bitmap, if
> something changes with the page then the information is not accurate.
> More over looking up PFN from pagemap in Android devices is not
> supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> the PFN.
> 
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file. This
> eliminates the need for userspace to calculate the mapping of the page.
> It follows the exact same semantics as the global
> /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> where looking up PFN is not needed and also does not require SYS_ADMIN.
> It ended up simplifying userspace code, solving the security issue
> mentioned and works quite well. SELinux does not need to be turned off
> since no pagemap look up is needed.
> 
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time.
> 
> Documentation material:
> The idle page tracking API for virtual address indexing using virtual page
> frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> except that it uses virtual instead of physical frame numbers.
> 
> This idle page tracking API can be simpler to use than physical address
> indexing, since the pagemap for a process does not need to be looked up
> to mark or read a page's idle bit. It is also more accurate than
> physical address indexing since in physical address indexing, address
> space changes can occur between reading the pagemap and reading the
> bitmap. In virtual address indexing, the process's mmap_sem is held for
> the duration of the access.

Maybe integrate this into the existing interfaces: /proc/pid/clear_refs
and /proc/pid/pagemap?

I.e. echo X > /proc/pid/clear_refs clears the reference bits in the ptes
and marks pages idle only for pages mapped in this process.
And an idle bit in /proc/pid/pagemap tells whether the page is still idle
in this process.
This is faster - we don't need to walk the whole rmap for that.

> 
> Cc: vdavydov.dev@gmail.com
> Cc: Brendan Gregg <bgregg@netflix.com>
> Cc: kernel-team@android.com
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> ---
> Internal review -> v1:
> Fixes from Suren.
> Corrections to change log, docs (Florian, Sandeep)
> 
>   fs/proc/base.c            |   3 +
>   fs/proc/internal.h        |   1 +
>   fs/proc/task_mmu.c        |  57 +++++++
>   include/linux/page_idle.h |   4 +
>   mm/page_idle.c            | 305 +++++++++++++++++++++++++++++++++-----
>   5 files changed, 330 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 77eb628ecc7f..a58dd74606e9 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
>   	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
>   	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
>   	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
> +#endif
>   #endif
>   #ifdef CONFIG_SECURITY
>   	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index cd0c8d5ce9a1..bc9371880c63 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
>   extern const struct file_operations proc_pid_smaps_rollup_operations;
>   extern const struct file_operations proc_clear_refs_operations;
>   extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_page_idle_operations;
>   
>   extern unsigned long task_vsize(struct mm_struct *);
>   extern unsigned long task_statm(struct mm_struct *,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4d2b860dbc3f..11ccc53da38e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
>   	.open		= pagemap_open,
>   	.release	= pagemap_release,
>   };
> +
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
> +				   size_t count, loff_t *ppos)
> +{
> +	int ret;
> +	struct task_struct *tsk = get_proc_task(file_inode(file));
> +
> +	if (!tsk)
> +		return -EINVAL;
> +	ret = page_idle_proc_read(file, buf, count, ppos, tsk);
> +	put_task_struct(tsk);
> +	return ret;
> +}
> +
> +static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
> +				 size_t count, loff_t *ppos)
> +{
> +	int ret;
> +	struct task_struct *tsk = get_proc_task(file_inode(file));
> +
> +	if (!tsk)
> +		return -EINVAL;
> +	ret = page_idle_proc_write(file, (char __user *)buf, count, ppos, tsk);
> +	put_task_struct(tsk);
> +	return ret;
> +}
> +
> +static int proc_page_idle_open(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm;
> +
> +	mm = proc_mem_open(inode, PTRACE_MODE_READ);
> +	if (IS_ERR(mm))
> +		return PTR_ERR(mm);
> +	file->private_data = mm;
> +	return 0;
> +}
> +
> +static int proc_page_idle_release(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm = file->private_data;
> +
> +	if (mm)
> +		mmdrop(mm);
> +	return 0;
> +}
> +
> +const struct file_operations proc_page_idle_operations = {
> +	.llseek		= mem_lseek, /* borrow this */
> +	.read		= proc_page_idle_read,
> +	.write		= proc_page_idle_write,
> +	.open		= proc_page_idle_open,
> +	.release	= proc_page_idle_release,
> +};
> +#endif /* CONFIG_IDLE_PAGE_TRACKING */
> +
>   #endif /* CONFIG_PROC_PAGE_MONITOR */
>   
>   #ifdef CONFIG_NUMA
> diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> index 1e894d34bdce..f1bc2640d85e 100644
> --- a/include/linux/page_idle.h
> +++ b/include/linux/page_idle.h
> @@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
>   }
>   #endif /* CONFIG_64BIT */
>   
> +ssize_t page_idle_proc_write(struct file *file,
> +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> +ssize_t page_idle_proc_read(struct file *file,
> +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
>   #else /* !CONFIG_IDLE_PAGE_TRACKING */
>   
>   static inline bool page_is_young(struct page *page)
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index 295512465065..874a60c41fef 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -11,6 +11,7 @@
>   #include <linux/mmu_notifier.h>
>   #include <linux/page_ext.h>
>   #include <linux/page_idle.h>
> +#include <linux/sched/mm.h>
>   
>   #define BITMAP_CHUNK_SIZE	sizeof(u64)
>   #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> @@ -28,15 +29,12 @@
>    *
>    * This function tries to get a user memory page by pfn as described above.
>    */
> -static struct page *page_idle_get_page(unsigned long pfn)
> +static struct page *page_idle_get_page(struct page *page_in)
>   {
>   	struct page *page;
>   	pg_data_t *pgdat;
>   
> -	if (!pfn_valid(pfn))
> -		return NULL;
> -
> -	page = pfn_to_page(pfn);
> +	page = page_in;
>   	if (!page || !PageLRU(page) ||
>   	    !get_page_unless_zero(page))
>   		return NULL;
> @@ -51,6 +49,15 @@ static struct page *page_idle_get_page(unsigned long pfn)
>   	return page;
>   }
>   
> +static struct page *page_idle_get_page_pfn(unsigned long pfn)
> +{
> +
> +	if (!pfn_valid(pfn))
> +		return NULL;
> +
> +	return page_idle_get_page(pfn_to_page(pfn));
> +}
> +
>   static bool page_idle_clear_pte_refs_one(struct page *page,
>   					struct vm_area_struct *vma,
>   					unsigned long addr, void *arg)
> @@ -118,6 +125,47 @@ static void page_idle_clear_pte_refs(struct page *page)
>   		unlock_page(page);
>   }
>   
> +/* Helper to get the start and end frame given a pos and count */
> +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long max_frame;
> +
> +	/* If an mm is not given, assume we want physical frames */
> +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> +
> +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> +		return -EINVAL;
> +
> +	*start = pos * BITS_PER_BYTE;
> +	if (*start >= max_frame)
> +		return -ENXIO;
> +
> +	*end = *start + count * BITS_PER_BYTE;
> +	if (*end > max_frame)
> +		*end = max_frame;
> +	return 0;
> +}
> +
> +static bool page_really_idle(struct page *page)
> +{
> +	if (!page)
> +		return false;
> +
> +	if (page_is_idle(page)) {
> +		/*
> +		 * The page might have been referenced via a
> +		 * pte, in which case it is not idle. Clear
> +		 * refs and recheck.
> +		 */
> +		page_idle_clear_pte_refs(page);
> +		if (page_is_idle(page))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>   static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
>   				     struct bin_attribute *attr, char *buf,
>   				     loff_t pos, size_t count)
> @@ -125,35 +173,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
>   	u64 *out = (u64 *)buf;
>   	struct page *page;
>   	unsigned long pfn, end_pfn;
> -	int bit;
> -
> -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> -		return -EINVAL;
> -
> -	pfn = pos * BITS_PER_BYTE;
> -	if (pfn >= max_pfn)
> -		return 0;
> +	int bit, ret;
>   
> -	end_pfn = pfn + count * BITS_PER_BYTE;
> -	if (end_pfn > max_pfn)
> -		end_pfn = max_pfn;
> +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> +	if (ret == -ENXIO)
> +		return 0;  /* Reads beyond max_pfn do nothing */
> +	else if (ret)
> +		return ret;
>   
>   	for (; pfn < end_pfn; pfn++) {
>   		bit = pfn % BITMAP_CHUNK_BITS;
>   		if (!bit)
>   			*out = 0ULL;
> -		page = page_idle_get_page(pfn);
> -		if (page) {
> -			if (page_is_idle(page)) {
> -				/*
> -				 * The page might have been referenced via a
> -				 * pte, in which case it is not idle. Clear
> -				 * refs and recheck.
> -				 */
> -				page_idle_clear_pte_refs(page);
> -				if (page_is_idle(page))
> -					*out |= 1ULL << bit;
> -			}
> +		page = page_idle_get_page_pfn(pfn);
> +		if (page && page_really_idle(page)) {
> +			*out |= 1ULL << bit;
>   			put_page(page);
>   		}
>   		if (bit == BITMAP_CHUNK_BITS - 1)
> @@ -170,23 +204,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
>   	const u64 *in = (u64 *)buf;
>   	struct page *page;
>   	unsigned long pfn, end_pfn;
> -	int bit;
> +	int bit, ret;
>   
> -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> -		return -EINVAL;
> -
> -	pfn = pos * BITS_PER_BYTE;
> -	if (pfn >= max_pfn)
> -		return -ENXIO;
> -
> -	end_pfn = pfn + count * BITS_PER_BYTE;
> -	if (end_pfn > max_pfn)
> -		end_pfn = max_pfn;
> +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> +	if (ret)
> +		return ret;
>   
>   	for (; pfn < end_pfn; pfn++) {
>   		bit = pfn % BITMAP_CHUNK_BITS;
>   		if ((*in >> bit) & 1) {
> -			page = page_idle_get_page(pfn);
> +			page = page_idle_get_page_pfn(pfn);
>   			if (page) {
>   				page_idle_clear_pte_refs(page);
>   				set_page_idle(page);
> @@ -224,10 +251,208 @@ struct page_ext_operations page_idle_ops = {
>   };
>   #endif
>   
> +/*  page_idle tracking for /proc/<pid>/page_idle */
> +
> +static DEFINE_SPINLOCK(idle_page_list_lock);
> +struct list_head idle_page_list;
> +
> +struct page_node {
> +	struct page *page;
> +	unsigned long addr;
> +	struct list_head list;
> +};
> +
> +struct page_idle_proc_priv {
> +	unsigned long start_addr;
> +	char *buffer;
> +	int write;
> +};
> +
> +static void add_page_idle_list(struct page *page,
> +			       unsigned long addr, struct mm_walk *walk)
> +{
> +	struct page *page_get;
> +	struct page_node *pn;
> +	int bit;
> +	unsigned long frames;
> +	struct page_idle_proc_priv *priv = walk->private;
> +	u64 *chunk = (u64 *)priv->buffer;
> +
> +	if (priv->write) {
> +		/* Find whether this page was asked to be marked */
> +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> +		bit = frames % BITMAP_CHUNK_BITS;
> +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> +		if (((*chunk >> bit) & 1) == 0)
> +			return;
> +	}
> +
> +	page_get = page_idle_get_page(page);
> +	if (!page_get)
> +		return;
> +
> +	pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
> +	if (!pn)
> +		return;
> +
> +	pn->page = page_get;
> +	pn->addr = addr;
> +	list_add(&pn->list, &idle_page_list);
> +}
> +
> +static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
> +				    unsigned long end,
> +				    struct mm_walk *walk)
> +{
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +	struct page *page;
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		if (pmd_present(*pmd)) {
> +			page = follow_trans_huge_pmd(vma, addr, pmd,
> +						     FOLL_DUMP|FOLL_WRITE);
> +			if (!IS_ERR_OR_NULL(page))
> +				add_page_idle_list(page, addr, walk);
> +		}
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, *pte);
> +		if (page)
> +			add_page_idle_list(page, addr, walk);
> +	}
> +
> +	pte_unmap_unlock(pte - 1, ptl);
> +	return 0;
> +}
> +
> +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> +			       size_t count, loff_t *pos,
> +			       struct task_struct *tsk, int write)
> +{
> +	int ret;
> +	char *buffer;
> +	u64 *out;
> +	unsigned long start_addr, end_addr, start_frame, end_frame;
> +	struct mm_struct *mm = file->private_data;
> +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> +	struct page_node *cur, *next;
> +	struct page_idle_proc_priv priv;
> +	bool walk_error = false;
> +
> +	if (!mm || !mmget_not_zero(mm))
> +		return -EINVAL;
> +
> +	if (count > PAGE_SIZE)
> +		count = PAGE_SIZE;
> +
> +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!buffer) {
> +		ret = -ENOMEM;
> +		goto out_mmput;
> +	}
> +	out = (u64 *)buffer;
> +
> +	if (write && copy_from_user(buffer, ubuff, count)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> +	if (ret)
> +		goto out;
> +
> +	start_addr = (start_frame << PAGE_SHIFT);
> +	end_addr = (end_frame << PAGE_SHIFT);
> +	priv.buffer = buffer;
> +	priv.start_addr = start_addr;
> +	priv.write = write;
> +	walk.private = &priv;
> +	walk.mm = mm;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	/*
> +	 * Protects the idle_page_list which is needed because
> +	 * walk_page_vma() holds ptlock which deadlocks with
> +	 * page_idle_clear_pte_refs(). So we have to collect all
> +	 * pages first, and then call page_idle_clear_pte_refs().
> +	 */
> +	spin_lock(&idle_page_list_lock);
> +	ret = walk_page_range(start_addr, end_addr, &walk);
> +	if (ret)
> +		walk_error = true;
> +
> +	list_for_each_entry_safe(cur, next, &idle_page_list, list) {
> +		int bit, index;
> +		unsigned long off;
> +		struct page *page = cur->page;
> +
> +		if (unlikely(walk_error))
> +			goto remove_page;
> +
> +		if (write) {
> +			page_idle_clear_pte_refs(page);
> +			set_page_idle(page);
> +		} else {
> +			if (page_really_idle(page)) {
> +				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
> +				bit = off % BITMAP_CHUNK_BITS;
> +				index = off / BITMAP_CHUNK_BITS;
> +				out[index] |= 1ULL << bit;
> +			}
> +		}
> +remove_page:
> +		put_page(page);
> +		list_del(&cur->list);
> +		kfree(cur);
> +	}
> +	spin_unlock(&idle_page_list_lock);
> +
> +	if (!write && !walk_error)
> +		ret = copy_to_user(ubuff, buffer, count);
> +
> +	up_read(&mm->mmap_sem);
> +out:
> +	kfree(buffer);
> +out_mmput:
> +	mmput(mm);
> +	if (!ret)
> +		ret = count;
> +	return ret;
> +
> +}
> +
> +ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
> +			    size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
> +}
> +
> +ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
> +			     size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
> +}
> +
>   static int __init page_idle_init(void)
>   {
>   	int err;
>   
> +	INIT_LIST_HEAD(&idle_page_list);
> +
>   	err = sysfs_create_group(mm_kobj, &page_idle_attr_group);
>   	if (err) {
>   		pr_err("page_idle: register sysfs failed\n");
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-23  8:43 ` Konstantin Khlebnikov
@ 2019-07-23 10:10   ` Konstantin Khlebnikov
  2019-07-23 13:47     ` Joel Fernandes
  0 siblings, 1 reply; 18+ messages in thread
From: Konstantin Khlebnikov @ 2019-07-23 10:10 UTC (permalink / raw)
  To: Joel Fernandes (Google), linux-kernel
  Cc: vdavydov.dev, Brendan Gregg, kernel-team, Alexey Dobriyan,
	Al Viro, Andrew Morton, carmenjackson, Christian Hansen,
	Colin Ian King, dancol, David Howells, fmayer, joaodias, joelaf,
	Jonathan Corbet, Kees Cook, Kirill Tkhai, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, minchan,
	minchan, namhyung, sspatil, surenb, Thomas Gleixner, timmurray,
	tkjos, Vlastimil Babka, wvw

On 23.07.2019 11:43, Konstantin Khlebnikov wrote:
> On 23.07.2019 0:32, Joel Fernandes (Google) wrote:
>> The page_idle tracking feature currently requires looking up the pagemap
>> for a process followed by interacting with /sys/kernel/mm/page_idle.
>> This is quite cumbersome and can be error-prone too. If between
>> accessing the per-PID pagemap and the global page_idle bitmap, if
>> something changes with the page then the information is not accurate.
>> More over looking up PFN from pagemap in Android devices is not
>> supported by unprivileged process and requires SYS_ADMIN and gives 0 for
>> the PFN.
>>
>> This patch adds support to directly interact with page_idle tracking at
>> the PID level by introducing a /proc/<pid>/page_idle file. This
>> eliminates the need for userspace to calculate the mapping of the page.
>> It follows the exact same semantics as the global
>> /sys/kernel/mm/page_idle, however it is easier to use for some usecases
>> where looking up PFN is not needed and also does not require SYS_ADMIN.
>> It ended up simplifying userspace code, solving the security issue
>> mentioned and works quite well. SELinux does not need to be turned off
>> since no pagemap look up is needed.
>>
>> In Android, we are using this for the heap profiler (heapprofd) which
>> profiles and pin points code paths which allocates and leaves memory
>> idle for long periods of time.
>>
>> Documentation material:
>> The idle page tracking API for virtual address indexing using virtual page
>> frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
>> that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
>> except that it uses virtual instead of physical frame numbers.
>>
>> This idle page tracking API can be simpler to use than physical address
>> indexing, since the pagemap for a process does not need to be looked up
>> to mark or read a page's idle bit. It is also more accurate than
>> physical address indexing since in physical address indexing, address
>> space changes can occur between reading the pagemap and reading the
>> bitmap. In virtual address indexing, the process's mmap_sem is held for
>> the duration of the access.
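
For readers following the indexing described above, here is a minimal
userspace sketch of how a virtual address maps to a position in the
/proc/<pid>/page_idle bitmap. It is an illustration under stated
assumptions, not code from the patch: the names idle_pos, idle_pos_for_addr
and idle_bit_is_set are hypothetical, and PAGE_SHIFT is hard-coded for
4 KiB pages (a real tool would query the page size at run time). The
arithmetic mirrors page_idle_get_frames() in the patch, where the first
frame covered by an access is pos * BITS_PER_BYTE and both pos and count
must be multiples of 8 bytes.

```c
#include <stdint.h>

#define PAGE_SHIFT   12  /* assumption: 4 KiB pages */
#define CHUNK_BYTES  8   /* sizeof(u64), BITMAP_CHUNK_SIZE in the patch */
#define CHUNK_BITS   64  /* BITMAP_CHUNK_BITS in the patch */

/* Hypothetical helper type: where a page's idle bit lives in the file. */
struct idle_pos {
	uint64_t offset;   /* byte offset of the 8-byte chunk to read or write */
	unsigned int bit;  /* bit index inside that chunk */
};

/* Map a virtual address to its chunk offset and bit in the bitmap. */
static struct idle_pos idle_pos_for_addr(uint64_t vaddr)
{
	uint64_t vfn = vaddr >> PAGE_SHIFT;  /* virtual frame number */
	struct idle_pos p;

	/* Each 8-byte chunk covers 64 frames, so offsets stay 8-byte aligned. */
	p.offset = (vfn / CHUNK_BITS) * CHUNK_BYTES;
	p.bit = (unsigned int)(vfn % CHUNK_BITS);
	return p;
}

/* Test one bit of a chunk that was read as a native-endian u64. */
static int idle_bit_is_set(uint64_t chunk, unsigned int bit)
{
	return (int)((chunk >> bit) & 1);
}
```

A caller would read or write 8 bytes at idle_pos.offset in the per-pid
file and then set or test idle_pos.bit in that value. Accesses that are not
8-byte aligned get -EINVAL from the helper in the patch.
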
> 
> Maybe integrate this into existing interface: /proc/pid/clear_refs and
> /proc/pid/pagemap ?
> 
> I.e.  echo X > /proc/pid/clear_refs clears reference bits in ptes and
> marks pages idle only for pages mapped in this process.
> And idle bit in /proc/pid/pagemap tells that page is still idle in this process.
> This is faster - we don't need to walk whole rmap for that.

Moreover, this is so cheap that it could be counted and shown in smaps.
Unlike clearing the real access bits, this does not disrupt the memory reclaimer.
Killer feature.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-23 10:10   ` Konstantin Khlebnikov
@ 2019-07-23 13:47     ` Joel Fernandes
  0 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes @ 2019-07-23 13:47 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, Jonathan Corbet, Kees Cook, Kirill Tkhai, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, minchan,
	minchan, namhyung, sspatil, surenb, Thomas Gleixner, timmurray,
	tkjos, Vlastimil Babka, wvw

On Tue, Jul 23, 2019 at 01:10:05PM +0300, Konstantin Khlebnikov wrote:
> On 23.07.2019 11:43, Konstantin Khlebnikov wrote:
> > On 23.07.2019 0:32, Joel Fernandes (Google) wrote:
> > > The page_idle tracking feature currently requires looking up the pagemap
> > > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > > This is quite cumbersome and can be error-prone too. If between
> > > accessing the per-PID pagemap and the global page_idle bitmap, if
> > > something changes with the page then the information is not accurate.
> > > More over looking up PFN from pagemap in Android devices is not
> > > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > > the PFN.
> > > 
> > > This patch adds support to directly interact with page_idle tracking at
> > > the PID level by introducing a /proc/<pid>/page_idle file. This
> > > eliminates the need for userspace to calculate the mapping of the page.
> > > It follows the exact same semantics as the global
> > > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > > It ended up simplifying userspace code, solving the security issue
> > > mentioned and works quite well. SELinux does not need to be turned off
> > > since no pagemap look up is needed.
> > > 
> > > In Android, we are using this for the heap profiler (heapprofd) which
> > > profiles and pin points code paths which allocates and leaves memory
> > > idle for long periods of time.
> > > 
> > > Documentation material:
> > > The idle page tracking API for virtual address indexing using virtual page
> > > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > > except that it uses virtual instead of physical frame numbers.
> > > 
> > > This idle page tracking API can be simpler to use than physical address
> > > indexing, since the pagemap for a process does not need to be looked up
> > > to mark or read a page's idle bit. It is also more accurate than
> > > physical address indexing since in physical address indexing, address
> > > space changes can occur between reading the pagemap and reading the
> > > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > > the duration of the access.
> > 
> > Maybe integrate this into existing interface: /proc/pid/clear_refs and
> > /proc/pid/pagemap ?
> > 
> > I.e.  echo X > /proc/pid/clear_refs clears reference bits in ptes and
> > marks pages idle only for pages mapped in this process.
> > And idle bit in /proc/pid/pagemap tells that page is still idle in this process.
> > This is faster - we don't need to walk whole rmap for that.
> 
> Moreover, this is so cheap so could be counted and shown in smaps.
> Unlike to clearing real access bits this does not disrupt memory reclaimer.
> Killer feature.

I replied to your patch:
https://lore.kernel.org/lkml/20190723134647.GA104199@google.com/T/#med8992e75c32d9c47f95b119d24a43ded36420bc



* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-23  6:13 ` Minchan Kim
@ 2019-07-23 14:20   ` Joel Fernandes
  2019-07-24  4:28     ` Minchan Kim
  0 siblings, 1 reply; 18+ messages in thread
From: Joel Fernandes @ 2019-07-23 14:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, Jonathan Corbet, Kees Cook, Kirill Tkhai,
	Konstantin Khlebnikov, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, namhyung, sspatil, surenb,
	Thomas Gleixner, timmurray, tkjos, Vlastimil Babka, wvw

On Tue, Jul 23, 2019 at 03:13:58PM +0900, Minchan Kim wrote:
> Hi Joel,
> 
> On Mon, Jul 22, 2019 at 05:32:04PM -0400, Joel Fernandes (Google) wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > This is quite cumbersome and can be error-prone too. If between
> 
> cumbersome: That's the fair tradeoff between idle page tracking and
> clear_refs because idle page tracking could check even though the page
> is not mapped.

It is a fair tradeoff, but it could be made simpler. The userspace code was
reduced by a good amount as well.
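
As a rough illustration of why the userspace side shrinks (a sketch only, not
the actual heapprofd code): with virtual indexing, a page's bit can be located
from the virtual address alone, with no pagemap lookup. The struct and
function names below are hypothetical; the 8-byte chunk layout is assumed from
BITMAP_CHUNK_SIZE in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define IDLE_CHUNK_BYTES 8u                      /* sizeof(u64), like BITMAP_CHUNK_SIZE */
#define IDLE_CHUNK_BITS  (IDLE_CHUNK_BYTES * 8u) /* 64 frames per chunk */

/* Hypothetical helper type: where a virtual address's idle bit lives. */
struct idle_pos {
	uint64_t file_off;  /* byte offset of the 8-byte chunk in the file */
	unsigned int bit;   /* bit index inside that chunk */
};

static struct idle_pos idle_pos_for_vaddr(uint64_t vaddr, unsigned int page_shift)
{
	struct idle_pos p;
	uint64_t vfn;

	assert(page_shift > 0 && page_shift < 64);
	vfn = vaddr >> page_shift;  /* virtual frame number */
	p.file_off = (vfn / IDLE_CHUNK_BITS) * IDLE_CHUNK_BYTES;
	p.bit = (unsigned int)(vfn % IDLE_CHUNK_BITS);
	return p;
}
```

The resulting file_off would be passed to pread()/pwrite() on
/proc/<pid>/page_idle, one 8-byte chunk at a time, since the patch rejects
offsets and counts that are not multiples of the chunk size.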

> error-prone: What's the error?

In normal Android usage, we see that pages sometimes appear not to be idle
even when they really are idle. This is hard to reproduce and happens at
random. With this new interface, we are seeing it happen much less often.

> > accessing the per-PID pagemap and the global page_idle bitmap, if
> > something changes with the page then the information is not accurate.
> 
> What you mean with error is this timing issue?
> Why do you need to be accurate? IOW, accurate is always good but what's
> the scale of the accuracy?

There is a time window between looking up the pagemap and checking whether
the page is idle. Anyway, see below for the primary goals, as you asked:

> > More over looking up PFN from pagemap in Android devices is not
> > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file. This
> > eliminates the need for userspace to calculate the mapping of the page.
> > It follows the exact same semantics as the global
> > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > where looking up PFN is not needed and also does not require SYS_ADMIN.
> 
> Ah, so the primary goal is to provide convinience interface and it would
> help accurary, too. IOW, accuracy is not your main goal?

There are a few primary goals: security, convenience, and also solving
the accuracy/reliability problem we are seeing. Do keep in mind that looking
up the PFN has security implications. The PFN field in pagemap is zeroed if
the user does not have CAP_SYS_ADMIN.

> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time.
> 
> So the goal is to detect idle pages with idle memory tracking?

Isn't that what idle memory tracking does?

> It couldn't work well because such idle pages could finally swap out and
> lose every flags of the page descriptor which is working mechanism of
> idle page tracking. It should have named "workingset page tracking",
> not "idle page tracking".

The heap profiler uses page-idle tracking not to measure the working set,
but to look for pages that are idle for long periods of time.

Thanks for bringing up the swapping corner case. Perhaps we can improve
the heap profiler to detect this by looking at the swap bits in pagemap
(bits 0-4 hold the swap type when the page is swapped). While it is true
that we would lose access information during the window, there is a high
likelihood that the page was not accessed, which is why it was swapped.
Thoughts?
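
For reference, a minimal sketch of how a profiler might classify a 64-bit
pagemap entry before trusting the idle bit. The bit positions follow
Documentation/admin-guide/mm/pagemap.rst (bit 63 present, bit 62 swapped,
bits 0-4 swap type and bits 5-54 swap offset when swapped); the enum and
helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PM_PRESENT        (1ULL << 63)
#define PM_SWAPPED        (1ULL << 62)
#define PM_SWAP_TYPE_MASK 0x1fULL             /* bits 0-4, valid only if swapped */
#define PM_SWAP_OFF_SHIFT 5                   /* swap offset starts at bit 5 */
#define PM_FRAME_MASK     ((1ULL << 55) - 1)  /* bits 0-54 */

/* Hypothetical classification used only in this sketch. */
enum pm_state { PM_STATE_NONE, PM_STATE_RESIDENT, PM_STATE_SWAPPED };

static enum pm_state pm_classify(uint64_t ent)
{
	if (ent & PM_PRESENT)
		return PM_STATE_RESIDENT;
	if (ent & PM_SWAPPED)
		return PM_STATE_SWAPPED;
	return PM_STATE_NONE;
}

/* The swap fields are only meaningful when the swapped bit is set. */
static unsigned int pm_swap_type(uint64_t ent)
{
	assert(ent & PM_SWAPPED);
	return (unsigned int)(ent & PM_SWAP_TYPE_MASK);
}

static uint64_t pm_swap_offset(uint64_t ent)
{
	assert(ent & PM_SWAPPED);
	return (ent & PM_FRAME_MASK) >> PM_SWAP_OFF_SHIFT;
}
```

A profiler could then treat PM_STATE_SWAPPED pages as "probably idle" instead
of relying on an idle flag that was lost when the page was swapped out.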

thanks,

 - Joel





* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-23  6:05 ` Michal Hocko
@ 2019-07-23 14:34   ` Joel Fernandes
  0 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes @ 2019-07-23 14:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, Jonathan Corbet, Kees Cook, Kirill Tkhai,
	Konstantin Khlebnikov, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, minchan, namhyung, sspatil, surenb,
	Thomas Gleixner, timmurray, tkjos, Vlastimil Babka, wvw,
	linux-api

On Tue, Jul 23, 2019 at 08:05:25AM +0200, Michal Hocko wrote:
> [Cc linux-api - please always do CC this list when introducing a user
>  visible API]

Sorry, will do.

> On Mon 22-07-19 17:32:04, Joel Fernandes (Google) wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > This is quite cumbersome and can be error-prone too. If between
> > accessing the per-PID pagemap and the global page_idle bitmap, if
> > something changes with the page then the information is not accurate.
> > More over looking up PFN from pagemap in Android devices is not
> > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file. This
> > eliminates the need for userspace to calculate the mapping of the page.
> > It follows the exact same semantics as the global
> > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > It ended up simplifying userspace code, solving the security issue
> > mentioned and works quite well. SELinux does not need to be turned off
> > since no pagemap look up is needed.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time.
> > 
> > Documentation material:
> > The idle page tracking API for virtual address indexing using virtual page
> > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > except that it uses virtual instead of physical frame numbers.
> > 
> > This idle page tracking API can be simpler to use than physical address
> > indexing, since the pagemap for a process does not need to be looked up
> > to mark or read a page's idle bit. It is also more accurate than
> > physical address indexing since in physical address indexing, address
> > space changes can occur between reading the pagemap and reading the
> > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > the duration of the access.
> 
> I didn't get to read the actual code but the overall idea makes sense to
> me. I can see this being useful for userspace memory management (along
> with remote MADV_PAGEOUT, MADV_COLD).

Thanks.

> Normally I would object that a cumbersome nature of the existing
> interface can be hidden in a userspace but I do agree that rowhammer has
> made this one close to unusable for anything but a privileged process.

Agreed, this is one of the primary motivations for the patch as you said.

> I do not think you can make any argument about accuracy because
> the information will never be accurate. Sure the race window is smaller
> in principle but you can hardly say anything about how much or whether
> at all.

Sure, fair enough. That is why I wasn't beating the drum too much on the
accuracy point. However, this surprisingly does work quite well.

thanks,

 - Joel



* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-22 22:06 ` [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing Andrew Morton
@ 2019-07-23 14:43   ` Joel Fernandes
  2019-07-24 19:33   ` Joel Fernandes
  1 sibling, 0 replies; 18+ messages in thread
From: Joel Fernandes @ 2019-07-23 14:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, carmenjackson, Christian Hansen,
	Colin Ian King, dancol, David Howells, fmayer, joaodias,
	Jonathan Corbet, Kees Cook, Kirill Tkhai, Konstantin Khlebnikov,
	linux-doc, linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport,
	minchan, minchan, namhyung, sspatil, surenb, Thomas Gleixner,
	timmurray, tkjos, Vlastimil Babka, wvw

On Mon, Jul 22, 2019 at 03:06:39PM -0700, Andrew Morton wrote:
> On Mon, 22 Jul 2019 17:32:04 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> 
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > This is quite cumbersome and can be error-prone too. If between
> > accessing the per-PID pagemap and the global page_idle bitmap, if
> > something changes with the page then the information is not accurate.
> 
> Well, it's never going to be "accurate" - something could change one
> nanosecond after userspace has read the data...
> 
> Presumably with this approach the data will be "more" accurate.  How
> big a problem has this inaccuracy proven to be in real-world usage?

It has proven to be quite a thorn. But the security issue is the main problem..

> > More over looking up PFN from pagemap in Android devices is not
> > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > the PFN.

..as mentioned here.

I should have emphasized the security issue more; I will do so in the next
revision.

> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file. This
> > eliminates the need for userspace to calculate the mapping of the page.
> > It follows the exact same semantics as the global
> > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > It ended up simplifying userspace code, solving the security issue
> > mentioned and works quite well. SELinux does not need to be turned off
> > since no pagemap look up is needed.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time.
> > 
> > Documentation material:
> > The idle page tracking API for virtual address indexing using virtual page
> > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > except that it uses virtual instead of physical frame numbers.
> > 
> > This idle page tracking API can be simpler to use than physical address
> > indexing, since the pagemap for a process does not need to be looked up
> > to mark or read a page's idle bit. It is also more accurate than
> > physical address indexing since in physical address indexing, address
> > space changes can occur between reading the pagemap and reading the
> > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > the duration of the access.
> > 
> > ...
> >
> > --- a/mm/page_idle.c
> > +++ b/mm/page_idle.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/page_ext.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/sched/mm.h>
> >  
> >  #define BITMAP_CHUNK_SIZE	sizeof(u64)
> >  #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> > @@ -28,15 +29,12 @@
> >   *
> >   * This function tries to get a user memory page by pfn as described above.
> >   */
> 
> Above comment needs updating or moving?
> 
> > -static struct page *page_idle_get_page(unsigned long pfn)
> > +static struct page *page_idle_get_page(struct page *page_in)
> >  {
> >  	struct page *page;
> >  	pg_data_t *pgdat;
> >  
> > -	if (!pfn_valid(pfn))
> > -		return NULL;
> > -
> > -	page = pfn_to_page(pfn);
> > +	page = page_in;
> >  	if (!page || !PageLRU(page) ||
> >  	    !get_page_unless_zero(page))
> >  		return NULL;
> >
> > ...
> >
> > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > +				unsigned long *start, unsigned long *end)
> > +{
> > +	unsigned long max_frame;
> > +
> > +	/* If an mm is not given, assume we want physical frames */
> > +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > +
> > +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > +		return -EINVAL;
> > +
> > +	*start = pos * BITS_PER_BYTE;
> > +	if (*start >= max_frame)
> > +		return -ENXIO;
> 
> Is said to mean "The system tried to use the device represented by a
> file you specified, and it couldnt find the device.  This can mean that
> the device file was installed incorrectly, or that the physical device
> is missing or not correctly attached to the computer."
> 
> This doesn't seem appropriate in this usage and is hence possibly
> misleading.  Someone whose application fails with ENXIO will be
> scratching their heads.

This actually keeps it consistent with the current code. I refactored that
code a bit and I'm reusing parts of it to keep the line count down. See
page_idle_bitmap_write(), where it returns -ENXIO in current upstream.

However, note that in page_idle_bitmap_read() I am actually returning 0 if
page_idle_get_frames() returns -ENXIO:

+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret == -ENXIO)
+		return 0;  /* Reads beyond max_pfn do nothing */

The reason I do it this way is that I am using page_idle_get_frames() in both
the old code and the new code. A bit confusing, I know! But it is the cleanest
way I could find to keep this code common.
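
To make the split concrete, here is a small userspace model of the helper and
its two kinds of caller. It is only a sketch: get_frames() mirrors the
arithmetic of page_idle_get_frames() from the patch, while model_read() and
model_write() are made-up stand-ins that return the number of frames covered
instead of doing any page work.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_BYTES   8u  /* stands in for BITMAP_CHUNK_SIZE */
#define BITS_PER_BYTE 8u

/* Shared helper: 0 on success, -EINVAL if unaligned, -ENXIO if past the end. */
static int get_frames(uint64_t pos, uint64_t count, uint64_t max_frame,
		      uint64_t *start, uint64_t *end)
{
	assert(start != NULL && end != NULL);
	if (pos % CHUNK_BYTES || count % CHUNK_BYTES)
		return -EINVAL;
	*start = pos * BITS_PER_BYTE;
	if (*start >= max_frame)
		return -ENXIO;
	*end = *start + count * BITS_PER_BYTE;
	if (*end > max_frame)
		*end = max_frame;  /* clamp the range to the last valid frame */
	return 0;
}

/* Read path: a read past the end is not an error, it just returns 0. */
static long model_read(uint64_t pos, uint64_t count, uint64_t max_frame)
{
	uint64_t start, end;
	int ret = get_frames(pos, count, max_frame, &start, &end);

	if (ret == -ENXIO)
		return 0;
	if (ret)
		return ret;
	return (long)(end - start);
}

/* Write path: every error from the helper, including -ENXIO, is passed on. */
static long model_write(uint64_t pos, uint64_t count, uint64_t max_frame)
{
	uint64_t start, end;
	int ret = get_frames(pos, count, max_frame, &start, &end);

	if (ret)
		return ret;
	return (long)(end - start);
}
```

So the helper always reports -ENXIO for an out-of-range start, and only the
read caller translates that into an end-of-file style 0.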

> > +	*end = *start + count * BITS_PER_BYTE;
> > +	if (*end > max_frame)
> > +		*end = max_frame;
> > +	return 0;
> > +}
> > +
> >
> > ...
> >
> > +static void add_page_idle_list(struct page *page,
> > +			       unsigned long addr, struct mm_walk *walk)
> > +{
> > +	struct page *page_get;
> > +	struct page_node *pn;
> > +	int bit;
> > +	unsigned long frames;
> > +	struct page_idle_proc_priv *priv = walk->private;
> > +	u64 *chunk = (u64 *)priv->buffer;
> > +
> > +	if (priv->write) {
> > +		/* Find whether this page was asked to be marked */
> > +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> > +		bit = frames % BITMAP_CHUNK_BITS;
> > +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> > +		if (((*chunk >> bit) & 1) == 0)
> > +			return;
> > +	}
> > +
> > +	page_get = page_idle_get_page(page);
> > +	if (!page_get)
> > +		return;
> > +
> > +	pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
> 
> I'm not liking this GFP_ATOMIC.  If I'm reading the code correctly,
> userspace can ask for an arbitrarily large number of GFP_ATOMIC
> allocations by doing a large read.  This can potentially exhaust page
> reserves which things like networking Rx interrupts need and can make
> this whole feature less reliable.

Ok, I will look into this more and possibly do the allocation another way.
Spinlocks are held at that point, hence the GFP_ATOMIC.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-23 14:20   ` Joel Fernandes
@ 2019-07-24  4:28     ` Minchan Kim
  2019-07-24 14:10       ` Joel Fernandes
  0 siblings, 1 reply; 18+ messages in thread
From: Minchan Kim @ 2019-07-24  4:28 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, Jonathan Corbet, Kees Cook, Kirill Tkhai,
	Konstantin Khlebnikov, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, namhyung, sspatil, surenb,
	Thomas Gleixner, timmurray, tkjos, Vlastimil Babka, wvw

On Tue, Jul 23, 2019 at 10:20:49AM -0400, Joel Fernandes wrote:
> On Tue, Jul 23, 2019 at 03:13:58PM +0900, Minchan Kim wrote:
> > Hi Joel,
> > 
> > On Mon, Jul 22, 2019 at 05:32:04PM -0400, Joel Fernandes (Google) wrote:
> > > The page_idle tracking feature currently requires looking up the pagemap
> > > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > > This is quite cumbersome and can be error-prone too. If between
> > 
> > cumbersome: That's the fair tradeoff between idle page tracking and
> > clear_refs because idle page tracking could check even though the page
> > is not mapped.
> 
> It is a fair tradeoff, but it could be made simpler. The userspace code got
> reduced by a good amount as well.
> 
> > error-prone: What's the error?
> 
> We see in normal Android usage that pages sometimes appear not to be idle
> even when they really are. Reproducing this is a bit unpredictable; it
> happens at random. With this new interface, we see this happen much
> less often.

I don't know how you tested. Maybe it is caused by pages being swapped out,
by shared pages being touched by other processes, or by some kernel path that
does not preserve the access bit.
Please investigate the root cause further. That would be an important
point in justifying the motivation for the patch.

> 
> > > accessing the per-PID pagemap and the global page_idle bitmap, if
> > > something changes with the page then the information is not accurate.
> > 
> > By error, do you mean this timing issue?
> > Why do you need to be accurate? IOW, accuracy is always good but what's
> > the scale of the accuracy?
> 
> There is a time window between looking up pagemap and checking if page is
> idle. Anyway, see below for the primary goals as you asked:
> 
> > > More over looking up PFN from pagemap in Android devices is not
> > > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > > the PFN.
> > > 
> > > This patch adds support to directly interact with page_idle tracking at
> > > the PID level by introducing a /proc/<pid>/page_idle file. This
> > > eliminates the need for userspace to calculate the mapping of the page.
> > > It follows the exact same semantics as the global
> > > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > 
> > Ah, so the primary goal is to provide a convenience interface and it would
> > help accuracy, too. IOW, accuracy is not your main goal?
> 
> There are a couple of primary goals: security, convenience and also solving
> the accuracy/reliability problem we are seeing. Do keep in mind looking up
> PFN has security implications. The PFN field in pagemap is zeroed if the user
> does not have CAP_SYS_ADMIN.

Maybe you don't need the PFN, do you?

> 
> > > In Android, we are using this for the heap profiler (heapprofd) which
> > > profiles and pin points code paths which allocates and leaves memory
> > > idle for long periods of time.
> > 
> > So the goal is to detect idle pages with idle memory tracking?
> 
> Isn't that what idle memory tracking does?

To me, it's rather misleading. Please read the motivation section in the
document. The feature is good at detecting workingset pages, not idle pages,
because workingset pages are never freed or swapped out, and we could even
account for newly allocated pages.

Motivation
==========

The idle page tracking feature allows to track which memory pages are being
accessed by a workload and which are idle. This information can be useful for
estimating the workload's working set size, which, in turn, can be taken into
account when configuring the workload parameters, setting memory cgroup limits,
or deciding where to place the workload within a compute cluster.

> 
> > It couldn't work well because such idle pages could eventually be swapped
> > out and lose every flag in the page descriptor, which is the working
> > mechanism of idle page tracking. It should have been named "workingset
> > page tracking", not "idle page tracking".
> 
> The heap profiler that uses page-idle tracking is not meant to measure the
> working set, but to look for pages that are idle for long periods of time.

That's an important part. Please include it in the description so that people
understand the use case. As I said above, if the aim is to find pages that
stay idle during the period, the current idle page tracking feature is,
ironically, not a good fit.

> 
> Thanks for bringing up the swapping corner case..  Perhaps we can improve
> the heap profiler to detect this by looking at bits 0-4 in pagemap. While it

Yes, that could work, but wouldn't it add back the overhead you want to remove?
Userspace would even have to keep metadata to tell whether a page was already
swapped in the last period or newly swapped in the new period.

> is true that we would lose access information during the window, there is a
> high likelihood that the page was not accessed which is why it was swapped.
> Thoughts?

It depends on the system memory size, the workingset size and your sampling
period. It would never be a corner case on small-memory systems, as they want
to use memory more efficiently.

> 
> thanks,
> 
>  - Joel
> 
> 
> 
> > > Documentation material:
> > > The idle page tracking API for virtual address indexing using virtual page
> > > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > > except that it uses virtual instead of physical frame numbers.
> > > 
> > > This idle page tracking API can be simpler to use than physical address
> > > indexing, since the pagemap for a process does not need to be looked up
> > > to mark or read a page's idle bit. It is also more accurate than
> > > physical address indexing since in physical address indexing, address
> > > space changes can occur between reading the pagemap and reading the
> > > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > > the duration of the access.
> > > 
> > > Cc: vdavydov.dev@gmail.com
> > > Cc: Brendan Gregg <bgregg@netflix.com>
> > > Cc: kernel-team@android.com
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > 
> > > ---
> > > Internal review -> v1:
> > > Fixes from Suren.
> > > Corrections to change log, docs (Florian, Sandeep)
> > > 
> > >  fs/proc/base.c            |   3 +
> > >  fs/proc/internal.h        |   1 +
> > >  fs/proc/task_mmu.c        |  57 +++++++
> > >  include/linux/page_idle.h |   4 +
> > >  mm/page_idle.c            | 305 +++++++++++++++++++++++++++++++++-----
> > >  5 files changed, 330 insertions(+), 40 deletions(-)
> > > 
> > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > index 77eb628ecc7f..a58dd74606e9 100644
> > > --- a/fs/proc/base.c
> > > +++ b/fs/proc/base.c
> > > @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
> > >  	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
> > >  	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
> > >  	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
> > > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > > +	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
> > > +#endif
> > >  #endif
> > >  #ifdef CONFIG_SECURITY
> > >  	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
> > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > > index cd0c8d5ce9a1..bc9371880c63 100644
> > > --- a/fs/proc/internal.h
> > > +++ b/fs/proc/internal.h
> > > @@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
> > >  extern const struct file_operations proc_pid_smaps_rollup_operations;
> > >  extern const struct file_operations proc_clear_refs_operations;
> > >  extern const struct file_operations proc_pagemap_operations;
> > > +extern const struct file_operations proc_page_idle_operations;
> > >  
> > >  extern unsigned long task_vsize(struct mm_struct *);
> > >  extern unsigned long task_statm(struct mm_struct *,
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 4d2b860dbc3f..11ccc53da38e 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
> > >  	.open		= pagemap_open,
> > >  	.release	= pagemap_release,
> > >  };
> > > +
> > > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > > +static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
> > > +				   size_t count, loff_t *ppos)
> > > +{
> > > +	int ret;
> > > +	struct task_struct *tsk = get_proc_task(file_inode(file));
> > > +
> > > +	if (!tsk)
> > > +		return -EINVAL;
> > > +	ret = page_idle_proc_read(file, buf, count, ppos, tsk);
> > > +	put_task_struct(tsk);
> > > +	return ret;
> > > +}
> > > +
> > > +static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
> > > +				 size_t count, loff_t *ppos)
> > > +{
> > > +	int ret;
> > > +	struct task_struct *tsk = get_proc_task(file_inode(file));
> > > +
> > > +	if (!tsk)
> > > +		return -EINVAL;
> > > +	ret = page_idle_proc_write(file, (char __user *)buf, count, ppos, tsk);
> > > +	put_task_struct(tsk);
> > > +	return ret;
> > > +}
> > > +
> > > +static int proc_page_idle_open(struct inode *inode, struct file *file)
> > > +{
> > > +	struct mm_struct *mm;
> > > +
> > > +	mm = proc_mem_open(inode, PTRACE_MODE_READ);
> > > +	if (IS_ERR(mm))
> > > +		return PTR_ERR(mm);
> > > +	file->private_data = mm;
> > > +	return 0;
> > > +}
> > > +
> > > +static int proc_page_idle_release(struct inode *inode, struct file *file)
> > > +{
> > > +	struct mm_struct *mm = file->private_data;
> > > +
> > > +	if (mm)
> > > +		mmdrop(mm);
> > > +	return 0;
> > > +}
> > > +
> > > +const struct file_operations proc_page_idle_operations = {
> > > +	.llseek		= mem_lseek, /* borrow this */
> > > +	.read		= proc_page_idle_read,
> > > +	.write		= proc_page_idle_write,
> > > +	.open		= proc_page_idle_open,
> > > +	.release	= proc_page_idle_release,
> > > +};
> > > +#endif /* CONFIG_IDLE_PAGE_TRACKING */
> > > +
> > >  #endif /* CONFIG_PROC_PAGE_MONITOR */
> > >  
> > >  #ifdef CONFIG_NUMA
> > > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > > index 1e894d34bdce..f1bc2640d85e 100644
> > > --- a/include/linux/page_idle.h
> > > +++ b/include/linux/page_idle.h
> > > @@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
> > >  }
> > >  #endif /* CONFIG_64BIT */
> > >  
> > > +ssize_t page_idle_proc_write(struct file *file,
> > > +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> > > +ssize_t page_idle_proc_read(struct file *file,
> > > +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> > >  #else /* !CONFIG_IDLE_PAGE_TRACKING */
> > >  
> > >  static inline bool page_is_young(struct page *page)
> > > diff --git a/mm/page_idle.c b/mm/page_idle.c
> > > index 295512465065..874a60c41fef 100644
> > > --- a/mm/page_idle.c
> > > +++ b/mm/page_idle.c
> > > @@ -11,6 +11,7 @@
> > >  #include <linux/mmu_notifier.h>
> > >  #include <linux/page_ext.h>
> > >  #include <linux/page_idle.h>
> > > +#include <linux/sched/mm.h>
> > >  
> > >  #define BITMAP_CHUNK_SIZE	sizeof(u64)
> > >  #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> > > @@ -28,15 +29,12 @@
> > >   *
> > >   * This function tries to get a user memory page by pfn as described above.
> > >   */
> > > -static struct page *page_idle_get_page(unsigned long pfn)
> > > +static struct page *page_idle_get_page(struct page *page_in)
> > >  {
> > >  	struct page *page;
> > >  	pg_data_t *pgdat;
> > >  
> > > -	if (!pfn_valid(pfn))
> > > -		return NULL;
> > > -
> > > -	page = pfn_to_page(pfn);
> > > +	page = page_in;
> > >  	if (!page || !PageLRU(page) ||
> > >  	    !get_page_unless_zero(page))
> > >  		return NULL;
> > > @@ -51,6 +49,15 @@ static struct page *page_idle_get_page(unsigned long pfn)
> > >  	return page;
> > >  }
> > >  
> > > +static struct page *page_idle_get_page_pfn(unsigned long pfn)
> > > +{
> > > +
> > > +	if (!pfn_valid(pfn))
> > > +		return NULL;
> > > +
> > > +	return page_idle_get_page(pfn_to_page(pfn));
> > > +}
> > > +
> > >  static bool page_idle_clear_pte_refs_one(struct page *page,
> > >  					struct vm_area_struct *vma,
> > >  					unsigned long addr, void *arg)
> > > @@ -118,6 +125,47 @@ static void page_idle_clear_pte_refs(struct page *page)
> > >  		unlock_page(page);
> > >  }
> > >  
> > > +/* Helper to get the start and end frame given a pos and count */
> > > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > > +				unsigned long *start, unsigned long *end)
> > > +{
> > > +	unsigned long max_frame;
> > > +
> > > +	/* If an mm is not given, assume we want physical frames */
> > > +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > > +
> > > +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > > +		return -EINVAL;
> > > +
> > > +	*start = pos * BITS_PER_BYTE;
> > > +	if (*start >= max_frame)
> > > +		return -ENXIO;
> > > +
> > > +	*end = *start + count * BITS_PER_BYTE;
> > > +	if (*end > max_frame)
> > > +		*end = max_frame;
> > > +	return 0;
> > > +}
> > > +
> > > +static bool page_really_idle(struct page *page)
> > > +{
> > > +	if (!page)
> > > +		return false;
> > > +
> > > +	if (page_is_idle(page)) {
> > > +		/*
> > > +		 * The page might have been referenced via a
> > > +		 * pte, in which case it is not idle. Clear
> > > +		 * refs and recheck.
> > > +		 */
> > > +		page_idle_clear_pte_refs(page);
> > > +		if (page_is_idle(page))
> > > +			return true;
> > > +	}
> > > +
> > > +	return false;
> > > +}
> > > +
> > >  static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> > >  				     struct bin_attribute *attr, char *buf,
> > >  				     loff_t pos, size_t count)
> > > @@ -125,35 +173,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> > >  	u64 *out = (u64 *)buf;
> > >  	struct page *page;
> > >  	unsigned long pfn, end_pfn;
> > > -	int bit;
> > > -
> > > -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > > -		return -EINVAL;
> > > -
> > > -	pfn = pos * BITS_PER_BYTE;
> > > -	if (pfn >= max_pfn)
> > > -		return 0;
> > > +	int bit, ret;
> > >  
> > > -	end_pfn = pfn + count * BITS_PER_BYTE;
> > > -	if (end_pfn > max_pfn)
> > > -		end_pfn = max_pfn;
> > > +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> > > +	if (ret == -ENXIO)
> > > +		return 0;  /* Reads beyond max_pfn do nothing */
> > > +	else if (ret)
> > > +		return ret;
> > >  
> > >  	for (; pfn < end_pfn; pfn++) {
> > >  		bit = pfn % BITMAP_CHUNK_BITS;
> > >  		if (!bit)
> > >  			*out = 0ULL;
> > > -		page = page_idle_get_page(pfn);
> > > -		if (page) {
> > > -			if (page_is_idle(page)) {
> > > -				/*
> > > -				 * The page might have been referenced via a
> > > -				 * pte, in which case it is not idle. Clear
> > > -				 * refs and recheck.
> > > -				 */
> > > -				page_idle_clear_pte_refs(page);
> > > -				if (page_is_idle(page))
> > > -					*out |= 1ULL << bit;
> > > -			}
> > > +		page = page_idle_get_page_pfn(pfn);
> > > +		if (page && page_really_idle(page)) {
> > > +			*out |= 1ULL << bit;
> > >  			put_page(page);
> > >  		}
> > >  		if (bit == BITMAP_CHUNK_BITS - 1)
> > > @@ -170,23 +204,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
> > >  	const u64 *in = (u64 *)buf;
> > >  	struct page *page;
> > >  	unsigned long pfn, end_pfn;
> > > -	int bit;
> > > +	int bit, ret;
> > >  
> > > -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > > -		return -EINVAL;
> > > -
> > > -	pfn = pos * BITS_PER_BYTE;
> > > -	if (pfn >= max_pfn)
> > > -		return -ENXIO;
> > > -
> > > -	end_pfn = pfn + count * BITS_PER_BYTE;
> > > -	if (end_pfn > max_pfn)
> > > -		end_pfn = max_pfn;
> > > +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> > > +	if (ret)
> > > +		return ret;
> > >  
> > >  	for (; pfn < end_pfn; pfn++) {
> > >  		bit = pfn % BITMAP_CHUNK_BITS;
> > >  		if ((*in >> bit) & 1) {
> > > -			page = page_idle_get_page(pfn);
> > > +			page = page_idle_get_page_pfn(pfn);
> > >  			if (page) {
> > >  				page_idle_clear_pte_refs(page);
> > >  				set_page_idle(page);
> > > @@ -224,10 +251,208 @@ struct page_ext_operations page_idle_ops = {
> > >  };
> > >  #endif
> > >  
> > > +/*  page_idle tracking for /proc/<pid>/page_idle */
> > > +
> > > +static DEFINE_SPINLOCK(idle_page_list_lock);
> > > +struct list_head idle_page_list;
> > > +
> > > +struct page_node {
> > > +	struct page *page;
> > > +	unsigned long addr;
> > > +	struct list_head list;
> > > +};
> > > +
> > > +struct page_idle_proc_priv {
> > > +	unsigned long start_addr;
> > > +	char *buffer;
> > > +	int write;
> > > +};
> > > +
> > > +static void add_page_idle_list(struct page *page,
> > > +			       unsigned long addr, struct mm_walk *walk)
> > > +{
> > > +	struct page *page_get;
> > > +	struct page_node *pn;
> > > +	int bit;
> > > +	unsigned long frames;
> > > +	struct page_idle_proc_priv *priv = walk->private;
> > > +	u64 *chunk = (u64 *)priv->buffer;
> > > +
> > > +	if (priv->write) {
> > > +		/* Find whether this page was asked to be marked */
> > > +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> > > +		bit = frames % BITMAP_CHUNK_BITS;
> > > +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> > > +		if (((*chunk >> bit) & 1) == 0)
> > > +			return;
> > > +	}
> > > +
> > > +	page_get = page_idle_get_page(page);
> > > +	if (!page_get)
> > > +		return;
> > > +
> > > +	pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
> > > +	if (!pn)
> > > +		return;
> > > +
> > > +	pn->page = page_get;
> > > +	pn->addr = addr;
> > > +	list_add(&pn->list, &idle_page_list);
> > > +}
> > > +
> > > +static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
> > > +				    unsigned long end,
> > > +				    struct mm_walk *walk)
> > > +{
> > > +	struct vm_area_struct *vma = walk->vma;
> > > +	pte_t *pte;
> > > +	spinlock_t *ptl;
> > > +	struct page *page;
> > > +
> > > +	ptl = pmd_trans_huge_lock(pmd, vma);
> > > +	if (ptl) {
> > > +		if (pmd_present(*pmd)) {
> > > +			page = follow_trans_huge_pmd(vma, addr, pmd,
> > > +						     FOLL_DUMP|FOLL_WRITE);
> > > +			if (!IS_ERR_OR_NULL(page))
> > > +				add_page_idle_list(page, addr, walk);
> > > +		}
> > > +		spin_unlock(ptl);
> > > +		return 0;
> > > +	}
> > > +
> > > +	if (pmd_trans_unstable(pmd))
> > > +		return 0;
> > > +
> > > +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> > > +		if (!pte_present(*pte))
> > > +			continue;
> > > +
> > > +		page = vm_normal_page(vma, addr, *pte);
> > > +		if (page)
> > > +			add_page_idle_list(page, addr, walk);
> > > +	}
> > > +
> > > +	pte_unmap_unlock(pte - 1, ptl);
> > > +	return 0;
> > > +}
> > > +
> > > +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> > > +			       size_t count, loff_t *pos,
> > > +			       struct task_struct *tsk, int write)
> > > +{
> > > +	int ret;
> > > +	char *buffer;
> > > +	u64 *out;
> > > +	unsigned long start_addr, end_addr, start_frame, end_frame;
> > > +	struct mm_struct *mm = file->private_data;
> > > +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> > > +	struct page_node *cur, *next;
> > > +	struct page_idle_proc_priv priv;
> > > +	bool walk_error = false;
> > > +
> > > +	if (!mm || !mmget_not_zero(mm))
> > > +		return -EINVAL;
> > > +
> > > +	if (count > PAGE_SIZE)
> > > +		count = PAGE_SIZE;
> > > +
> > > +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > > +	if (!buffer) {
> > > +		ret = -ENOMEM;
> > > +		goto out_mmput;
> > > +	}
> > > +	out = (u64 *)buffer;
> > > +
> > > +	if (write && copy_from_user(buffer, ubuff, count)) {
> > > +		ret = -EFAULT;
> > > +		goto out;
> > > +	}
> > > +
> > > +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> > > +	if (ret)
> > > +		goto out;
> > > +
> > > +	start_addr = (start_frame << PAGE_SHIFT);
> > > +	end_addr = (end_frame << PAGE_SHIFT);
> > > +	priv.buffer = buffer;
> > > +	priv.start_addr = start_addr;
> > > +	priv.write = write;
> > > +	walk.private = &priv;
> > > +	walk.mm = mm;
> > > +
> > > +	down_read(&mm->mmap_sem);
> > > +
> > > +	/*
> > > +	 * Protects the idle_page_list which is needed because
> > > +	 * walk_page_vma() holds ptlock which deadlocks with
> > > +	 * page_idle_clear_pte_refs(). So we have to collect all
> > > +	 * pages first, and then call page_idle_clear_pte_refs().
> > > +	 */
> > > +	spin_lock(&idle_page_list_lock);
> > > +	ret = walk_page_range(start_addr, end_addr, &walk);
> > > +	if (ret)
> > > +		walk_error = true;
> > > +
> > > +	list_for_each_entry_safe(cur, next, &idle_page_list, list) {
> > > +		int bit, index;
> > > +		unsigned long off;
> > > +		struct page *page = cur->page;
> > > +
> > > +		if (unlikely(walk_error))
> > > +			goto remove_page;
> > > +
> > > +		if (write) {
> > > +			page_idle_clear_pte_refs(page);
> > > +			set_page_idle(page);
> > > +		} else {
> > > +			if (page_really_idle(page)) {
> > > +				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
> > > +				bit = off % BITMAP_CHUNK_BITS;
> > > +				index = off / BITMAP_CHUNK_BITS;
> > > +				out[index] |= 1ULL << bit;
> > > +			}
> > > +		}
> > > +remove_page:
> > > +		put_page(page);
> > > +		list_del(&cur->list);
> > > +		kfree(cur);
> > > +	}
> > > +	spin_unlock(&idle_page_list_lock);
> > > +
> > > +	if (!write && !walk_error)
> > > +		ret = copy_to_user(ubuff, buffer, count);
> > > +
> > > +	up_read(&mm->mmap_sem);
> > > +out:
> > > +	kfree(buffer);
> > > +out_mmput:
> > > +	mmput(mm);
> > > +	if (!ret)
> > > +		ret = count;
> > > +	return ret;
> > > +
> > > +}
> > > +
> > > +ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
> > > +			    size_t count, loff_t *pos, struct task_struct *tsk)
> > > +{
> > > +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
> > > +}
> > > +
> > > +ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
> > > +			     size_t count, loff_t *pos, struct task_struct *tsk)
> > > +{
> > > +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
> > > +}
> > > +
> > >  static int __init page_idle_init(void)
> > >  {
> > >  	int err;
> > >  
> > > +	INIT_LIST_HEAD(&idle_page_list);
> > > +
> > >  	err = sysfs_create_group(mm_kobj, &page_idle_attr_group);
> > >  	if (err) {
> > >  		pr_err("page_idle: register sysfs failed\n");
> > > -- 
> > > 2.22.0.657.g960e92d24f-goog

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-24  4:28     ` Minchan Kim
@ 2019-07-24 14:10       ` Joel Fernandes
  2019-07-25  8:15         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 18+ messages in thread
From: Joel Fernandes @ 2019-07-24 14:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, Jonathan Corbet, Kees Cook, Kirill Tkhai,
	Konstantin Khlebnikov, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, namhyung, sspatil, surenb,
	Thomas Gleixner, timmurray, tkjos, Vlastimil Babka, wvw

On Wed, Jul 24, 2019 at 01:28:42PM +0900, Minchan Kim wrote:
> On Tue, Jul 23, 2019 at 10:20:49AM -0400, Joel Fernandes wrote:
> > On Tue, Jul 23, 2019 at 03:13:58PM +0900, Minchan Kim wrote:
> > > Hi Joel,
> > > 
> > > On Mon, Jul 22, 2019 at 05:32:04PM -0400, Joel Fernandes (Google) wrote:
> > > > The page_idle tracking feature currently requires looking up the pagemap
> > > > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > > > This is quite cumbersome and can be error-prone too. If between
> > > 
> > > cumbersome: That's the fair tradeoff between idle page tracking and
> > > clear_refs because idle page tracking could check even though the page
> > > is not mapped.
> > 
> > It is a fair tradeoff, but it could be made simpler. The userspace code got
> > reduced by a good amount as well.
> > 
> > > error-prone: What's the error?
> > 
> > We see in normal Android usage that pages sometimes appear not to be idle
> > even when they really are. Reproducing this is a bit unpredictable; it
> > happens at random. With this new interface, we see this happen much
> > less often.
> 
> I don't know how you tested. Maybe it is caused by pages being swapped out,
> by shared pages being touched by other processes, or by some kernel path that
> does not preserve the access bit.

My thinking is along the same lines. Since we already know it has issues
for the reasons you mentioned, I am not sure what else needs investigation?

> Please investigate the root cause further. That would be an important
> point in justifying the motivation for the patch.

The motivation is security. I am dropping the 'accuracy' factor I mentioned
from the patch description since it created a lot of confusion.

> > > > More over looking up PFN from pagemap in Android devices is not
> > > > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > > > the PFN.
> > > > 
> > > > This patch adds support to directly interact with page_idle tracking at
> > > > the PID level by introducing a /proc/<pid>/page_idle file. This
> > > > eliminates the need for userspace to calculate the mapping of the page.
> > > > It follows the exact same semantics as the global
> > > > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > > > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > > 
> > > Ah, so the primary goal is to provide a convenience interface and it would
> > > help accuracy, too. IOW, accuracy is not your main goal?
> > 
> > There are a couple of primary goals: security, convenience and also solving
> > the accuracy/reliability problem we are seeing. Do keep in mind looking up
> > PFN has security implications. The PFN field in pagemap is zeroed if the user
> > does not have CAP_SYS_ADMIN.
> 
> Maybe you don't need the PFN, do you?

With the traditional idle tracking, PFN is needed which has the mentioned
security issues. This patch solves it. And the interface is identical and
familiar to the existing page_idle bitmap interface.
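
For anyone wiring this into userspace, the indexing works out roughly as
follows. This is only a sketch: idle_bit_for_addr(), chunk_says_idle() and
struct idle_pos are hypothetical names, and a 4 KiB page size is assumed
rather than queried via sysconf(). Each bit covers one virtual frame and
accesses are done in 8-byte chunks, so a virtual address maps to a file
offset and a bit like this:

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_PAGE_SHIFT	12u	/* assumption: 4 KiB pages */
#define CHUNK_BYTES		8u	/* the bitmap is accessed in u64 chunks */
#define CHUNK_BITS		64u

/* Where the idle bit for one virtual address lives in the per-pid file. */
struct idle_pos {
	uint64_t offset;	/* byte offset to pass to pread()/pwrite() */
	unsigned int bit;	/* bit index inside the 64-bit chunk */
};

/* Hypothetical helper: virtual address -> (chunk offset, bit). */
static struct idle_pos idle_bit_for_addr(uint64_t vaddr)
{
	uint64_t vfn = vaddr >> ASSUMED_PAGE_SHIFT;	/* virtual frame number */
	struct idle_pos p;

	p.offset = (vfn / CHUNK_BITS) * CHUNK_BYTES;
	p.bit = (unsigned int)(vfn % CHUNK_BITS);
	return p;
}

/* Test the bit in a chunk that was read from p.offset. */
static int chunk_says_idle(uint64_t chunk, struct idle_pos p)
{
	assert(p.bit < CHUNK_BITS);
	return (int)((chunk >> p.bit) & 1u);
}
```

The same arithmetic applies to the physical bitmap if a PFN is used in
place of the virtual frame number, which is what keeps the two interfaces
alike.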

> > > > In Android, we are using this for the heap profiler (heapprofd) which
> > > > profiles and pin points code paths which allocates and leaves memory
> > > > idle for long periods of time.
> > > 
> > > So the goal is to detect idle pages with idle memory tracking?
> > 
> > Isn't that what idle memory tracking does?
> 
> To me, it's rather misleading. Please read the motivation section in the
> document. The feature is good at detecting workingset pages, not idle pages,
> because workingset pages are never freed or swapped out, and we could even
> account for newly allocated pages.
> 
> Motivation
> ==========
> 
> The idle page tracking feature allows to track which memory pages are being
> accessed by a workload and which are idle. This information can be useful for
> estimating the workload's working set size, which, in turn, can be taken into
> account when configuring the workload parameters, setting memory cgroup limits,
> or deciding where to place the workload within a compute cluster.

As we discussed by chat, we could collect additional metadata to check if
pages were swapped or freed ever since the time we marked them as idle.
However, this can be an incremental improvement.
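
For the swap part, a rough sketch of what userspace could check, using the
"present" (bit 63) and "swapped" (bit 62) flags of a /proc/<pid>/pagemap
entry. The helper names and the policy in idle_sample_usable() are
hypothetical, and how a profiler would store this per sampling period is
left open:

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of one 64-bit /proc/<pid>/pagemap entry. */
#define PM_PRESENT	(1ULL << 63)		/* page is in RAM */
#define PM_SWAPPED	(1ULL << 62)		/* page is in swap */
#define PM_PFN_MASK	((1ULL << 55) - 1)	/* bits 0-54: PFN when present */

enum page_state { PAGE_ABSENT, PAGE_RESIDENT, PAGE_IN_SWAP };

/* Hypothetical helper: classify a raw pagemap entry. */
static enum page_state classify_pagemap_entry(uint64_t entry)
{
	if (entry & PM_PRESENT)
		return PAGE_RESIDENT;
	if (entry & PM_SWAPPED)
		return PAGE_IN_SWAP;
	return PAGE_ABSENT;
}

/* PFN of a present page; reads as 0 when the caller lacks CAP_SYS_ADMIN. */
static uint64_t pagemap_pfn(uint64_t entry)
{
	assert(entry & PM_PRESENT);
	return entry & PM_PFN_MASK;
}

/*
 * Hypothetical policy: discard an idle sample if the page was not
 * resident both when it was marked and when it was read back.
 */
static int idle_sample_usable(uint64_t entry_at_mark, uint64_t entry_at_read)
{
	return classify_pagemap_entry(entry_at_mark) == PAGE_RESIDENT &&
	       classify_pagemap_entry(entry_at_read) == PAGE_RESIDENT;
}
```

Note the limitation: this only catches pages that are in swap at one of the
two sampling points. A page that was swapped out and back in between them
still looks resident at both ends.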

> > > It couldn't work well because such idle pages could eventually be swapped
> > > out and lose every flag in the page descriptor, which is the working
> > > mechanism of idle page tracking. It should have been named "workingset
> > > page tracking", not "idle page tracking".
> > 
> > The heap profiler that uses page-idle tracking is not meant to measure the
> > working set, but to look for pages that are idle for long periods of time.
> 
> That's an important part. Please include it in the description so that people
> understand the use case. As I said above, if the aim is to find pages that
> stay idle during the period, the current idle page tracking feature is,
> ironically, not a good fit.

Ok, I will mention it.

> > Thanks for bringing up the swapping corner case..  Perhaps we can improve
> > the heap profiler to detect this by looking at bits 0-4 in pagemap. While it
> 
> Yes, that could work, but wouldn't it add back the overhead you want to remove?
> Userspace would even have to keep metadata to tell whether a page was already
> swapped in the last period or newly swapped in the new period.

Yep.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-22 22:06 ` [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing Andrew Morton
  2019-07-23 14:43   ` Joel Fernandes
@ 2019-07-24 19:33   ` Joel Fernandes
  1 sibling, 0 replies; 18+ messages in thread
From: Joel Fernandes @ 2019-07-24 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, carmenjackson, Christian Hansen,
	Colin Ian King, dancol, David Howells, fmayer, joaodias,
	Jonathan Corbet, Kees Cook, Kirill Tkhai, Konstantin Khlebnikov,
	linux-doc, linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport,
	minchan, minchan, namhyung, sspatil, surenb, Thomas Gleixner,
	timmurray, tkjos, Vlastimil Babka, wvw

On Mon, Jul 22, 2019 at 03:06:39PM -0700, Andrew Morton wrote:
[snip] 
> > +	*end = *start + count * BITS_PER_BYTE;
> > +	if (*end > max_frame)
> > +		*end = max_frame;
> > +	return 0;
> > +}
> > +
> >
> > ...
> >
> > +static void add_page_idle_list(struct page *page,
> > +			       unsigned long addr, struct mm_walk *walk)
> > +{
> > +	struct page *page_get;
> > +	struct page_node *pn;
> > +	int bit;
> > +	unsigned long frames;
> > +	struct page_idle_proc_priv *priv = walk->private;
> > +	u64 *chunk = (u64 *)priv->buffer;
> > +
> > +	if (priv->write) {
> > +		/* Find whether this page was asked to be marked */
> > +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> > +		bit = frames % BITMAP_CHUNK_BITS;
> > +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> > +		if (((*chunk >> bit) & 1) == 0)
> > +			return;
> > +	}
> > +
> > +	page_get = page_idle_get_page(page);
> > +	if (!page_get)
> > +		return;
> > +
> > +	pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
> 
> I'm not liking this GFP_ATOMIC.  If I'm reading the code correctly,
> userspace can ask for an arbitrarily large number of GFP_ATOMIC
> allocations by doing a large read.  This can potentially exhaust page
> reserves which things like networking Rx interrupts need and can make
> this whole feature less reliable.

For the revision, I will pre-allocate the page nodes in advance so it does
not need to do this. Diff on top of this patch is below. Let me know any
comments, thanks.

Btw, I also dropped the idle_page_list_lock by putting the idle_page_list
list_head on the stack instead of heap.
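
As an illustration only (this is a userspace sketch, not the kernel patch), the
pre-allocation idea can be modelled as a pool that is sized once in sleepable
context and then handed out slot by slot where allocation is not allowed. The
names node_pool, pool_init, pool_take and pool_destroy are hypothetical, and
calloc()/free() stand in for the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's struct page_node. */
struct node {
	unsigned long addr;
};

/* Hypothetical pool: one array sized for the worst case, plus a cursor. */
struct node_pool {
	struct node *nodes;
	size_t cap;
	size_t cur;
};

/* Sleepable context: allocate every node the walk could possibly need. */
static int pool_init(struct node_pool *p, size_t nr_frames)
{
	p->nodes = calloc(nr_frames, sizeof(*p->nodes));
	if (!p->nodes)
		return -1;
	p->cap = nr_frames;
	p->cur = 0;
	return 0;
}

/* Atomic context: no allocation, just hand out the next free slot. */
static struct node *pool_take(struct node_pool *p)
{
	assert(p->cur < p->cap);
	return &p->nodes[p->cur++];
}

/* Back in sleepable context: release the whole pool at once. */
static void pool_destroy(struct node_pool *p)
{
	free(p->nodes);
	p->nodes = NULL;
	p->cap = 0;
	p->cur = 0;
}
```

The cost of this design is that the pool is sized for every frame in the
requested range, even if only a few pages end up on the list.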
---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH] mm/page_idle: Avoid need for GFP_ATOMIC

GFP_ATOMIC allocations can drain the reserves that other allocations
depend on. Pre-allocate the list of nodes so that the spinlocked region can
just use it.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 mm/page_idle.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 874a60c41fef..b9c790721f16 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -266,6 +266,10 @@ struct page_idle_proc_priv {
 	unsigned long start_addr;
 	char *buffer;
 	int write;
+
+	/* Pre-allocate and provide nodes to add_page_idle_list() */
+	struct page_node *page_nodes;
+	int cur_page_node;
 };
 
 static void add_page_idle_list(struct page *page,
@@ -291,10 +295,7 @@ static void add_page_idle_list(struct page *page,
 	if (!page_get)
 		return;
 
-	pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
-	if (!pn)
-		return;
-
+	pn = &(priv->page_nodes[priv->cur_page_node++]);
 	pn->page = page_get;
 	pn->addr = addr;
 	list_add(&pn->list, &idle_page_list);
@@ -379,6 +380,15 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
 	priv.buffer = buffer;
 	priv.start_addr = start_addr;
 	priv.write = write;
+
+	priv.cur_page_node = 0;
+	priv.page_nodes = kcalloc(end_frame - start_frame,
+				  sizeof(struct page_node), GFP_KERNEL);
+	if (!priv.page_nodes) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
 	walk.private = &priv;
 	walk.mm = mm;
 
@@ -425,6 +435,7 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
 		ret = copy_to_user(ubuff, buffer, count);
 
 	up_read(&mm->mmap_sem);
+	kfree(priv.page_nodes);
 out:
 	kfree(buffer);
 out_mmput:
-- 
2.22.0.657.g960e92d24f-goog



* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-24 14:10       ` Joel Fernandes
@ 2019-07-25  8:15         ` Konstantin Khlebnikov
  2019-07-26  0:06           ` Joel Fernandes
  0 siblings, 1 reply; 18+ messages in thread
From: Konstantin Khlebnikov @ 2019-07-25  8:15 UTC (permalink / raw)
  To: Joel Fernandes, Minchan Kim
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, Jonathan Corbet, Kees Cook, Kirill Tkhai, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, namhyung,
	sspatil

On 24.07.2019 17:10, Joel Fernandes wrote:
 > On Wed, Jul 24, 2019 at 01:28:42PM +0900, Minchan Kim wrote:
 >> On Tue, Jul 23, 2019 at 10:20:49AM -0400, Joel Fernandes wrote:
 >>> On Tue, Jul 23, 2019 at 03:13:58PM +0900, Minchan Kim wrote:
 >>>> Hi Joel,
 >>>>
 >>>> On Mon, Jul 22, 2019 at 05:32:04PM -0400, Joel Fernandes (Google) wrote:
 >>>>> The page_idle tracking feature currently requires looking up the pagemap
 >>>>> for a process followed by interacting with /sys/kernel/mm/page_idle.
 >>>>> This is quite cumbersome and can be error-prone too. If between
 >>>>
 >>>> cumbersome: That's the fair tradeoff between idle page tracking and
 >>>> clear_refs because idle page tracking could check even though the page
 >>>> is not mapped.
 >>>
 >>> It is fair tradeoff, but could be made simpler. The userspace code got
 >>> reduced by a good amount as well.
 >>>
 >>>> error-prone: What's the error?
 >>>
 >>> We see in normal Android usage, that some of the times pages appear not to be
 >>> idle even when they really are idle. Reproducing this is a bit unpredictable
 >>> and happens at random occasions. With this new interface, we are seeing this
 >>> happen much much lesser.
 >>
 >> I don't know how you did test. Maybe that could be contributed by
 >> swapping out or shared pages touched by other processes or some kernel
 >> behavior not to keep access bit of their operation.
 >
 > It could be something along these lines is my thinking as well. So we know
 > its already has issues due to what you mentioned, I am not sure what else
 > needs investigation?
 >
 >> Please investigate more what's the root cause. That would be important
 >> point to justify for the patch motivation.
 >
 > The motivation is security. I am dropping the 'accuracy' factor I mentioned
 > from the patch description since it created a lot of confusion.
If you are tracking the idle working set of one process, you could turn the
degraded 'accuracy' to your advantage: just don't walk the page rmap and play
only with the access bits in that one process. Foreign accesses could be
detected with an arbitrary delay, but this is not important if the main goal
is heap profiling.

 >
 >>>>> More over looking up PFN from pagemap in Android devices is not
 >>>>> supported by unprivileged process and requires SYS_ADMIN and gives 0 for
 >>>>> the PFN.
 >>>>>
 >>>>> This patch adds support to directly interact with page_idle tracking at
 >>>>> the PID level by introducing a /proc/<pid>/page_idle file. This
 >>>>> eliminates the need for userspace to calculate the mapping of the page.
 >>>>> It follows the exact same semantics as the global
 >>>>> /sys/kernel/mm/page_idle, however it is easier to use for some usecases
 >>>>> where looking up PFN is not needed and also does not require SYS_ADMIN.
 >>>>
 >>>> Ah, so the primary goal is to provide a convenience interface and it would
 >>>> help accuracy, too. IOW, accuracy is not your main goal?
 >>>
 >>> There are a couple of primary goals: Security, convenience and also solving
 >>> the accuracy/reliability problem we are seeing. Do keep in mind looking up
 >>> PFN has security implications. The PFN field in pagemap is zeroed if the user
 >>> does not have CAP_SYS_ADMIN.
 >>
 >> Maybe you don't need the PFN, do you?
 >
 > With the traditional idle tracking, PFN is needed which has the mentioned
 > security issues. This patch solves it. And the interface is identical and
 > familiar to the existing page_idle bitmap interface.
 >
 >>>>> In Android, we are using this for the heap profiler (heapprofd) which
 >>>>> profiles and pin points code paths which allocates and leaves memory
 >>>>> idle for long periods of time.
 >>>>
 >>>> So the goal is to detect idle pages with idle memory tracking?
 >>>
 >>> Isn't that what idle memory tracking does?
 >>
 >> To me, it's rather misleading. Please read motivation section in document.
 >> The feature would be good to detect workingset pages, not idle pages
 >> because workingset pages are never freed, swapped out and even we could
 >> count on newly allocated pages.
 >>
 >> Motivation
 >> ==========
 >>
 >> The idle page tracking feature allows to track which memory pages are being
 >> accessed by a workload and which are idle. This information can be useful for
 >> estimating the workload's working set size, which, in turn, can be taken into
 >> account when configuring the workload parameters, setting memory cgroup limits,
 >> or deciding where to place the workload within a compute cluster.
 >
 > As we discussed by chat, we could collect additional metadata to check if
 > pages were swapped or freed ever since the time we marked them as idle.
 > However this can be incremental improvement.
 >
 >>>> It couldn't work well because such idle pages could finally swap out and
 >>>> lose every flags of the page descriptor which is working mechanism of
 >>>> idle page tracking. It should have named "workingset page tracking",
 >>>> not "idle page tracking".
 >>>
 >>> The heap profiler that uses page-idle tracking is not to measure working set,
 >>> but to look for pages that are idle for long periods of time.
 >>
 >> It's important part. Please include it in the description so that people
 >> understands what's the usecase. As I said above, if it aims for finding
 >> idle pages during the period, current idle page tracking feature is not
 >> good ironically.
 >
 > Ok, I will mention.
 >
 >>> Thanks for bringing up the swapping corner case..  Perhaps we can improve
 >>> the heap profiler to detect this by looking at bits 0-4 in pagemap. While it
 >>
 >> Yeb, that could work but it could add overhead again what you want to remove?
 >> Even, userspace should keep metadata to identify that page was already swapped
 >> in last period or newly swapped in new period.
 >
 > Yep.
Between samples, a page could be read from swap and swapped out again multiple times.
To track this, swap ptes could be marked with an idle bit too.
I believe it's not so hard to find a free bit for this.

A refault/swapout will automatically clear this bit in the pte, even if the
page goes nowhere and stays in the swap cache.





* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-25  8:15         ` Konstantin Khlebnikov
@ 2019-07-26  0:06           ` Joel Fernandes
  2019-07-26 11:16             ` Konstantin Khlebnikov
  0 siblings, 1 reply; 18+ messages in thread
From: Joel Fernandes @ 2019-07-26  0:06 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Minchan Kim, linux-kernel, vdavydov.dev, Brendan Gregg,
	kernel-team, Alexey Dobriyan, Al Viro, Andrew Morton,
	carmenjackson, Christian Hansen, Colin Ian King, dancol,
	David Howells, fmayer, joaodias, Jonathan Corbet, Kees Cook,
	Kirill Tkhai, linux-doc, linux-fsdevel, linux-mm, Michal Hocko,
	Mike Rapoport, namhyung, sspatil

On Thu, Jul 25, 2019 at 11:15:53AM +0300, Konstantin Khlebnikov wrote:
[snip]
> >>> Thanks for bringing up the swapping corner case..  Perhaps we can improve
> >>> the heap profiler to detect this by looking at bits 0-4 in pagemap. While it
> >>
> >> Yeb, that could work but it could add overhead again what you want to remove?
> >> Even, userspace should keep metadata to identify that page was already swapped
> >> in last period or newly swapped in new period.
> >
> > Yep.
> Between samples page could be read from swap and swapped out back multiple times.
> For tracking this swap ptes could be marked with idle bit too.
> I believe it's not so hard to find free bit for this.
> 
> Refault\swapout will automatically clear this bit in pte even if
> page goes nowhere stays if swap-cache.

Could you clarify more about your idea? Do you mean swapout will clear the new
idle swap-pte bit if the page was accessed just before the swapout?

Instead, I thought of using is_swap_pte() to detect if the PTE belongs to a
page that was swapped out. If so, assume the page was idle. Sure, we
would miss the fact that the page was accessed before the swapout within the
sampling window; however, if the page was swapped out, then it is likely idle
anyway.
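
As an illustration only (a userspace model, not the kernel code), the reporting
rule described above can be sketched as follows: a swapped-out entry sets its
bit in the idle bitmap exactly like a present page that is idle. The enum
model_pte_state and report_idle() are hypothetical names; CHUNK_BITS mirrors
the 64-bit chunks that the patch indexes with BITMAP_CHUNK_BITS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_BITS 64	/* bits per u64 bitmap chunk */

/* Hypothetical per-address state; not the kernel's pte_t. */
enum model_pte_state {
	PTE_NONE,		/* nothing mapped: never reported */
	PTE_PRESENT_ACCESSED,	/* mapped and touched since marking */
	PTE_PRESENT_IDLE,	/* mapped and untouched since marking */
	PTE_SWAPPED,		/* swap entry: assumed idle */
};

/*
 * Fill the idle bitmap for nr consecutive pages. The caller must zero
 * the bitmap and size it to hold at least nr bits.
 */
static void report_idle(const enum model_pte_state *ptes, size_t nr,
			uint64_t *bitmap)
{
	for (size_t off = 0; off < nr; off++) {
		if (ptes[off] == PTE_SWAPPED || ptes[off] == PTE_PRESENT_IDLE)
			bitmap[off / CHUNK_BITS] |=
				(uint64_t)1 << (off % CHUNK_BITS);
	}
}
```

In this model a page that was accessed and then swapped out within the
sampling window is still reported idle, which is the inaccuracy accepted above.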

My current patch was just reporting swapped-out pages as non-idle (idle bit
not set), which is wrong, as Minchan pointed out. So I added the patch below on
top of this patch (still testing):

thanks,

 - Joel
---8<-----------------------

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 3667ed9cc904..46c2dd18cca8 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -271,10 +271,14 @@ struct page_idle_proc_priv {
 	struct list_head *idle_page_list;
 };
 
+/*
+ * Add a page to the idle page list.
+ * page can also be NULL if pte was not present or swapped.
+ */
 static void add_page_idle_list(struct page *page,
 			       unsigned long addr, struct mm_walk *walk)
 {
-	struct page *page_get;
+	struct page *page_get = NULL;
 	struct page_node *pn;
 	int bit;
 	unsigned long frames;
@@ -290,9 +294,11 @@ static void add_page_idle_list(struct page *page,
 			return;
 	}
 
-	page_get = page_idle_get_page(page);
-	if (!page_get)
-		return;
+	if (page) {
+		page_get = page_idle_get_page(page);
+		if (!page_get)
+			return;
+	}
 
 	pn = &(priv->page_nodes[priv->cur_page_node++]);
 	pn->page = page_get;
@@ -326,6 +332,15 @@ static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		/*
+		 * We add swapped pages to the idle_page_list so that we can
+		 * report to userspace that they are idle.
+		 */
+		if (is_swap_pte(*pte)) {
+			add_page_idle_list(NULL, addr, walk);
+			continue;
+		}
+
 		if (!pte_present(*pte))
 			continue;
 
@@ -413,10 +428,12 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
 			goto remove_page;
 
 		if (write) {
-			page_idle_clear_pte_refs(page);
-			set_page_idle(page);
+			if (page) {
+				page_idle_clear_pte_refs(page);
+				set_page_idle(page);
+			}
 		} else {
-			if (page_really_idle(page)) {
+			if (!page || page_really_idle(page)) {
 				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
 				bit = off % BITMAP_CHUNK_BITS;
 				index = off / BITMAP_CHUNK_BITS;
-- 
2.22.0.709.g102302147b-goog



* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-26  0:06           ` Joel Fernandes
@ 2019-07-26 11:16             ` Konstantin Khlebnikov
  2019-07-26 12:54               ` Joel Fernandes
  0 siblings, 1 reply; 18+ messages in thread
From: Konstantin Khlebnikov @ 2019-07-26 11:16 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Minchan Kim, linux-kernel, vdavydov.dev, Brendan Gregg,
	kernel-team, Alexey Dobriyan, Al Viro, Andrew Morton,
	carmenjackson, Christian Hansen, Colin Ian King, dancol,
	David Howells, fmayer, joaodias, Jonathan Corbet, Kees Cook,
	Kirill Tkhai, linux-doc, linux-fsdevel, linux-mm, Michal Hocko,
	Mike Rapoport, namhyung, sspatil

On 26.07.2019 3:06, Joel Fernandes wrote:
> On Thu, Jul 25, 2019 at 11:15:53AM +0300, Konstantin Khlebnikov wrote:
> [snip]
>>>>> Thanks for bringing up the swapping corner case..  Perhaps we can improve
>>>>> the heap profiler to detect this by looking at bits 0-4 in pagemap. While it
>>>>
>>>> Yeb, that could work but it could add overhead again what you want to remove?
>>>> Even, userspace should keep metadata to identify that page was already swapped
>>>> in last period or newly swapped in new period.
>>>
>>> Yep.
>> Between samples page could be read from swap and swapped out back multiple times.
>> For tracking this swap ptes could be marked with idle bit too.
>> I believe it's not so hard to find free bit for this.
>>
>> Refault\swapout will automatically clear this bit in pte even if
>> page goes nowhere stays if swap-cache.
> 
> Could you clarify more about your idea? Do you mean swapout will clear the new
> idle swap-pte bit if the page was accessed just before the swapout?
> 
> Instead, I thought of using is_swap_pte() to detect if the PTE belong to a
> page that was swapped. And if so, then assume the page was idle. Sure we
> would miss data that the page was accessed before the swap out in the
> sampling window, however if the page was swapped out, then it is likely idle
> anyway.


I mean the page might be in swap when you mark pages idle, and
then be accessed and swapped out again before the second pass.

I propose marking the swap pte with an idle bit, which will be automatically
cleared by the following swapin/swapout pair:

page alloc -> install page pte
page swapout -> install swap entry in pte
mark vm idle -> set swap-idle bit in swap pte
access/swapin -> install page pte (clear page idle if set)
page swapout -> install swap entry in pte (without swap idle bit)
scan vm idle -> see swap entry without idle bit -> page has been accessed since marking idle

One bit in the pte is enough for tracking. This does not need any propagation of
idle bits between page and swap, or marking pages as idle in the swap cache.
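
As an illustration only, the sequence above can be modelled in userspace as a
small state machine. struct model_pte and the pte_* helpers are hypothetical
and are not kernel APIs; the one rule being modelled is that a freshly
installed swap entry never carries the idle bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one PTE; not the kernel's pte_t. */
struct model_pte {
	bool present;	/* true: maps a page; false: holds a swap entry */
	bool page_idle;	/* idle flag of the mapped page (when present) */
	bool swap_idle;	/* proposed idle bit in the swap entry (when !present) */
};

/* page alloc or swapin: install a page pte with no idle state. */
static void pte_install_page(struct model_pte *pte)
{
	pte->present = true;
	pte->page_idle = false;
	pte->swap_idle = false;
}

/* page swapout: install a swap entry, always without the swap-idle bit. */
static void pte_swapout(struct model_pte *pte)
{
	pte->present = false;
	pte->page_idle = false;
	pte->swap_idle = false;
}

/* mark vm idle: set whichever idle bit matches the current pte type. */
static void pte_mark_idle(struct model_pte *pte)
{
	if (pte->present)
		pte->page_idle = true;
	else
		pte->swap_idle = true;
}

/* access: swap the page in if needed, then clear the page idle flag. */
static void pte_access(struct model_pte *pte)
{
	if (!pte->present)
		pte_install_page(pte);
	pte->page_idle = false;
}

/* scan vm idle: report idle only if the matching idle bit survived. */
static bool pte_scan_idle(const struct model_pte *pte)
{
	return pte->present ? pte->page_idle : pte->swap_idle;
}
```

Because nothing is propagated from page to swap entry, this model also reports
a present page that was marked idle and then swapped out untouched as non-idle.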

> 
> My current patch was just reporting swapped out pages as non-idle (idle bit
> not set) which is wrong as Minchan pointed. So I added below patch on top of
> this patch (still testing..) :
> 
> thanks,
> 
>   - Joel
> ---8<-----------------------
> 
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index 3667ed9cc904..46c2dd18cca8 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -271,10 +271,14 @@ struct page_idle_proc_priv {
>   	struct list_head *idle_page_list;
>   };
>   
> +/*
> + * Add a page to the idle page list.
> + * page can also be NULL if pte was not present or swapped.
> + */
>   static void add_page_idle_list(struct page *page,
>   			       unsigned long addr, struct mm_walk *walk)
>   {
> -	struct page *page_get;
> +	struct page *page_get = NULL;
>   	struct page_node *pn;
>   	int bit;
>   	unsigned long frames;
> @@ -290,9 +294,11 @@ static void add_page_idle_list(struct page *page,
>   			return;
>   	}
>   
> -	page_get = page_idle_get_page(page);
> -	if (!page_get)
> -		return;
> +	if (page) {
> +		page_get = page_idle_get_page(page);
> +		if (!page_get)
> +			return;
> +	}
>   
>   	pn = &(priv->page_nodes[priv->cur_page_node++]);
>   	pn->page = page_get;
> @@ -326,6 +332,15 @@ static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
>   
>   	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>   	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		/*
> +		 * We add swapped pages to the idle_page_list so that we can
> +		 * report to userspace that they are idle.
> +		 */
> +		if (is_swap_pte(*pte)) {
> +			add_page_idle_list(NULL, addr, walk);
> +			continue;
> +		}
> +
>   		if (!pte_present(*pte))
>   			continue;
>   
> @@ -413,10 +428,12 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
>   			goto remove_page;
>   
>   		if (write) {
> -			page_idle_clear_pte_refs(page);
> -			set_page_idle(page);
> +			if (page) {
> +				page_idle_clear_pte_refs(page);
> +				set_page_idle(page);
> +			}
>   		} else {
> -			if (page_really_idle(page)) {
> +			if (!page || page_really_idle(page)) {
>   				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
>   				bit = off % BITMAP_CHUNK_BITS;
>   				index = off / BITMAP_CHUNK_BITS;
> 


* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
  2019-07-26 11:16             ` Konstantin Khlebnikov
@ 2019-07-26 12:54               ` Joel Fernandes
  0 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes @ 2019-07-26 12:54 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Minchan Kim, linux-kernel, vdavydov.dev, Brendan Gregg,
	kernel-team, Alexey Dobriyan, Al Viro, Andrew Morton,
	carmenjackson, Christian Hansen, Colin Ian King, dancol,
	David Howells, fmayer, joaodias, Jonathan Corbet, Kees Cook,
	Kirill Tkhai, linux-doc, linux-fsdevel, linux-mm, Michal Hocko,
	Mike Rapoport, namhyung, sspatil

On Fri, Jul 26, 2019 at 02:16:20PM +0300, Konstantin Khlebnikov wrote:
> On 26.07.2019 3:06, Joel Fernandes wrote:
> > On Thu, Jul 25, 2019 at 11:15:53AM +0300, Konstantin Khlebnikov wrote:
> > [snip]
> > > > > > Thanks for bringing up the swapping corner case..  Perhaps we can improve
> > > > > > the heap profiler to detect this by looking at bits 0-4 in pagemap. While it
> > > > > 
> > > > > Yeb, that could work but it could add overhead again what you want to remove?
> > > > > Even, userspace should keep metadata to identify that page was already swapped
> > > > > in last period or newly swapped in new period.
> > > > 
> > > > Yep.
> > > Between samples page could be read from swap and swapped out back multiple times.
> > > For tracking this swap ptes could be marked with idle bit too.
> > > I believe it's not so hard to find free bit for this.
> > > 
> > > Refault\swapout will automatically clear this bit in pte even if
> > > page goes nowhere stays if swap-cache.
> > 
> > Could you clarify more about your idea? Do you mean swapout will clear the new
> > idle swap-pte bit if the page was accessed just before the swapout?
> > 
> > Instead, I thought of using is_swap_pte() to detect if the PTE belong to a
> > page that was swapped. And if so, then assume the page was idle. Sure we
> > would miss data that the page was accessed before the swap out in the
> > sampling window, however if the page was swapped out, then it is likely idle
> > anyway.
> 
> 
> I mean page might be in swap when you mark pages idle and
> then been accessed and swapped back before second pass.
> 
> I propose marking swap pte with idle bit which will be automatically
> cleared by following swapin/swapout pair:
> 
> page alloc -> install page pte
> page swapout -> install swap entry in pte
> mark vm idle -> set swap-idle bit in swap pte
> access/swapin -> install page pte (clear page idle if set)
> page swapout -> install swap entry in pte (without swap idle bit)
> scan vm idle -> see swap entry without idle bit -> page has been accessed since marking idle
> 
> One bit in pte is enough for tracking. This does not needs any propagation for
> idle bits between page and swap, or marking pages as idle in swap cache.

Ok, I see the case you are referring to now. This can be a follow-up patch,
because the limitation you mentioned is also inherent in the (traditional)
physical page_idle tracking, if that were used. The reason is that after
swapping, the PTE is not mapped to any page, so there is nothing to mark as
idle. So if the page gets swapped out and back in in the meantime, you
would run into the same issue.

But yes, we should certainly address it in the future. I just want to keep
things simple at the moment. I will make a note of your suggestion, but you
are welcome to write a patch for it on top of my patch. I am about to send
another revision shortly for further review.

thanks,

 - Joel



end of thread, other threads:[~2019-07-26 12:54 UTC | newest]

Thread overview: 18+ messages
2019-07-22 21:32 [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing Joel Fernandes (Google)
2019-07-22 21:32 ` [PATCH v1 2/2] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
2019-07-22 22:06 ` [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing Andrew Morton
2019-07-23 14:43   ` Joel Fernandes
2019-07-24 19:33   ` Joel Fernandes
2019-07-23  6:05 ` Michal Hocko
2019-07-23 14:34   ` Joel Fernandes
2019-07-23  6:13 ` Minchan Kim
2019-07-23 14:20   ` Joel Fernandes
2019-07-24  4:28     ` Minchan Kim
2019-07-24 14:10       ` Joel Fernandes
2019-07-25  8:15         ` Konstantin Khlebnikov
2019-07-26  0:06           ` Joel Fernandes
2019-07-26 11:16             ` Konstantin Khlebnikov
2019-07-26 12:54               ` Joel Fernandes
2019-07-23  8:43 ` Konstantin Khlebnikov
2019-07-23 10:10   ` Konstantin Khlebnikov
2019-07-23 13:47     ` Joel Fernandes
