linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
@ 2019-08-05 17:04 Joel Fernandes (Google)
  2019-08-05 17:04 ` [PATCH v4 2/5] [RFC] x86: Add support for idle bit in swap PTE Joel Fernandes (Google)
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Joel Fernandes (Google) @ 2019-08-05 17:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Alexey Dobriyan, Andrew Morton, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, joelaf, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, minchan, namhyung, paulmck,
	Robin Murphy, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

The page_idle tracking feature currently requires looking up the pagemap
for a process followed by interacting with /sys/kernel/mm/page_idle.
Looking up PFN from pagemap in Android devices is not supported by
unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.

This patch adds support to directly interact with page_idle tracking at
the PID level by introducing a /proc/<pid>/page_idle file.  It follows
the exact same semantics as the global /sys/kernel/mm/page_idle, but now
looking up PFN through pagemap is not needed since the interface uses
virtual frame numbers, and at the same time also does not require
SYS_ADMIN.

In Android, we are using this for the heap profiler (heapprofd) which
profiles and pin points code paths which allocates and leaves memory
idle for long periods of time. This method solves the security issue
with userspace learning the PFN, and while at it is also shown to yield
better results than the pagemap lookup, the theory being that the window
where the address space can change is reduced by eliminating the
intermediate pagemap look up stage. In virtual address indexing, the
process's mmap_sem is held for the duration of the access.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
v3->v4: Minor fixups (Minchan)
        Add swap pte handling (Konstantin, Minchan)
v2->v3:
Fixed a bug where I was doing a kfree that is not needed due to not
needing to do GFP_ATOMIC allocations.

v1->v2:
Mark swap ptes as idle (Minchan)
Avoid need for GFP_ATOMIC (Andrew)
Get rid of idle_page_list lock by moving list to stack

Internal review -> v1:
Fixes from Suren.
Corrections to change log, docs (Florian, Sandeep)

 arch/Kconfig                  |   3 +
 fs/proc/base.c                |   3 +
 fs/proc/internal.h            |   1 +
 fs/proc/task_mmu.c            |  43 ++++
 include/asm-generic/pgtable.h |   6 +
 include/linux/page_idle.h     |   4 +
 mm/page_idle.c                | 359 +++++++++++++++++++++++++++++-----
 mm/rmap.c                     |   2 +
 8 files changed, 376 insertions(+), 45 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a7b57dd42c26..3aa121ce824e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -575,6 +575,9 @@ config ARCH_WANT_HUGE_PMD_SHARE
 config HAVE_ARCH_SOFT_DIRTY
 	bool
 
+config HAVE_ARCH_PTE_SWP_PGIDLE
+	bool
+
 config HAVE_MOD_ARCH_SPECIFIC
 	bool
 	help
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..fd2f74bd4e35 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3039,6 +3039,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
+#endif
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cd0c8d5ce9a1..bc9371880c63 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_page_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 582c5e680176..a9003fe8d267 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1650,6 +1650,49 @@ const struct file_operations proc_pagemap_operations = {
 	.open		= pagemap_open,
 	.release	= pagemap_release,
 };
+
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	return page_idle_proc_read(file, buf, count, ppos);
+}
+
+static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	return page_idle_proc_write(file, (char __user *)buf, count, ppos);
+}
+
+static int proc_page_idle_open(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm;
+
+	mm = proc_mem_open(inode, PTRACE_MODE_READ);
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+	file->private_data = mm;
+	return 0;
+}
+
+static int proc_page_idle_release(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm = file->private_data;
+
+	if (mm)
+		mmdrop(mm);
+	return 0;
+}
+
+const struct file_operations proc_page_idle_operations = {
+	.llseek		= mem_lseek, /* borrow this */
+	.read		= proc_page_idle_read,
+	.write		= proc_page_idle_write,
+	.open		= proc_page_idle_open,
+	.release	= proc_page_idle_release,
+};
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 75d9d68a6de7..6d51d0a355a7 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -712,6 +712,12 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
+#ifndef CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE
+static inline pte_t pte_swp_mkpage_idle(pte_t pte) { return pte; }
+static inline int pte_swp_page_idle(pte_t pte) { return 0; }
+static inline pte_t pte_swp_clear_mkpage_idle(pte_t pte) { return pte; }
+#endif
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f1bc2640d85e 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
 }
 #endif /* CONFIG_64BIT */
 
+ssize_t page_idle_proc_write(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
+ssize_t page_idle_proc_read(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
 #else /* !CONFIG_IDLE_PAGE_TRACKING */
 
 static inline bool page_is_young(struct page *page)
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..a5b00d63216c 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -5,17 +5,22 @@
 #include <linux/sysfs.h>
 #include <linux/kobject.h>
 #include <linux/mm.h>
-#include <linux/mmzone.h>
-#include <linux/pagemap.h>
-#include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
+#include <linux/mmzone.h>
 #include <linux/page_ext.h>
 #include <linux/page_idle.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/sched/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #define BITMAP_CHUNK_SIZE	sizeof(u64)
 #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
 
 /*
+ * Get a reference to a page for idle tracking purposes, with additional checks.
+ *
  * Idle page tracking only considers user memory pages, for other types of
  * pages the idle flag is always unset and an attempt to set it is silently
  * ignored.
@@ -25,18 +30,13 @@
  * page tracking. With such an indicator of user pages we can skip isolated
  * pages, but since there are not usually many of them, it will hardly affect
  * the overall result.
- *
- * This function tries to get a user memory page by pfn as described above.
  */
-static struct page *page_idle_get_page(unsigned long pfn)
+static struct page *page_idle_get_page(struct page *page_in)
 {
 	struct page *page;
 	pg_data_t *pgdat;
 
-	if (!pfn_valid(pfn))
-		return NULL;
-
-	page = pfn_to_page(pfn);
+	page = page_in;
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
@@ -51,6 +51,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
 	return page;
 }
 
+/*
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *page_idle_get_page_pfn(unsigned long pfn)
+{
+
+	if (!pfn_valid(pfn))
+		return NULL;
+
+	return page_idle_get_page(pfn_to_page(pfn));
+}
+
 static bool page_idle_clear_pte_refs_one(struct page *page,
 					struct vm_area_struct *vma,
 					unsigned long addr, void *arg)
@@ -118,6 +130,47 @@ static void page_idle_clear_pte_refs(struct page *page)
 		unlock_page(page);
 }
 
+/* Helper to get the start and end frame given a pos and count */
+static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long max_frame;
+
+	/* If an mm is not given, assume we want physical frames */
+	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
+
+	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
+		return -EINVAL;
+
+	*start = pos * BITS_PER_BYTE;
+	if (*start >= max_frame)
+		return -ENXIO;
+
+	*end = *start + count * BITS_PER_BYTE;
+	if (*end > max_frame)
+		*end = max_frame;
+	return 0;
+}
+
+static bool page_idle_pte_check(struct page *page)
+{
+	if (!page)
+		return false;
+
+	if (page_is_idle(page)) {
+		/*
+		 * The page might have been referenced via a
+		 * pte, in which case it is not idle. Clear
+		 * refs and recheck.
+		 */
+		page_idle_clear_pte_refs(page);
+		if (page_is_idle(page))
+			return true;
+	}
+
+	return false;
+}
+
 static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 				     struct bin_attribute *attr, char *buf,
 				     loff_t pos, size_t count)
@@ -125,35 +178,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 	u64 *out = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
+	int bit, ret;
 
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return 0;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret == -ENXIO)
+		return 0;  /* Reads beyond max_pfn do nothing */
+	else if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if (!bit)
 			*out = 0ULL;
-		page = page_idle_get_page(pfn);
-		if (page) {
-			if (page_is_idle(page)) {
-				/*
-				 * The page might have been referenced via a
-				 * pte, in which case it is not idle. Clear
-				 * refs and recheck.
-				 */
-				page_idle_clear_pte_refs(page);
-				if (page_is_idle(page))
-					*out |= 1ULL << bit;
-			}
+		page = page_idle_get_page_pfn(pfn);
+		if (page && page_idle_pte_check(page)) {
+			*out |= 1ULL << bit;
 			put_page(page);
 		}
 		if (bit == BITMAP_CHUNK_BITS - 1)
@@ -170,23 +209,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
 	const u64 *in = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
-
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
+	int bit, ret;
 
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return -ENXIO;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if ((*in >> bit) & 1) {
-			page = page_idle_get_page(pfn);
+			page = page_idle_get_page_pfn(pfn);
 			if (page) {
 				page_idle_clear_pte_refs(page);
 				set_page_idle(page);
@@ -224,6 +256,243 @@ struct page_ext_operations page_idle_ops = {
 };
 #endif
 
+/*  page_idle tracking for /proc/<pid>/page_idle */
+
+struct page_node {
+	struct page *page;
+	unsigned long addr;
+	struct list_head list;
+};
+
+struct page_idle_proc_priv {
+	unsigned long start_addr;
+	char *buffer;
+	int write;
+
+	/* Pre-allocate and provide nodes to pte_page_idle_proc_add() */
+	struct page_node *page_nodes;
+	int cur_page_node;
+	struct list_head *idle_page_list;
+};
+
+/*
+ * Set a page as idle or add it to a list to be set as idle later.
+ */
+static void pte_page_idle_proc_add(struct page *page,
+			       unsigned long addr, struct mm_walk *walk)
+{
+	struct page *page_get = NULL;
+	struct page_node *pn;
+	int bit;
+	unsigned long frames;
+	struct page_idle_proc_priv *priv = walk->private;
+	u64 *chunk = (u64 *)priv->buffer;
+
+	if (priv->write) {
+		VM_BUG_ON(!page);
+
+		/* Find whether this page was asked to be marked */
+		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
+		bit = frames % BITMAP_CHUNK_BITS;
+		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
+		if (((*chunk >> bit) & 1) == 0)
+			return;
+	}
+
+	if (page) {
+		page_get = page_idle_get_page(page);
+		if (!page_get)
+			return;
+	} else {
+		/* For swapped pages, set output bit as idle */
+		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
+		bit = frames % BITMAP_CHUNK_BITS;
+		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
+		*chunk |= (1 << bit);
+		return;
+	}
+
+	/*
+	 * For all other pages, add it to a list since we have to walk rmap,
+	 * which acquires ptlock, and we cannot walk rmap right now.
+	 */
+	pn = &(priv->page_nodes[priv->cur_page_node++]);
+	pn->page = page_get;
+	pn->addr = addr;
+	list_add(&pn->list, priv->idle_page_list);
+}
+
+static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	pte_t *pte;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+	struct page_idle_proc_priv *priv = walk->private;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (pmd_present(*pmd)) {
+			page = follow_trans_huge_pmd(vma, addr, pmd,
+						     FOLL_DUMP|FOLL_WRITE);
+			if (!IS_ERR_OR_NULL(page))
+				pte_page_idle_proc_add(page, addr, walk);
+		}
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		/* For swap_pte handling, we use an idle bit in the swap pte. */
+		if (is_swap_pte(*pte)) {
+			if (priv->write) {
+				set_pte_at(walk->mm, addr, pte,
+					   pte_swp_mkpage_idle(*pte));
+			} else {
+				/* If swap pte has idle bit set, report it as idle */
+				if (pte_swp_page_idle(*pte))
+					pte_page_idle_proc_add(NULL, addr, walk);
+			}
+			continue;
+		}
+
+		if (!pte_present(*pte))
+			continue;
+
+		page = vm_normal_page(vma, addr, *pte);
+		if (page)
+			pte_page_idle_proc_add(page, addr, walk);
+	}
+
+	pte_unmap_unlock(pte - 1, ptl);
+	return 0;
+}
+
+ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
+			       size_t count, loff_t *pos, int write)
+{
+	int ret;
+	char *buffer;
+	u64 *out;
+	unsigned long start_addr, end_addr, start_frame, end_frame;
+	struct mm_struct *mm = file->private_data;
+	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
+	struct page_node *cur;
+	struct page_idle_proc_priv priv;
+	bool walk_error = false;
+	LIST_HEAD(idle_page_list);
+
+	if (!mm || !mmget_not_zero(mm))
+		return -EINVAL;
+
+	if (count > PAGE_SIZE)
+		count = PAGE_SIZE;
+
+	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buffer) {
+		ret = -ENOMEM;
+		goto out_mmput;
+	}
+	out = (u64 *)buffer;
+
+	if (write && copy_from_user(buffer, ubuff, count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
+	if (ret)
+		goto out;
+
+	start_addr = (start_frame << PAGE_SHIFT);
+	end_addr = (end_frame << PAGE_SHIFT);
+	priv.buffer = buffer;
+	priv.start_addr = start_addr;
+	priv.write = write;
+
+	priv.idle_page_list = &idle_page_list;
+	priv.cur_page_node = 0;
+	priv.page_nodes = kzalloc(sizeof(struct page_node) *
+				  (end_frame - start_frame), GFP_KERNEL);
+	if (!priv.page_nodes) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	walk.private = &priv;
+	walk.mm = mm;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * idle_page_list is needed because walk_page_vma() holds ptlock which
+	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
+	 * pages first, and then call page_idle_clear_pte_refs().
+	 */
+	ret = walk_page_range(start_addr, end_addr, &walk);
+	if (ret)
+		walk_error = true;
+
+	list_for_each_entry(cur, &idle_page_list, list) {
+		int bit, index;
+		unsigned long off;
+		struct page *page = cur->page;
+
+		if (unlikely(walk_error))
+			goto remove_page;
+
+		if (write) {
+			if (page) {
+				page_idle_clear_pte_refs(page);
+				set_page_idle(page);
+			}
+		} else {
+			/* If page is NULL, it was swapped out */
+			if (!page || page_idle_pte_check(page)) {
+				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
+				bit = off % BITMAP_CHUNK_BITS;
+				index = off / BITMAP_CHUNK_BITS;
+				out[index] |= 1ULL << bit;
+			}
+		}
+remove_page:
+		if (page)
+			put_page(page);
+	}
+
+	if (!write && !walk_error)
+		ret = copy_to_user(ubuff, buffer, count);
+
+	up_read(&mm->mmap_sem);
+	kfree(priv.page_nodes);
+out:
+	kfree(buffer);
+out_mmput:
+	mmput(mm);
+	if (!ret)
+		ret = count;
+	return ret;
+
+}
+
+ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
+			    size_t count, loff_t *pos)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, 0);
+}
+
+ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
+			     size_t count, loff_t *pos, struct mm_struct *mm)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, 1);
+}
+
 static int __init page_idle_init(void)
 {
 	int err;
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..4bd618aab402 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1629,6 +1629,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (page_is_idle(page))
+				swp_pte = pte_swp_mkpage_idle(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 			/* Invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
-- 
2.22.0.770.g0f2c4a37fd-goog

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 2/5] [RFC] x86: Add support for idle bit in swap PTE
  2019-08-05 17:04 [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes (Google)
@ 2019-08-05 17:04 ` Joel Fernandes (Google)
  2019-08-05 17:04 ` [PATCH v4 3/5] [RFC] arm64: " Joel Fernandes (Google)
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Joel Fernandes (Google) @ 2019-08-05 17:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Alexey Dobriyan, Andrew Morton, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, joelaf, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, minchan, namhyung, paulmck,
	Robin Murphy, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

This bit will be used by idle page tracking code to correctly identify
if a page that was swapped out was idle before it got swapped out.
Without this PTE bit, we lose information about if a page is idle or not
since the page frame gets unmapped and the page gets freed.

Bits 2-6 are unused in the swap PTE (see the comment in
arch/x86/include/asm/pgtable_64.h). Bit 2 corresponds to _PAGE_USER. Use
it for swap PTE purposes.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/include/asm/pgtable.h       | 15 +++++++++++++++
 arch/x86/include/asm/pgtable_types.h |  6 ++++++
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 222855cc0158..728f22370f17 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -139,6 +139,7 @@ config X86
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if MMU && COMPAT
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
+	select HAVE_ARCH_PTE_SWP_PGIDLE
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0bc530c4eb13..ef3e662cee4a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1371,6 +1371,21 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
 #endif
 #endif
 
+static inline pte_t pte_swp_mkpage_idle(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_SWP_PGIDLE);
+}
+
+static inline int pte_swp_page_idle(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_SWP_PGIDLE;
+}
+
+static inline pte_t pte_swp_clear_mkpage_idle(pte_t pte)
+{
+	return pte_clear_flags(pte, _PAGE_SWP_PGIDLE);
+}
+
 #define PKRU_AD_BIT 0x1
 #define PKRU_WD_BIT 0x2
 #define PKRU_BITS_PER_PKEY 2
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b5e49e6bac63..6739cba4c900 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -100,6 +100,12 @@
 #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
 #endif
 
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+#define _PAGE_SWP_PGIDLE	_PAGE_USER
+#else
+#define _PAGE_SWP_PGIDLE	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-05 17:04 [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes (Google)
  2019-08-05 17:04 ` [PATCH v4 2/5] [RFC] x86: Add support for idle bit in swap PTE Joel Fernandes (Google)
@ 2019-08-05 17:04 ` Joel Fernandes (Google)
  2019-08-06  8:42   ` Michal Hocko
  2019-08-05 17:04 ` [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Joel Fernandes (Google) @ 2019-08-05 17:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Robin Murphy, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, joelaf, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, minchan, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

This bit will be used by idle page tracking code to correctly identify
if a page that was swapped out was idle before it got swapped out.
Without this PTE bit, we lose information about if a page is idle or not
since the page frame gets unmapped.

In this patch we reuse PTE_DEVMAP bit since idle page tracking only
works on user pages in the LRU. Device pages should not consitute those
so it should be unused and safe to use.

Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/arm64/Kconfig                    |  1 +
 arch/arm64/include/asm/pgtable-prot.h |  1 +
 arch/arm64/include/asm/pgtable.h      | 15 +++++++++++++++
 3 files changed, 17 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3adcec05b1f6..9d1412c693d7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -128,6 +128,7 @@ config ARM64
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
+	select HAVE_ARCH_PTE_SWP_PGIDLE
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 92d2e9f28f28..917b15c5d63a 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -18,6 +18,7 @@
 #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
 #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
 #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
+#define PTE_SWP_PGIDLE		PTE_DEVMAP		 /* for idle page tracking during swapout */
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3f5461f7b560..558f5ebd81ba 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -212,6 +212,21 @@ static inline pte_t pte_mkdevmap(pte_t pte)
 	return set_pte_bit(pte, __pgprot(PTE_DEVMAP));
 }
 
+static inline int pte_swp_page_idle(pte_t pte)
+{
+	return 0;
+}
+
+static inline pte_t pte_swp_mkpage_idle(pte_t pte)
+{
+	return set_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
+}
+
+static inline pte_t pte_swp_clear_page_idle(pte_t pte)
+{
+	return clear_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
+}
+
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
 	WRITE_ONCE(*ptep, pte);
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
  2019-08-05 17:04 [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes (Google)
  2019-08-05 17:04 ` [PATCH v4 2/5] [RFC] x86: Add support for idle bit in swap PTE Joel Fernandes (Google)
  2019-08-05 17:04 ` [PATCH v4 3/5] [RFC] arm64: " Joel Fernandes (Google)
@ 2019-08-05 17:04 ` Joel Fernandes (Google)
  2019-08-06  8:43   ` Michal Hocko
  2019-08-05 17:04 ` [PATCH v4 5/5] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Joel Fernandes (Google) @ 2019-08-05 17:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Alexey Dobriyan, Andrew Morton, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, joelaf, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, minchan, namhyung, paulmck,
	Robin Murphy, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

During idle tracking, we see that sometimes faulted anon pages are in
pagevec but are not drained to LRU. Idle tracking considers pages only
on LRU. Drain all CPU's LRU before starting idle tracking.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 mm/page_idle.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index a5b00d63216c..2972367a599f 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -180,6 +180,8 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 	unsigned long pfn, end_pfn;
 	int bit, ret;
 
+	lru_add_drain_all();
+
 	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
 	if (ret == -ENXIO)
 		return 0;  /* Reads beyond max_pfn do nothing */
@@ -211,6 +213,8 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
 	unsigned long pfn, end_pfn;
 	int bit, ret;
 
+	lru_add_drain_all();
+
 	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
 	if (ret)
 		return ret;
@@ -428,6 +432,8 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
 	walk.private = &priv;
 	walk.mm = mm;
 
+	lru_add_drain_all();
+
 	down_read(&mm->mmap_sem);
 
 	/*
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 5/5] doc: Update documentation for page_idle virtual address indexing
  2019-08-05 17:04 [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2019-08-05 17:04 ` [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking Joel Fernandes (Google)
@ 2019-08-05 17:04 ` Joel Fernandes (Google)
  2019-08-06  8:56 ` [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing Michal Hocko
  2019-08-06 22:19 ` Andrew Morton
  5 siblings, 0 replies; 29+ messages in thread
From: Joel Fernandes (Google) @ 2019-08-05 17:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Mike Rapoport, Sandeep Patil, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	joelaf, Jonathan Corbet, Kees Cook, kernel-team, linux-api,
	linux-doc, linux-fsdevel, linux-mm, Michal Hocko, minchan,
	namhyung, paulmck, Robin Murphy, Roman Gushchin,
	Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

This patch updates the documentation with the new page_idle tracking
feature which uses virtual address indexing.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Sandeep Patil <sspatil@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/mm/idle_page_tracking.rst     | 43 ++++++++++++++++---
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
index df9394fb39c2..9eef32000f5e 100644
--- a/Documentation/admin-guide/mm/idle_page_tracking.rst
+++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
@@ -19,10 +19,14 @@ It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
 
 User API
 ========
+There are 2 ways to access the idle page tracking API. One uses physical
+address indexing, another uses a simpler virtual address indexing scheme.
 
-The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
-Currently, it consists of the only read-write file,
-``/sys/kernel/mm/page_idle/bitmap``.
+Physical address indexing
+-------------------------
+The idle page tracking API for physical address indexing using page frame
+numbers (PFN) is located at ``/sys/kernel/mm/page_idle``.  Currently, it
+consists of the only read-write file, ``/sys/kernel/mm/page_idle/bitmap``.
 
 The file implements a bitmap where each bit corresponds to a memory page. The
 bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
@@ -74,6 +78,31 @@ See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
 information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
 ``/proc/kpagecgroup``.
 
+Virtual address indexing
+------------------------
+The idle page tracking API for virtual address indexing using virtual frame
+numbers (VFN) for a process ``<pid>`` is located at ``/proc/<pid>/page_idle``.
+It is a bitmap that follows the same semantics as
+``/sys/kernel/mm/page_idle/bitmap`` except that it uses virtual instead of
+physical frame numbers.
+
+This idle page tracking API does not deal with PFN so it does not require prior
+lookups of ``pagemap``. This is an advantage on some systems where looking up
+PFN is considered a security issue.  Also in some cases, this interface could
+be slightly more reliable to use than physical address indexing, since in
+physical address indexing, address space changes can occur between reading the
+``pagemap`` and reading the ``bitmap``, while in virtual address indexing, the
+process's ``mmap_sem`` is held for the duration of the access.
+
+To estimate the amount of pages that are not used by a workload one should:
+
+ 1. Mark all the workload's pages as idle by setting corresponding bits in
+    ``/proc/<pid>/page_idle``.
+
+ 2. Wait until the workload accesses its working set.
+
+ 3. Read ``/proc/<pid>/page_idle`` and count the number of bits set.
+
 .. _impl_details:
 
 Implementation Details
@@ -99,10 +128,10 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
 exceeding the dirty memory limit, it is not marked referenced.
 
 The idle memory tracking feature adds a new page flag, the Idle flag. This flag
-is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
-:ref:`User API <user_api>`
-section), and cleared automatically whenever a page is referenced as defined
-above.
+is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` for physical
+addressing or by writing to ``/proc/<pid>/page_idle`` for virtual
+addressing (see the :ref:`User API <user_api>` section), and cleared
+automatically whenever a page is referenced as defined above.
 
 When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
 mapped to, otherwise we will not be able to detect accesses to the page coming
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-05 17:04 ` [PATCH v4 3/5] [RFC] arm64: " Joel Fernandes (Google)
@ 2019-08-06  8:42   ` Michal Hocko
  2019-08-06 10:36     ` Joel Fernandes
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2019-08-06  8:42 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	joelaf, Jonathan Corbet, Kees Cook, kernel-team, linux-api,
	linux-doc, linux-fsdevel, linux-mm, Mike Rapoport, minchan,
	namhyung, paulmck, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> This bit will be used by idle page tracking code to correctly identify
> if a page that was swapped out was idle before it got swapped out.
> Without this PTE bit, we lose information about if a page is idle or not
> since the page frame gets unmapped.

And why do we need that? Why cannot we simply assume all swapped out
pages to be idle? They were certainly idle enough to be reclaimed,
right? Or what does idle actualy mean here?

> In this patch we reuse PTE_DEVMAP bit since idle page tracking only
> works on user pages in the LRU. Device pages should not consitute those
> so it should be unused and safe to use.
> 
> Cc: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  arch/arm64/Kconfig                    |  1 +
>  arch/arm64/include/asm/pgtable-prot.h |  1 +
>  arch/arm64/include/asm/pgtable.h      | 15 +++++++++++++++
>  3 files changed, 17 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 3adcec05b1f6..9d1412c693d7 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -128,6 +128,7 @@ config ARM64
>  	select HAVE_ARCH_MMAP_RND_BITS
>  	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
>  	select HAVE_ARCH_PREL32_RELOCATIONS
> +	select HAVE_ARCH_PTE_SWP_PGIDLE
>  	select HAVE_ARCH_SECCOMP_FILTER
>  	select HAVE_ARCH_STACKLEAK
>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> index 92d2e9f28f28..917b15c5d63a 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -18,6 +18,7 @@
>  #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
>  #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
>  #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
> +#define PTE_SWP_PGIDLE		PTE_DEVMAP		 /* for idle page tracking during swapout */
>  
>  #ifndef __ASSEMBLY__
>  
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 3f5461f7b560..558f5ebd81ba 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -212,6 +212,21 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>  	return set_pte_bit(pte, __pgprot(PTE_DEVMAP));
>  }
>  
> +static inline int pte_swp_page_idle(pte_t pte)
> +{
> +	return 0;
> +}
> +
> +static inline pte_t pte_swp_mkpage_idle(pte_t pte)
> +{
> +	return set_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
> +}
> +
> +static inline pte_t pte_swp_clear_page_idle(pte_t pte)
> +{
> +	return clear_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
> +}
> +
>  static inline void set_pte(pte_t *ptep, pte_t pte)
>  {
>  	WRITE_ONCE(*ptep, pte);
> -- 
> 2.22.0.770.g0f2c4a37fd-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
  2019-08-05 17:04 ` [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking Joel Fernandes (Google)
@ 2019-08-06  8:43   ` Michal Hocko
  2019-08-06 10:45     ` Joel Fernandes
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2019-08-06  8:43 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, joelaf, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Mon 05-08-19 13:04:50, Joel Fernandes (Google) wrote:
> During idle tracking, we see that sometimes faulted anon pages are in
> pagevec but are not drained to LRU. Idle tracking considers pages only
> on LRU. Drain all CPU's LRU before starting idle tracking.

Please expand on why does this matter enough to introduce a potentially
expensinve draining which has to schedule a work on each CPU and wait
for them to finish.

> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  mm/page_idle.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index a5b00d63216c..2972367a599f 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -180,6 +180,8 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
>  	unsigned long pfn, end_pfn;
>  	int bit, ret;
>  
> +	lru_add_drain_all();
> +
>  	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
>  	if (ret == -ENXIO)
>  		return 0;  /* Reads beyond max_pfn do nothing */
> @@ -211,6 +213,8 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
>  	unsigned long pfn, end_pfn;
>  	int bit, ret;
>  
> +	lru_add_drain_all();
> +
>  	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
>  	if (ret)
>  		return ret;
> @@ -428,6 +432,8 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
>  	walk.private = &priv;
>  	walk.mm = mm;
>  
> +	lru_add_drain_all();
> +
>  	down_read(&mm->mmap_sem);
>  
>  	/*
> -- 
> 2.22.0.770.g0f2c4a37fd-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
  2019-08-05 17:04 [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2019-08-05 17:04 ` [PATCH v4 5/5] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
@ 2019-08-06  8:56 ` Michal Hocko
  2019-08-06 10:47   ` Joel Fernandes
  2019-08-06 22:19 ` Andrew Morton
  5 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2019-08-06  8:56 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, joelaf, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Mon 05-08-19 13:04:47, Joel Fernandes (Google) wrote:
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> 
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
> 
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.

As already mentioned in one of the previous versions. The interface
seems sane and the usecase as well. So I do not really have high level
objections.

From a quick look at the patch I would just object to pulling swap idle
tracking into this patch because it makes the review harder and it is
essentially a dead code until a later patch. I am also not sure whether
that is really necessary and it really begs for an explicit
justification.

I will try to go through the patch more carefully later as time allows.

> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-06  8:42   ` Michal Hocko
@ 2019-08-06 10:36     ` Joel Fernandes
  2019-08-06 10:47       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Joel Fernandes @ 2019-08-06 10:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, minchan, namhyung,
	paulmck, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > This bit will be used by idle page tracking code to correctly identify
> > if a page that was swapped out was idle before it got swapped out.
> > Without this PTE bit, we lose information about if a page is idle or not
> > since the page frame gets unmapped.
> 
> And why do we need that? Why cannot we simply assume all swapped out
> pages to be idle? They were certainly idle enough to be reclaimed,
> right? Or what does idle actualy mean here?

Yes, but other than swapping, in Android a page can be forced to be swapped
out as well using the new hints that Minchan is adding?

Also, even if they were idle enough to be swapped, there is a chance that they
were marked as idle and *accessed* before the swapping. Due to swapping, the
"page was accessed since we last marked it as idle" information is lost. I am
able to verify this.

Idle in this context means the same thing as in page idle tracking terms, the
page was not accessed by userspace since we last marked it as idle (using
/proc/<pid>/page_idle).

thanks,

 - Joel


> > In this patch we reuse PTE_DEVMAP bit since idle page tracking only
> > works on user pages in the LRU. Device pages should not consitute those
> > so it should be unused and safe to use.
> > 
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  arch/arm64/Kconfig                    |  1 +
> >  arch/arm64/include/asm/pgtable-prot.h |  1 +
> >  arch/arm64/include/asm/pgtable.h      | 15 +++++++++++++++
> >  3 files changed, 17 insertions(+)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 3adcec05b1f6..9d1412c693d7 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -128,6 +128,7 @@ config ARM64
> >  	select HAVE_ARCH_MMAP_RND_BITS
> >  	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
> >  	select HAVE_ARCH_PREL32_RELOCATIONS
> > +	select HAVE_ARCH_PTE_SWP_PGIDLE
> >  	select HAVE_ARCH_SECCOMP_FILTER
> >  	select HAVE_ARCH_STACKLEAK
> >  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> > diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> > index 92d2e9f28f28..917b15c5d63a 100644
> > --- a/arch/arm64/include/asm/pgtable-prot.h
> > +++ b/arch/arm64/include/asm/pgtable-prot.h
> > @@ -18,6 +18,7 @@
> >  #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
> >  #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
> >  #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
> > +#define PTE_SWP_PGIDLE		PTE_DEVMAP		 /* for idle page tracking during swapout */
> >  
> >  #ifndef __ASSEMBLY__
> >  
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 3f5461f7b560..558f5ebd81ba 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -212,6 +212,21 @@ static inline pte_t pte_mkdevmap(pte_t pte)
> >  	return set_pte_bit(pte, __pgprot(PTE_DEVMAP));
> >  }
> >  
> > +static inline int pte_swp_page_idle(pte_t pte)
> > +{
> > +	return 0;
> > +}
> > +
> > +static inline pte_t pte_swp_mkpage_idle(pte_t pte)
> > +{
> > +	return set_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
> > +}
> > +
> > +static inline pte_t pte_swp_clear_page_idle(pte_t pte)
> > +{
> > +	return clear_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
> > +}
> > +
> >  static inline void set_pte(pte_t *ptep, pte_t pte)
> >  {
> >  	WRITE_ONCE(*ptep, pte);
> > -- 
> > 2.22.0.770.g0f2c4a37fd-goog
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
  2019-08-06  8:43   ` Michal Hocko
@ 2019-08-06 10:45     ` Joel Fernandes
  2019-08-06 10:51       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Joel Fernandes @ 2019-08-06 10:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Tue, Aug 06, 2019 at 10:43:57AM +0200, Michal Hocko wrote:
> On Mon 05-08-19 13:04:50, Joel Fernandes (Google) wrote:
> > During idle tracking, we see that sometimes faulted anon pages are in
> > pagevec but are not drained to LRU. Idle tracking considers pages only
> > on LRU. Drain all CPU's LRU before starting idle tracking.
> 
> Please expand on why does this matter enough to introduce a potentially
> expensinve draining which has to schedule a work on each CPU and wait
> for them to finish.

Sure, I can expand. I am able to find multiple issues involving this. One
issue looks like idle tracking is completely broken. It shows up in my
testing as if a page that is marked as idle is always "accessed" -- because
it was never marked as idle (due to not draining of pagevec).

The other issue shows up as a failure in my "swap test", with the following
sequence:
1. Allocate some pages
2. Write to them
3. Mark them as idle                                    <--- fails
4. Introduce some memory pressure to induce swapping.
5. Check the swap bit I introduced in this series.      <--- fails to set idle
                                                             bit in swap PTE.

Draining the pagevec in advance fixes both of these issues.

This operation even if expensive is only done once during the access of the
page_idle file. Did you have a better fix in mind?

thanks,

 - Joel


> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  mm/page_idle.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/mm/page_idle.c b/mm/page_idle.c
> > index a5b00d63216c..2972367a599f 100644
> > --- a/mm/page_idle.c
> > +++ b/mm/page_idle.c
> > @@ -180,6 +180,8 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> >  	unsigned long pfn, end_pfn;
> >  	int bit, ret;
> >  
> > +	lru_add_drain_all();
> > +
> >  	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> >  	if (ret == -ENXIO)
> >  		return 0;  /* Reads beyond max_pfn do nothing */
> > @@ -211,6 +213,8 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
> >  	unsigned long pfn, end_pfn;
> >  	int bit, ret;
> >  
> > +	lru_add_drain_all();
> > +
> >  	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> >  	if (ret)
> >  		return ret;
> > @@ -428,6 +432,8 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> >  	walk.private = &priv;
> >  	walk.mm = mm;
> >  
> > +	lru_add_drain_all();
> > +
> >  	down_read(&mm->mmap_sem);
> >  
> >  	/*
> > -- 
> > 2.22.0.770.g0f2c4a37fd-goog
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
  2019-08-06  8:56 ` [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing Michal Hocko
@ 2019-08-06 10:47   ` Joel Fernandes
  0 siblings, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2019-08-06 10:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Tue, Aug 06, 2019 at 10:56:05AM +0200, Michal Hocko wrote:
> On Mon 05-08-19 13:04:47, Joel Fernandes (Google) wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > Looking up PFN from pagemap in Android devices is not supported by
> > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> > the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> > looking up PFN through pagemap is not needed since the interface uses
> > virtual frame numbers, and at the same time also does not require
> > SYS_ADMIN.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time. This method solves the security issue
> > with userspace learning the PFN, and while at it is also shown to yield
> > better results than the pagemap lookup, the theory being that the window
> > where the address space can change is reduced by eliminating the
> > intermediate pagemap look up stage. In virtual address indexing, the
> > process's mmap_sem is held for the duration of the access.
> 
> As already mentioned in one of the previous versions. The interface
> seems sane and the usecase as well. So I do not really have high level
> objections.

That is great to know.

> From a quick look at the patch I would just object to pulling swap idle
> tracking into this patch because it makes the review harder and it is
> essentially a dead code until a later patch. I am also not sure whether
> that is really necessary and it really begs for an explicit
> justification.

Ok I will split it out, and also expand on the need for it a bit more.

> 
> I will try to go through the patch more carefully later as time allows.

Thanks a lot.

> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> -- 
> Michal Hocko
> SUSE Labs

 - Joel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-06 10:36     ` Joel Fernandes
@ 2019-08-06 10:47       ` Michal Hocko
  2019-08-06 11:07         ` Minchan Kim
  2019-08-06 11:14         ` Joel Fernandes
  0 siblings, 2 replies; 29+ messages in thread
From: Michal Hocko @ 2019-08-06 10:47 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, minchan, namhyung,
	paulmck, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > This bit will be used by idle page tracking code to correctly identify
> > > if a page that was swapped out was idle before it got swapped out.
> > > Without this PTE bit, we lose information about if a page is idle or not
> > > since the page frame gets unmapped.
> > 
> > And why do we need that? Why cannot we simply assume all swapped out
> > pages to be idle? They were certainly idle enough to be reclaimed,
> > right? Or what does idle actualy mean here?
> 
> Yes, but other than swapping, in Android a page can be forced to be swapped
> out as well using the new hints that Minchan is adding?

Yes and that is effectivelly making them idle, no?

> Also, even if they were idle enough to be swapped, there is a chance that they
> were marked as idle and *accessed* before the swapping. Due to swapping, the
> "page was accessed since we last marked it as idle" information is lost. I am
> able to verify this.
> 
> Idle in this context means the same thing as in page idle tracking terms, the
> page was not accessed by userspace since we last marked it as idle (using
> /proc/<pid>/page_idle).

Please describe a usecase and why that information might be useful.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
  2019-08-06 10:45     ` Joel Fernandes
@ 2019-08-06 10:51       ` Michal Hocko
  2019-08-06 11:19         ` Joel Fernandes
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2019-08-06 10:51 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Tue 06-08-19 06:45:54, Joel Fernandes wrote:
> On Tue, Aug 06, 2019 at 10:43:57AM +0200, Michal Hocko wrote:
> > On Mon 05-08-19 13:04:50, Joel Fernandes (Google) wrote:
> > > During idle tracking, we see that sometimes faulted anon pages are in
> > > pagevec but are not drained to LRU. Idle tracking considers pages only
> > > on LRU. Drain all CPU's LRU before starting idle tracking.
> > 
> > Please expand on why does this matter enough to introduce a potentially
> > expensinve draining which has to schedule a work on each CPU and wait
> > for them to finish.
> 
> Sure, I can expand. I am able to find multiple issues involving this. One
> issue looks like idle tracking is completely broken. It shows up in my
> testing as if a page that is marked as idle is always "accessed" -- because
> it was never marked as idle (due to not draining of pagevec).
> 
> The other issue shows up as a failure in my "swap test", with the following
> sequence:
> 1. Allocate some pages
> 2. Write to them
> 3. Mark them as idle                                    <--- fails
> 4. Introduce some memory pressure to induce swapping.
> 5. Check the swap bit I introduced in this series.      <--- fails to set idle
>                                                              bit in swap PTE.
> 
> Draining the pagevec in advance fixes both of these issues.

This belongs to the changelog.

> This operation even if expensive is only done once during the access of the
> page_idle file. Did you have a better fix in mind?

Can we set the idle bit also for non-lru pages as long as they are
reachable via pte?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-06 10:47       ` Michal Hocko
@ 2019-08-06 11:07         ` Minchan Kim
  2019-08-06 11:14           ` Michal Hocko
  2019-08-06 11:14         ` Joel Fernandes
  1 sibling, 1 reply; 29+ messages in thread
From: Minchan Kim @ 2019-08-06 11:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joel Fernandes, linux-kernel, Robin Murphy, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > This bit will be used by idle page tracking code to correctly identify
> > > > if a page that was swapped out was idle before it got swapped out.
> > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > since the page frame gets unmapped.
> > > 
> > > And why do we need that? Why cannot we simply assume all swapped out
> > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > right? Or what does idle actualy mean here?
> > 
> > Yes, but other than swapping, in Android a page can be forced to be swapped
> > out as well using the new hints that Minchan is adding?
> 
> Yes and that is effectivelly making them idle, no?

1. mark page-A idle which was present at that time.
2. run workload
3. page-A is touched several times
4. *sudden* memory pressure happen so finally page A is finally swapped out
5. now see the page A idle - but it's incorrect.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-06 10:47       ` Michal Hocko
  2019-08-06 11:07         ` Minchan Kim
@ 2019-08-06 11:14         ` Joel Fernandes
  2019-08-06 11:57           ` Michal Hocko
  1 sibling, 1 reply; 29+ messages in thread
From: Joel Fernandes @ 2019-08-06 11:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, minchan, namhyung,
	paulmck, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > This bit will be used by idle page tracking code to correctly identify
> > > > if a page that was swapped out was idle before it got swapped out.
> > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > since the page frame gets unmapped.
> > > 
> > > And why do we need that? Why cannot we simply assume all swapped out
> > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > right? Or what does idle actualy mean here?
> > 
> > Yes, but other than swapping, in Android a page can be forced to be swapped
> > out as well using the new hints that Minchan is adding?
> 
> Yes and that is effectivelly making them idle, no?

That depends on how you think of it. If you are thinking of a monitoring
process like a heap profiler, then from the heap profiler's (that only cares
about the process it is monitoring) perspective it will look extremely odd if
pages that are recently accessed by the process appear to be idle which would
falsely look like those processes are leaking memory. The reality being,
Android forced those pages into swap because of other reasons. I would like
for the swapping mechanism, whether forced swapping or memory reclaim, not to
interfere with the idle detection.

This is just an effort to make the idle tracking a little bit better. We
would like to not lose the 'accessed' information of the pages.

Initially, I had proposed what you are suggesting as well however the above
reasons made me to do it like this. Also Minchan and Konstantin suggested
this, so there are more people interested in the swap idle bit. Minchan, can
you provide more thoughts here? (He is on 2-week vacation from today so
hopefully replies before he vanishes ;-)).

Also assuming all swap pages as idle has other "semantic" issues. It is quite
odd if a swapped page is automatically marked as idle without userspace
telling it to. Consider the following set of events: 1. Userspace marks only
a certain memory region as idle. 2. Userspace reads back the bits
corresponding to a bigger region. Part of this bigger region is swapped.
Userspace expects all of the pages it did not mark, to have idle bit set to
'0' because it never marked them as idle. However if it is now surprised by
what it read back (not all '0' read back). Since a page is swapped, it will
be now marked "automatically" as idle as per your proposal, even if userspace
never marked it explicity before. This would be quite confusing/ambiguous.

I will include this and other information in future commit messages.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-06 11:07         ` Minchan Kim
@ 2019-08-06 11:14           ` Michal Hocko
  2019-08-06 11:26             ` Joel Fernandes
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2019-08-06 11:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Joel Fernandes, linux-kernel, Robin Murphy, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Tue 06-08-19 20:07:37, Minchan Kim wrote:
> On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > since the page frame gets unmapped.
> > > > 
> > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > right? Or what does idle actualy mean here?
> > > 
> > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > out as well using the new hints that Minchan is adding?
> > 
> > Yes and that is effectivelly making them idle, no?
> 
> 1. mark page-A idle which was present at that time.
> 2. run workload
> 3. page-A is touched several times
> 4. *sudden* memory pressure happen so finally page A is finally swapped out
> 5. now see the page A idle - but it's incorrect.

Could you expand on what you mean by idle exactly? Why pageout doesn't
really qualify as "mark-idle and reclaim"? Also could you describe a
usecase where the swapout distinction really matters and it would lead
to incorrect behavior?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
  2019-08-06 10:51       ` Michal Hocko
@ 2019-08-06 11:19         ` Joel Fernandes
  2019-08-06 11:44           ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Joel Fernandes @ 2019-08-06 11:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Tue, Aug 06, 2019 at 12:51:49PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 06:45:54, Joel Fernandes wrote:
> > On Tue, Aug 06, 2019 at 10:43:57AM +0200, Michal Hocko wrote:
> > > On Mon 05-08-19 13:04:50, Joel Fernandes (Google) wrote:
> > > > During idle tracking, we see that sometimes faulted anon pages are in
> > > > pagevec but are not drained to LRU. Idle tracking considers pages only
> > > > on LRU. Drain all CPU's LRU before starting idle tracking.
> > > 
> > > Please expand on why does this matter enough to introduce a potentially
> > > expensinve draining which has to schedule a work on each CPU and wait
> > > for them to finish.
> > 
> > Sure, I can expand. I am able to find multiple issues involving this. One
> > issue looks like idle tracking is completely broken. It shows up in my
> > testing as if a page that is marked as idle is always "accessed" -- because
> > it was never marked as idle (due to not draining of pagevec).
> > 
> > The other issue shows up as a failure in my "swap test", with the following
> > sequence:
> > 1. Allocate some pages
> > 2. Write to them
> > 3. Mark them as idle                                    <--- fails
> > 4. Introduce some memory pressure to induce swapping.
> > 5. Check the swap bit I introduced in this series.      <--- fails to set idle
> >                                                              bit in swap PTE.
> > 
> > Draining the pagevec in advance fixes both of these issues.
> 
> This belongs to the changelog.

Sure, will add.


> > This operation even if expensive is only done once during the access of the
> > page_idle file. Did you have a better fix in mind?
> 
> Can we set the idle bit also for non-lru pages as long as they are
> reachable via pte?

Not at the moment with the current page idle tracking code. PageLRU(page)
flag is checked in page_idle_get_page().

Even if we could set it for non-LRU, the idle bit (page flag) would not be
cleared if page is not on LRU because page-reclaim code (page_referenced() I
believe) would not clear it. This whole mechanism depends on page-reclaim. Or
did I miss your point?


thanks,

 - Joel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-06 11:14           ` Michal Hocko
@ 2019-08-06 11:26             ` Joel Fernandes
  0 siblings, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2019-08-06 11:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, linux-kernel, Robin Murphy, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Tue, Aug 06, 2019 at 01:14:52PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 20:07:37, Minchan Kim wrote:
> > On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > > since the page frame gets unmapped.
> > > > > 
> > > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > > right? Or what does idle actualy mean here?
> > > > 
> > > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > > out as well using the new hints that Minchan is adding?
> > > 
> > > Yes and that is effectivelly making them idle, no?
> > 
> > 1. mark page-A idle which was present at that time.
> > 2. run workload
> > 3. page-A is touched several times
> > 4. *sudden* memory pressure happen so finally page A is finally swapped out
> > 5. now see the page A idle - but it's incorrect.
> 
> Could you expand on what you mean by idle exactly? Why pageout doesn't
> really qualify as "mark-idle and reclaim"? Also could you describe a
> usecase where the swapout distinction really matters and it would lead
> to incorrect behavior?

Michal,
Did you read this post ? :
https://lore.kernel.org/lkml/20190806104715.GC218260@google.com/T/#m4ece68ceaf6e54d4d29e974f5f4c1080e733f6c1

Just wanted to be sure you did not miss it.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
  2019-08-06 11:19         ` Joel Fernandes
@ 2019-08-06 11:44           ` Michal Hocko
  2019-08-06 13:48             ` Joel Fernandes
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2019-08-06 11:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Tue 06-08-19 07:19:21, Joel Fernandes wrote:
> On Tue, Aug 06, 2019 at 12:51:49PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 06:45:54, Joel Fernandes wrote:
> > > On Tue, Aug 06, 2019 at 10:43:57AM +0200, Michal Hocko wrote:
> > > > On Mon 05-08-19 13:04:50, Joel Fernandes (Google) wrote:
> > > > > During idle tracking, we see that sometimes faulted anon pages are in
> > > > > pagevec but are not drained to LRU. Idle tracking considers pages only
> > > > > on LRU. Drain all CPU's LRU before starting idle tracking.
> > > > 
> > > > Please expand on why does this matter enough to introduce a potentially
> > > > expensinve draining which has to schedule a work on each CPU and wait
> > > > for them to finish.
> > > 
> > > Sure, I can expand. I am able to find multiple issues involving this. One
> > > issue looks like idle tracking is completely broken. It shows up in my
> > > testing as if a page that is marked as idle is always "accessed" -- because
> > > it was never marked as idle (due to not draining of pagevec).
> > > 
> > > The other issue shows up as a failure in my "swap test", with the following
> > > sequence:
> > > 1. Allocate some pages
> > > 2. Write to them
> > > 3. Mark them as idle                                    <--- fails
> > > 4. Introduce some memory pressure to induce swapping.
> > > 5. Check the swap bit I introduced in this series.      <--- fails to set idle
> > >                                                              bit in swap PTE.
> > > 
> > > Draining the pagevec in advance fixes both of these issues.
> > 
> > This belongs to the changelog.
> 
> Sure, will add.
> 
> 
> > > This operation even if expensive is only done once during the access of the
> > > page_idle file. Did you have a better fix in mind?
> > 
> > Can we set the idle bit also for non-lru pages as long as they are
> > reachable via pte?
> 
> Not at the moment with the current page idle tracking code. PageLRU(page)
> flag is checked in page_idle_get_page().

yes, I am aware of the current code. I strongly suspect that the PageLRU
check was there to not mark arbitrary page looked up by pfn with the
idle bit because that would be unexpected. But I might be easily wrong
here.

> Even if we could set it for non-LRU, the idle bit (page flag) would not be
> cleared if page is not on LRU because page-reclaim code (page_referenced() I
> believe) would not clear it.

Yes, it is either reclaim when checking references as you say but also
mark_page_accessed. I believe the later might still have the page on the
pcp LRU add cache. Maybe I am missing something something but it seems
that there is nothing fundamentally requiring the user mapped page to be
on the LRU list when seting the idle bit.

That being said, your big hammer approach will work more reliable but if
you do not feel like changing the underlying PageLRU assumption then
document that draining should be removed longterm.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-06 11:14         ` Joel Fernandes
@ 2019-08-06 11:57           ` Michal Hocko
  2019-08-06 13:43             ` Joel Fernandes
  2019-08-06 14:47             ` Minchan Kim
  0 siblings, 2 replies; 29+ messages in thread
From: Michal Hocko @ 2019-08-06 11:57 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, minchan, namhyung,
	paulmck, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

On Tue 06-08-19 07:14:46, Joel Fernandes wrote:
> On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > since the page frame gets unmapped.
> > > > 
> > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > right? Or what does idle actualy mean here?
> > > 
> > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > out as well using the new hints that Minchan is adding?
> > 
> > Yes and that is effectivelly making them idle, no?
> 
> That depends on how you think of it.

I would much prefer to have it documented so that I do not have to guess ;)

> If you are thinking of a monitoring
> process like a heap profiler, then from the heap profiler's (that only cares
> about the process it is monitoring) perspective it will look extremely odd if
> pages that are recently accessed by the process appear to be idle which would
> falsely look like those processes are leaking memory. The reality being,
> Android forced those pages into swap because of other reasons. I would like
> for the swapping mechanism, whether forced swapping or memory reclaim, not to
> interfere with the idle detection.

Hmm, but how are you going to handle situation when the page is unmapped
and refaulted again (e.g. a normal reclaim of a pagecache)? You are
losing that information same was as in the swapout case, no? Or am I
missing something?

> This is just an effort to make the idle tracking a little bit better. We
> would like to not lose the 'accessed' information of the pages.
> 
> Initially, I had proposed what you are suggesting as well however the above
> reasons made me to do it like this. Also Minchan and Konstantin suggested
> this, so there are more people interested in the swap idle bit. Minchan, can
> you provide more thoughts here? (He is on 2-week vacation from today so
> hopefully replies before he vanishes ;-)).

We can move on with the rest of the series in the mean time but I would
like to see a proper justification for the swap entries and why they
should be handled special.

> Also assuming all swap pages as idle has other "semantic" issues. It is quite
> odd if a swapped page is automatically marked as idle without userspace
> telling it to. Consider the following set of events: 1. Userspace marks only
> a certain memory region as idle. 2. Userspace reads back the bits
> corresponding to a bigger region. Part of this bigger region is swapped.
> Userspace expects all of the pages it did not mark, to have idle bit set to
> '0' because it never marked them as idle. However if it is now surprised by
> what it read back (not all '0' read back). Since a page is swapped, it will
> be now marked "automatically" as idle as per your proposal, even if userspace
> never marked it explicity before. This would be quite confusing/ambiguous.

OK, I see. I guess the primary question I have is how do you distinguish
Idle page which got unmapped and faulted in again from swapped out page
and refaulted - including the time the pte is not present.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-06 11:57           ` Michal Hocko
@ 2019-08-06 13:43             ` Joel Fernandes
  2019-08-06 14:09               ` Michal Hocko
  2019-08-06 14:47             ` Minchan Kim
  1 sibling, 1 reply; 29+ messages in thread
From: Joel Fernandes @ 2019-08-06 13:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, minchan, namhyung,
	paulmck, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

On Tue, Aug 06, 2019 at 01:57:03PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 07:14:46, Joel Fernandes wrote:
> > On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > > since the page frame gets unmapped.
> > > > > 
> > > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > > right? Or what does idle actualy mean here?
> > > > 
> > > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > > out as well using the new hints that Minchan is adding?
> > > 
> > > Yes and that is effectivelly making them idle, no?
> > 
> > That depends on how you think of it.
> 
> I would much prefer to have it documented so that I do not have to guess ;)

Sure :)

> > If you are thinking of a monitoring
> > process like a heap profiler, then from the heap profiler's (that only cares
> > about the process it is monitoring) perspective it will look extremely odd if
> > pages that are recently accessed by the process appear to be idle which would
> > falsely look like those processes are leaking memory. The reality being,
> > Android forced those pages into swap because of other reasons. I would like
> > for the swapping mechanism, whether forced swapping or memory reclaim, not to
> > interfere with the idle detection.
> 
> Hmm, but how are you going to handle situation when the page is unmapped
> and refaulted again (e.g. a normal reclaim of a pagecache)? You are
> losing that information same was as in the swapout case, no? Or am I
> missing something?

Yes you are right, it would have the same issue, thanks for bringing it up.
Should we rename this bit to PTE_IDLE and do the same thing that we are doing
for swap?

i.e. if (page_idle(page)) and page is a file page, then we write state
into the PTE of the page. Later on refault, the PTE bit would automatically
get cleared (just like it does on swap-in). But before refault, the idle
tracking code sees the page as still marked idle. Do you see any issue with that?


> > This is just an effort to make the idle tracking a little bit better. We
> > would like to not lose the 'accessed' information of the pages.
> > 
> > Initially, I had proposed what you are suggesting as well however the above
> > reasons made me to do it like this. Also Minchan and Konstantin suggested
> > this, so there are more people interested in the swap idle bit. Minchan, can
> > you provide more thoughts here? (He is on 2-week vacation from today so
> > hopefully replies before he vanishes ;-)).
> 
> We can move on with the rest of the series in the mean time but I would
> like to see a proper justification for the swap entries and why they
> should be handled special.

Ok, I will improve the changelog.


> > Also assuming all swap pages as idle has other "semantic" issues. It is quite
> > odd if a swapped page is automatically marked as idle without userspace
> > telling it to. Consider the following set of events: 1. Userspace marks only
> > a certain memory region as idle. 2. Userspace reads back the bits
> > corresponding to a bigger region. Part of this bigger region is swapped.
> > Userspace expects all of the pages it did not mark, to have idle bit set to
> > '0' because it never marked them as idle. However if it is now surprised by
> > what it read back (not all '0' read back). Since a page is swapped, it will
> > be now marked "automatically" as idle as per your proposal, even if userspace
> > never marked it explicity before. This would be quite confusing/ambiguous.
> 
> OK, I see. I guess the primary question I have is how do you distinguish
> Idle page which got unmapped and faulted in again from swapped out page
> and refaulted - including the time the pte is not present.

Ok, lets discuss more.

thanks Michal!

 - Joel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking
  2019-08-06 11:44           ` Michal Hocko
@ 2019-08-06 13:48             ` Joel Fernandes
  0 siblings, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2019-08-06 13:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
	Brendan Gregg, Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Tue, Aug 06, 2019 at 01:44:02PM +0200, Michal Hocko wrote:
[snip]
> > > > This operation even if expensive is only done once during the access of the
> > > > page_idle file. Did you have a better fix in mind?
> > > 
> > > Can we set the idle bit also for non-lru pages as long as they are
> > > reachable via pte?
> > 
> > Not at the moment with the current page idle tracking code. PageLRU(page)
> > flag is checked in page_idle_get_page().
> 
> yes, I am aware of the current code. I strongly suspect that the PageLRU
> check was there to not mark arbitrary page looked up by pfn with the
> idle bit because that would be unexpected. But I might be easily wrong
> here.

Yes, quite possible.

> > Even if we could set it for non-LRU, the idle bit (page flag) would not be
> > cleared if page is not on LRU because page-reclaim code (page_referenced() I
> > believe) would not clear it.
> 
> Yes, it is either reclaim when checking references as you say but also
> mark_page_accessed. I believe the later might still have the page on the
> pcp LRU add cache. Maybe I am missing something something but it seems
> that there is nothing fundamentally requiring the user mapped page to be
> on the LRU list when seting the idle bit.
> 
> That being said, your big hammer approach will work more reliable but if
> you do not feel like changing the underlying PageLRU assumption then
> document that draining should be removed longterm.

Yes, at the moment I am in preference of keeping the underlying assumption
same. I am Ok with adding of a comment on the drain call that it is to be
removed longterm.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-06 13:43             ` Joel Fernandes
@ 2019-08-06 14:09               ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2019-08-06 14:09 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Robin Murphy, Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, minchan, namhyung,
	paulmck, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

On Tue 06-08-19 09:43:21, Joel Fernandes wrote:
> On Tue, Aug 06, 2019 at 01:57:03PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 07:14:46, Joel Fernandes wrote:
> > > On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > > > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > > > since the page frame gets unmapped.
> > > > > > 
> > > > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > > > right? Or what does idle actualy mean here?
> > > > > 
> > > > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > > > out as well using the new hints that Minchan is adding?
> > > > 
> > > > Yes and that is effectivelly making them idle, no?
> > > 
> > > That depends on how you think of it.
> > 
> > I would much prefer to have it documented so that I do not have to guess ;)
> 
> Sure :)
> 
> > > If you are thinking of a monitoring
> > > process like a heap profiler, then from the heap profiler's (that only cares
> > > about the process it is monitoring) perspective it will look extremely odd if
> > > pages that are recently accessed by the process appear to be idle which would
> > > falsely look like those processes are leaking memory. The reality being,
> > > Android forced those pages into swap because of other reasons. I would like
> > > for the swapping mechanism, whether forced swapping or memory reclaim, not to
> > > interfere with the idle detection.
> > 
> > Hmm, but how are you going to handle situation when the page is unmapped
> > and refaulted again (e.g. a normal reclaim of a pagecache)? You are
> > losing that information same was as in the swapout case, no? Or am I
> > missing something?
> 
> Yes you are right, it would have the same issue, thanks for bringing it up.
> Should we rename this bit to PTE_IDLE and do the same thing that we are doing
> for swap?

What if we decide to tear the page table down as well? E.g. because we
can reclaim file backed mappings and free some memory used for page
tables. We do not do that right now but I can see that really large
mappings might push us that direction. Sure this is mostly a theoretical
concern but I am wondering whether promissing to keep the idle bit over
unmapping is not too much.

I am not sure how to deal with this myself, TBH. In any case the current
semantic - via pfn - will lose the idle bit already so can we mimic it
as well? We only have 1 bit for each address which makes it challenging.
The easiest way would be to declare that the idle bit might disappear on
activating or reclaiming the page. How well that suits different
usecases is a different question. I would be interested in hearing from
other people about this of course.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-06 11:57           ` Michal Hocko
  2019-08-06 13:43             ` Joel Fernandes
@ 2019-08-06 14:47             ` Minchan Kim
  2019-08-06 15:20               ` Joel Fernandes
  1 sibling, 1 reply; 29+ messages in thread
From: Minchan Kim @ 2019-08-06 14:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joel Fernandes, linux-kernel, Robin Murphy, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Tue, Aug 06, 2019 at 01:57:03PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 07:14:46, Joel Fernandes wrote:
> > On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > > since the page frame gets unmapped.
> > > > > 
> > > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > > right? Or what does idle actualy mean here?
> > > > 
> > > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > > out as well using the new hints that Minchan is adding?
> > > 
> > > Yes and that is effectivelly making them idle, no?
> > 
> > That depends on how you think of it.
> 
> I would much prefer to have it documented so that I do not have to guess ;)
> 
> > If you are thinking of a monitoring
> > process like a heap profiler, then from the heap profiler's (that only cares
> > about the process it is monitoring) perspective it will look extremely odd if
> > pages that are recently accessed by the process appear to be idle which would
> > falsely look like those processes are leaking memory. The reality being,
> > Android forced those pages into swap because of other reasons. I would like
> > for the swapping mechanism, whether forced swapping or memory reclaim, not to
> > interfere with the idle detection.
> 
> Hmm, but how are you going to handle situation when the page is unmapped
> and refaulted again (e.g. a normal reclaim of a pagecache)? You are
> losing that information same was as in the swapout case, no? Or am I
> missing something?

If page is unmapped, it's not a idle memory any longer because it's
free memory. We could detect the pte is not present.

If page is refaulted, it's not a idle memory any longer because it's
accessed again. We could detect it because the newly allocated page
doesn't have a PG_idle page flag.

Both case, idle page tracking couldn't report them as IDLE so it's okay.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/5] [RFC] arm64: Add support for idle bit in swap PTE
  2019-08-06 14:47             ` Minchan Kim
@ 2019-08-06 15:20               ` Joel Fernandes
  0 siblings, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2019-08-06 15:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, linux-kernel, Robin Murphy, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Roman Gushchin, Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon

On Tue, Aug 06, 2019 at 11:47:47PM +0900, Minchan Kim wrote:
> On Tue, Aug 06, 2019 at 01:57:03PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 07:14:46, Joel Fernandes wrote:
> > > On Tue, Aug 06, 2019 at 12:47:55PM +0200, Michal Hocko wrote:
> > > > On Tue 06-08-19 06:36:27, Joel Fernandes wrote:
> > > > > On Tue, Aug 06, 2019 at 10:42:03AM +0200, Michal Hocko wrote:
> > > > > > On Mon 05-08-19 13:04:49, Joel Fernandes (Google) wrote:
> > > > > > > This bit will be used by idle page tracking code to correctly identify
> > > > > > > if a page that was swapped out was idle before it got swapped out.
> > > > > > > Without this PTE bit, we lose information about if a page is idle or not
> > > > > > > since the page frame gets unmapped.
> > > > > > 
> > > > > > And why do we need that? Why cannot we simply assume all swapped out
> > > > > > pages to be idle? They were certainly idle enough to be reclaimed,
> > > > > > right? Or what does idle actualy mean here?
> > > > > 
> > > > > Yes, but other than swapping, in Android a page can be forced to be swapped
> > > > > out as well using the new hints that Minchan is adding?
> > > > 
> > > > Yes and that is effectivelly making them idle, no?
> > > 
> > > That depends on how you think of it.
> > 
> > I would much prefer to have it documented so that I do not have to guess ;)
> > 
> > > If you are thinking of a monitoring
> > > process like a heap profiler, then from the heap profiler's (that only cares
> > > about the process it is monitoring) perspective it will look extremely odd if
> > > pages that are recently accessed by the process appear to be idle which would
> > > falsely look like those processes are leaking memory. The reality being,
> > > Android forced those pages into swap because of other reasons. I would like
> > > for the swapping mechanism, whether forced swapping or memory reclaim, not to
> > > interfere with the idle detection.
> > 
> > Hmm, but how are you going to handle situation when the page is unmapped
> > and refaulted again (e.g. a normal reclaim of a pagecache)? You are
> > losing that information same was as in the swapout case, no? Or am I
> > missing something?
> 
> If page is unmapped, it's not a idle memory any longer because it's
> free memory. We could detect the pte is not present.

I think Michal is not talking of explictly being unmapped, but about the case
where a file-backed mapped page is unmapped due to memory pressure ? This is
similar to the swap situation.

Basically... file page is marked idle, then it is accessed by userspace. Then
memory pressure drops it off the page cache so the idle information is lost.
Next time we check the page_idle, we miss that it was accessed indeed.

It is not an issue for the heap profiler or anonymous memory per-se. But is
similar to the swap situation.

> If page is refaulted, it's not a idle memory any longer because it's
> accessed again. We could detect it because the newly allocated page
> doesn't have a PG_idle page flag.

In the refault case, yes it should not be a problem.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
  2019-08-05 17:04 [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes (Google)
                   ` (4 preceding siblings ...)
  2019-08-06  8:56 ` [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing Michal Hocko
@ 2019-08-06 22:19 ` Andrew Morton
  2019-08-07 10:00   ` Joel Fernandes
  5 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2019-08-06 22:19 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Dobriyan, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, joelaf, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, minchan, namhyung, paulmck,
	Robin Murphy, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon, Brendan Gregg

(cc Brendan's other email address, hoping for review input ;))

On Mon,  5 Aug 2019 13:04:47 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:

> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> 
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
> 
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.

Quite a lot of changes to the page_idle code.  Has this all been
runtime tested on architectures where
CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE=n?  That could be x86 with a little
Kconfig fiddle-for-testing-purposes.

> 8 files changed, 376 insertions(+), 45 deletions(-)

Quite a lot of new code unconditionally added to major architectures. 
Are we confident that everyone will want this feature?

>
> ...
>
> +static int proc_page_idle_open(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm;
> +
> +	mm = proc_mem_open(inode, PTRACE_MODE_READ);
> +	if (IS_ERR(mm))
> +		return PTR_ERR(mm);
> +	file->private_data = mm;
> +	return 0;
> +}
> +
> +static int proc_page_idle_release(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm = file->private_data;
> +
> +	if (mm)

I suspect the test isn't needed?  proc_page_idle_release) won't be
called if proc_page_idle_open() failed?

> +		mmdrop(mm);
> +	return 0;
> +}
>
> ...
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
  2019-08-06 22:19 ` Andrew Morton
@ 2019-08-07 10:00   ` Joel Fernandes
  2019-08-07 20:01     ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Joel Fernandes @ 2019-08-07 10:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexey Dobriyan, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, minchan, namhyung, paulmck,
	Robin Murphy, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon, Brendan Gregg

On Tue, Aug 06, 2019 at 03:19:21PM -0700, Andrew Morton wrote:
> (cc Brendan's other email address, hoping for review input ;))

;)

> On Mon,  5 Aug 2019 13:04:47 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> 
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > Looking up PFN from pagemap in Android devices is not supported by
> > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> > the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> > looking up PFN through pagemap is not needed since the interface uses
> > virtual frame numbers, and at the same time also does not require
> > SYS_ADMIN.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time. This method solves the security issue
> > with userspace learning the PFN, and while at it is also shown to yield
> > better results than the pagemap lookup, the theory being that the window
> > where the address space can change is reduced by eliminating the
> > intermediate pagemap look up stage. In virtual address indexing, the
> > process's mmap_sem is held for the duration of the access.
> 
> Quite a lot of changes to the page_idle code.  Has this all been
> runtime tested on architectures where
> CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE=n?  That could be x86 with a little
> Kconfig fiddle-for-testing-purposes.

I will do this Kconfig fiddle test with CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE=n and test
the patch as well.

In previous series, this flag was not there (which should have been
equivalent to the above test), and things are working fine.

> > 8 files changed, 376 insertions(+), 45 deletions(-)
> 
> Quite a lot of new code unconditionally added to major architectures. 
> Are we confident that everyone will want this feature?

I did not follow, could you clarify more? All of this diff stat is not to
architecture code:

 arch/Kconfig                  |   3 ++
 fs/proc/base.c                |   3 ++
 fs/proc/internal.h            |   1 +
 fs/proc/task_mmu.c            |  43 +++++++++++++++++++++
 include/asm-generic/pgtable.h |   6 +++
 include/linux/page_idle.h     |   4 ++
 mm/page_idle.c                | 359 +++++++++++++++++++++++++++++..
 mm/rmap.c                     |   2 +
 8 files changed, 376 insertions(+), 45 deletions(-)

The arcitecture change is in a later patch, and is not that many lines.

Also, I am planning to split the swap functionality of the patch into a
separate one for easier review.

> > +static int proc_page_idle_open(struct inode *inode, struct file *file)
> > +{
> > +	struct mm_struct *mm;
> > +
> > +	mm = proc_mem_open(inode, PTRACE_MODE_READ);
> > +	if (IS_ERR(mm))
> > +		return PTR_ERR(mm);
> > +	file->private_data = mm;
> > +	return 0;
> > +}
> > +
> > +static int proc_page_idle_release(struct inode *inode, struct file *file)
> > +{
> > +	struct mm_struct *mm = file->private_data;
> > +
> > +	if (mm)
> 
> I suspect the test isn't needed?  proc_page_idle_release) won't be
> called if proc_page_idle_open() failed?

Yes you are right, will remove the test.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
  2019-08-07 10:00   ` Joel Fernandes
@ 2019-08-07 20:01     ` Andrew Morton
  2019-08-07 20:44       ` Joel Fernandes
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2019-08-07 20:01 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Alexey Dobriyan, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, minchan, namhyung, paulmck,
	Robin Murphy, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon, Brendan Gregg

On Wed, 7 Aug 2019 06:00:13 -0400 Joel Fernandes <joel@joelfernandes.org> wrote:

> > > 8 files changed, 376 insertions(+), 45 deletions(-)
> > 
> > Quite a lot of new code unconditionally added to major architectures. 
> > Are we confident that everyone will want this feature?
> 
> I did not follow, could you clarify more? All of this diff stat is not to
> architecture code:


My point is that the patchset adds a lot of new code with no way in
which users can opt out.  Almost everyone gets a fatter kernel - how
many of those users will actually benefit from it?

If "not many" then shouldn't we be making it Kconfigurable?

Are there userspace tools which present this info to users or which
provide monitoring of some form?  Do major distros ship those tools? 
Do people use them?  etcetera.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
  2019-08-07 20:01     ` Andrew Morton
@ 2019-08-07 20:44       ` Joel Fernandes
  0 siblings, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2019-08-07 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexey Dobriyan, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, dancol, fmayer,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Kees Cook,
	kernel-team, linux-api, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, minchan, namhyung, paulmck,
	Robin Murphy, Roman Gushchin, Stephen Rothwell, surenb,
	Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon, Brendan Gregg

On Wed, Aug 07, 2019 at 01:01:22PM -0700, Andrew Morton wrote:
> On Wed, 7 Aug 2019 06:00:13 -0400 Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > > > 8 files changed, 376 insertions(+), 45 deletions(-)
> > > 
> > > Quite a lot of new code unconditionally added to major architectures. 
> > > Are we confident that everyone will want this feature?
> > 
> > I did not follow, could you clarify more? All of this diff stat is not to
> > architecture code:
> 
> 
> My point is that the patchset adds a lot of new code with no way in
> which users can opt out.  Almost everyone gets a fatter kernel - how
> many of those users will actually benefit from it?
> 
> If "not many" then shouldn't we be making it Kconfigurable?

Almost all of this code is already configurable with
CONFIG_IDLE_PAGE_TRACKING. If you disable it, then all of this code gets
disabled.

Or are you referring to something else that needs to be made configurable?

> Are there userspace tools which present this info to users or which
> provide monitoring of some form?  Do major distros ship those tools? 
> Do people use them?  etcetera.
> 

Android's heapprofd is what I was working on which is already using it (patch
is not yet upstreamed). There is working set tracking which Sandeep (also
from Android) said he wants to use. Minchan plans to use this in combination
with ZRAM-based idle tracking. Mike Rappoport also showed some interest, but
I am not sure where/how he is using it. These are just some of the usecases I
am aware off. I am pretty sure more will come as well.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2019-08-07 20:44 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-05 17:04 [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing Joel Fernandes (Google)
2019-08-05 17:04 ` [PATCH v4 2/5] [RFC] x86: Add support for idle bit in swap PTE Joel Fernandes (Google)
2019-08-05 17:04 ` [PATCH v4 3/5] [RFC] arm64: " Joel Fernandes (Google)
2019-08-06  8:42   ` Michal Hocko
2019-08-06 10:36     ` Joel Fernandes
2019-08-06 10:47       ` Michal Hocko
2019-08-06 11:07         ` Minchan Kim
2019-08-06 11:14           ` Michal Hocko
2019-08-06 11:26             ` Joel Fernandes
2019-08-06 11:14         ` Joel Fernandes
2019-08-06 11:57           ` Michal Hocko
2019-08-06 13:43             ` Joel Fernandes
2019-08-06 14:09               ` Michal Hocko
2019-08-06 14:47             ` Minchan Kim
2019-08-06 15:20               ` Joel Fernandes
2019-08-05 17:04 ` [PATCH v4 4/5] page_idle: Drain all LRU pagevec before idle tracking Joel Fernandes (Google)
2019-08-06  8:43   ` Michal Hocko
2019-08-06 10:45     ` Joel Fernandes
2019-08-06 10:51       ` Michal Hocko
2019-08-06 11:19         ` Joel Fernandes
2019-08-06 11:44           ` Michal Hocko
2019-08-06 13:48             ` Joel Fernandes
2019-08-05 17:04 ` [PATCH v4 5/5] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
2019-08-06  8:56 ` [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing Michal Hocko
2019-08-06 10:47   ` Joel Fernandes
2019-08-06 22:19 ` Andrew Morton
2019-08-07 10:00   ` Joel Fernandes
2019-08-07 20:01     ` Andrew Morton
2019-08-07 20:44       ` Joel Fernandes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).