From: Fengguang Wu <fengguang.wu@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Linux Memory Management List <linux-mm@kvack.org>,
	Huang Ying <ying.huang@intel.com>,
	Brendan Gregg <bgregg@netflix.com>,
	Fengguang Wu <fengguang.wu@intel.com>
Cc: Peng DongX <dongx.peng@intel.com>
Cc: Liu Jingqi <jingqi.liu@intel.com>
Cc: Dong Eddie <eddie.dong@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: kvm@vger.kernel.org
Cc: LKML <linux-kernel@vger.kernel.org>
Subject: [RFC][PATCH 2/5] [PATCH 2/5] proc: introduce /proc/PID/idle_bitmap
Date: Sat, 01 Sep 2018 19:28:20 +0800
Message-ID: <20180901124811.530300789@intel.com>
In-Reply-To: 20180901112818.126790961@intel.com

[-- Attachment #1: 0002-proc-introduce-proc-PID-idle_bitmap.patch --]
[-- Type: text/plain, Size: 5116 bytes --]

This will be similar to /sys/kernel/mm/page_idle/bitmap documented in
Documentation/admin-guide/mm/idle_page_tracking.rst, however indexed
by process virtual address.

When using the global PFN indexed idle bitmap, we find 2 kinds of
overhead:

- to track a task's working set, Brendan Gregg ended up writing wss-v1
  for small tasks and wss-v2 for large tasks:

  https://github.com/brendangregg/wss

  That's because VAs may point to random PAs throughout the physical
  address space. So we either query /proc/pid/pagemap first and access
  lots of random PFNs (with lots of syscalls) in the bitmap, or
  write+read the whole system idle bitmap beforehand.

- page table walking by PFN has much more overhead than walking a page
  table in its natural order:
  - rmap queries
  - more locking
  - random memory reads/writes

This interface provides a cheap path for the majority of non-shared
mapped pages. To walk 1TB of memory in active 4k pages, it costs 2s vs
15s of system time to scan the per-task vs global idle bitmaps, which
means a ~7x speedup. The gap will widen if we consider

- the extra /proc/pid/pagemap walk
- natural page table walks can skip the whole 512 PTEs if PMD is idle

OTOH, the per-task idle bitmap is not suitable in some situations:

- not accurate for shared pages
- doesn't work with non-mapped file pages
- doesn't perform well for sparse page tables (pointed out by Huang Ying)

So it's more about complementing the existing global idle bitmap.

CC: Huang Ying <ying.huang@intel.com>
CC: Brendan Gregg <bgregg@netflix.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/proc/base.c     |  2 ++
 fs/proc/internal.h |  1 +
 fs/proc/task_mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index aaffc0c30216..d81322b5b8d2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2942,6 +2942,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps", S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap", S_IRUSR, proc_pagemap_operations),
+	REG("idle_bitmap", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
@@ -3327,6 +3328,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("smaps", S_IRUGO, proc_tid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap", S_IRUSR, proc_pagemap_operations),
+	REG("idle_bitmap", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index da3dbfa09e79..732a502acc27 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -305,6 +305,7 @@ extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_tid_smaps_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_mm_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfd73a4616ce..376406a9cf45 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1564,6 +1564,69 @@ const struct file_operations proc_pagemap_operations = {
 	.open = pagemap_open,
 	.release = pagemap_release,
 };
+
+/* will be filled when kvm_ept_idle module loads */
+struct file_operations proc_ept_idle_operations = {
+};
+EXPORT_SYMBOL_GPL(proc_ept_idle_operations);
+
+static ssize_t mm_idle_read(struct file *file, char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	struct task_struct *task = file->private_data;
+	ssize_t ret = -ESRCH;
+
+	// TODO: implement mm_walk for normal tasks
+
+	if (task_kvm(task)) {
+		if (proc_ept_idle_operations.read)
+			return proc_ept_idle_operations.read(file, buf, count, ppos);
+	}
+
+	return ret;
+}
+
+
+static int mm_idle_open(struct inode *inode, struct file *file)
+{
+	struct task_struct *task = get_proc_task(inode);
+
+	if (!task)
+		return -ESRCH;
+
+	file->private_data = task;
+
+	if (task_kvm(task)) {
+		if (proc_ept_idle_operations.open)
+			return proc_ept_idle_operations.open(inode, file);
+	}
+
+	return 0;
+}
+
+static int mm_idle_release(struct inode *inode, struct file *file)
+{
+	struct task_struct *task = file->private_data;
+
+	if (!task)
+		return 0;
+
+	if (task_kvm(task)) {
+		if (proc_ept_idle_operations.release)
+			return proc_ept_idle_operations.release(inode, file);
+	}
+
+	put_task_struct(task);
+	return 0;
+}
+
+const struct file_operations proc_mm_idle_operations = {
+	.llseek = mem_lseek, /* borrow this */
+	.read = mm_idle_read,
+	.open = mm_idle_open,
+	.release = mm_idle_release,
+};
+
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
-- 
2.15.0
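
For readers unfamiliar with the two-step lookup the commit message
describes (query /proc/pid/pagemap first, then index the global idle
bitmap by PFN), the offset arithmetic can be sketched roughly as below.
The layouts are assumed to be the ones documented in
Documentation/admin-guide/mm/pagemap.rst and idle_page_tracking.rst:
64-bit pagemap entries with the PFN in bits 0-54 and a "present" flag in
bit 63, and an idle bitmap made of 64-bit words with one bit per PFN.
The helper names are hypothetical; they are not part of this patch or
of the wss tools.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout constants, per the kernel's pagemap and idle page docs. */
#define PAGEMAP_ENTRY_BYTES 8u                   /* one u64 per virtual page */
#define PAGEMAP_PRESENT     (1ULL << 63)         /* page is present in RAM   */
#define PAGEMAP_PFN_MASK    ((1ULL << 55) - 1)   /* bits 0-54 hold the PFN   */
#define IDLE_WORD_BYTES     8u                   /* bitmap is an array of u64 */
#define IDLE_BITS_PER_WORD  64u

/* Hypothetical helper: file offset in /proc/PID/pagemap for a virtual address. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	assert(page_size != 0);
	return vaddr / page_size * PAGEMAP_ENTRY_BYTES;
}

/* Hypothetical helper: extract the PFN; returns 0 if the page is not present. */
static int pagemap_entry_to_pfn(uint64_t entry, uint64_t *pfn)
{
	if (!(entry & PAGEMAP_PRESENT))
		return 0;
	*pfn = entry & PAGEMAP_PFN_MASK;
	return 1;
}

/* Hypothetical helper: file offset of the 64-bit bitmap word covering a PFN. */
static uint64_t idle_bitmap_offset(uint64_t pfn)
{
	return pfn / IDLE_BITS_PER_WORD * IDLE_WORD_BYTES;
}

/* Hypothetical helper: bit position of a PFN inside its bitmap word. */
static unsigned int idle_bitmap_bit(uint64_t pfn)
{
	return (unsigned int)(pfn % IDLE_BITS_PER_WORD);
}
```

Since neighbouring virtual pages can map to unrelated PFNs, each page
tends to cost one pagemap read plus one seek and read at an unrelated
bitmap offset. A bitmap indexed by virtual address removes the first
step and makes the second one sequential.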