* [RFC][PATCH 0/5] introduce /proc/PID/idle_bitmap
@ 2018-09-01 11:28 ` Fengguang Wu
  0 siblings, 0 replies; 20+ messages in thread
From: Fengguang Wu @ 2018-09-01 11:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, kvm, Peng DongX, Liu Jingqi,
	Dong Eddie, Dave Hansen, Huang Ying, Brendan Gregg, Fengguang Wu,
	LKML

This new /proc/PID/idle_bitmap interface aims to complement the current global
/sys/kernel/mm/page_idle/bitmap, enabling efficient user space driven migrations.

The pros and cons are discussed in the changelog of "[PATCH] proc: introduce
/proc/PID/idle_bitmap". The driving force is to improve efficiency by 10+
times, so that hot/cold page tracking can be done at regular intervals in
user space without too much overhead, making it possible for a user space
daemon to do regular page migration between NUMA nodes of different speeds.

Note it's not about NUMA migration between local and remote nodes -- we already
have NUMA balancing for that. This interface and the user space migration
daemon target NUMA nodes made of different media -- i.e. DIMM and NVDIMM(*) --
with larger performance gaps. The basic policy will be "move hot pages to DIMM;
cold pages to NVDIMM".

Since NVDIMM sizes can easily reach several terabytes, working set tracking
efficiency will matter and be challenging.

(*) Here we use persistent memory (PMEM) without relying on its persistence.
Persistence is good to have, but it requires modifying applications. Upcoming
NVDIMM products like Intel Apache Pass (AEP) will be more cost- and
energy-effective than DRAM, but slower. Merely using it in the form of a NUMA
memory node could immediately benefit many workloads: for example, warm but
not hot apps, workloads with a sharp hot/cold page distribution (good for
migration), or ones that rely more on memory size than on latency and
bandwidth and do more reads than writes.

This is an early RFC version to collect feedback. It's complete enough to demo
the basic ideas and performance, but it is not usable yet.

Regards,
Fengguang

* [RFC][PATCH 1/5] [PATCH 1/5] kvm: register in task_struct
  2018-09-01 11:28 ` Fengguang Wu
@ 2018-09-01 11:28   ` Fengguang Wu
  -1 siblings, 0 replies; 20+ messages in thread
From: Fengguang Wu @ 2018-09-01 11:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fengguang Wu, Peng DongX,
	Liu Jingqi, Dong Eddie, Dave Hansen, Huang Ying, Brendan Gregg,
	kvm, LKML

[-- Attachment #1: 0001-kvm-register-in-task_struct.patch --]
[-- Type: text/plain, Size: 1976 bytes --]

The added pointer will be used by the /proc/PID/idle_bitmap code to
quickly identify the QEMU task and walk EPT/NPT accordingly. For virtual
machines, the A bits will be set in the guest page tables and in EPT/NPT,
rather than in the QEMU task page table.

This costs 8 bytes in task_struct, which could be wasteful for the
majority of normal tasks. The alternative is to add only a flag, and
let the code find the corresponding VM in the kvm vm_list.
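
For reference, that flag-only alternative could look roughly like the sketch
below, scanning the global vm_list under kvm_lock; the flag check itself is
omitted since no such flag exists yet.

	/* Sketch: resolve the VM by PID instead of caching a pointer
	 * in task_struct. */
	static struct kvm *task_kvm_slow(struct task_struct *t)
	{
		struct kvm *kvm, *ret = NULL;

		spin_lock(&kvm_lock);
		list_for_each_entry(kvm, &vm_list, vm_list) {
			if (kvm->userspace_pid == task_pid_nr(t)) {
				ret = kvm;
				break;
			}
		}
		spin_unlock(&kvm_lock);

		return ret;
	}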

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 include/linux/sched.h | 10 ++++++++++
 virt/kvm/kvm_main.c   |  1 +
 2 files changed, 11 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 43731fe51c97..26c8549bbc28 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -38,6 +38,7 @@ struct cfs_rq;
 struct fs_struct;
 struct futex_pi_state;
 struct io_context;
+struct kvm;
 struct mempolicy;
 struct nameidata;
 struct nsproxy;
@@ -1179,6 +1180,9 @@ struct task_struct {
 	/* Used by LSM modules for access restriction: */
 	void				*security;
 #endif
+#if IS_ENABLED(CONFIG_KVM)
+        struct kvm                      *kvm;
+#endif
 
 	/*
 	 * New fields for task_struct should be added above here, so that
@@ -1898,4 +1902,10 @@ static inline void rseq_syscall(struct pt_regs *regs)
 
 #endif
 
+#if IS_ENABLED(CONFIG_KVM)
+static inline struct kvm *task_kvm(struct task_struct *t) { return t->kvm; }
+#else
+static inline struct kvm *task_kvm(struct task_struct *t) { return NULL; }
+#endif
+
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b47507faab5..0c483720de8d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3892,6 +3892,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm)
 	if (type == KVM_EVENT_CREATE_VM) {
 		add_uevent_var(env, "EVENT=create");
 		kvm->userspace_pid = task_pid_nr(current);
+		current->kvm = kvm;
 	} else if (type == KVM_EVENT_DESTROY_VM) {
 		add_uevent_var(env, "EVENT=destroy");
 	}
-- 
2.15.0

* [RFC][PATCH 2/5] [PATCH 2/5] proc: introduce /proc/PID/idle_bitmap
  2018-09-01 11:28 ` Fengguang Wu
@ 2018-09-01 11:28   ` Fengguang Wu
  -1 siblings, 0 replies; 20+ messages in thread
From: Fengguang Wu @ 2018-09-01 11:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Huang Ying, Brendan Gregg,
	Fengguang Wu, Peng DongX, Liu Jingqi, Dong Eddie, Dave Hansen,
	kvm, LKML

[-- Attachment #1: 0002-proc-introduce-proc-PID-idle_bitmap.patch --]
[-- Type: text/plain, Size: 5116 bytes --]

This will be similar to /sys/kernel/mm/page_idle/bitmap documented in
Documentation/admin-guide/mm/idle_page_tracking.rst, but indexed by
process virtual address.

When using the global PFN-indexed idle bitmap, we found two kinds of
overhead:

- to track a task's working set, Brendan Gregg ended up writing wss-v1
  for small tasks and wss-v2 for large tasks:

  https://github.com/brendangregg/wss

  That's because VAs may point to random PAs throughout the physical
  address space. So we either query /proc/pid/pagemap first and then access
  lots of random PFNs (with lots of syscalls) in the bitmap, or
  write+read the whole system idle bitmap beforehand.

- walking page tables by PFN has much more overhead than walking a
  page table in its natural order:
  - rmap queries
  - more locking
  - random memory reads/writes

This interface provides a cheap path for the majority of non-shared mapped
pages. Walking 1TB of memory in 4k active pages costs 2s vs 15s of system
time with the per-task vs the global idle bitmap -- a ~7x speedup.
The gap widens further if we consider

- the extra /proc/pid/pagemap walk
- that natural page table walks can skip a whole 512 PTEs when the PMD is
  idle (one PMD A-bit check covers 2MB worth of PTEs)

OTOH, the per-task idle bitmap is not suitable in some situations:

- it's not accurate for shared pages
- it doesn't work with non-mapped file pages
- it doesn't perform well for sparse page tables (pointed out by Huang Ying)

So it's more about complementing the existing global idle bitmap.

CC: Huang Ying <ying.huang@intel.com>
CC: Brendan Gregg <bgregg@netflix.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/proc/base.c     |  2 ++
 fs/proc/internal.h |  1 +
 fs/proc/task_mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index aaffc0c30216..d81322b5b8d2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2942,6 +2942,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+	REG("idle_bitmap", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
@@ -3327,6 +3328,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("smaps",     S_IRUGO, proc_tid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+	REG("idle_bitmap", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",      S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index da3dbfa09e79..732a502acc27 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -305,6 +305,7 @@ extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_tid_smaps_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_mm_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfd73a4616ce..376406a9cf45 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1564,6 +1564,69 @@ const struct file_operations proc_pagemap_operations = {
 	.open		= pagemap_open,
 	.release	= pagemap_release,
 };
+
+/* will be filled when kvm_ept_idle module loads */
+struct file_operations proc_ept_idle_operations = {
+};
+EXPORT_SYMBOL_GPL(proc_ept_idle_operations);
+
+static ssize_t mm_idle_read(struct file *file, char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	struct task_struct *task = file->private_data;
+	ssize_t ret = -ESRCH;
+
+	// TODO: implement mm_walk for normal tasks
+
+	if (task_kvm(task)) {
+		if (proc_ept_idle_operations.read)
+			return proc_ept_idle_operations.read(file, buf, count, ppos);
+	}
+
+	return ret;
+}
+
+
+static int mm_idle_open(struct inode *inode, struct file *file)
+{
+	struct task_struct *task = get_proc_task(inode);
+
+	if (!task)
+		return -ESRCH;
+
+	file->private_data = task;
+
+	if (task_kvm(task)) {
+		if (proc_ept_idle_operations.open)
+			return proc_ept_idle_operations.open(inode, file);
+	}
+
+	return 0;
+}
+
+static int mm_idle_release(struct inode *inode, struct file *file)
+{
+	struct task_struct *task = file->private_data;
+
+	if (!task)
+		return 0;
+
+	if (task_kvm(task)) {
+		if (proc_ept_idle_operations.release)
+			return proc_ept_idle_operations.release(inode, file);
+	}
+
+	put_task_struct(task);
+	return 0;
+}
+
+const struct file_operations proc_mm_idle_operations = {
+	.llseek		= mem_lseek, /* borrow this */
+	.read		= mm_idle_read,
+	.open		= mm_idle_open,
+	.release	= mm_idle_release,
+};
+
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
-- 
2.15.0


* [RFC][PATCH 3/5] [PATCH 3/5] kvm-ept-idle: HVA indexed EPT read
  2018-09-01 11:28 ` Fengguang Wu
@ 2018-09-01 11:28   ` Fengguang Wu
  -1 siblings, 0 replies; 20+ messages in thread
From: Fengguang Wu @ 2018-09-01 11:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Peng DongX, Fengguang Wu,
	Liu Jingqi, Dong Eddie, Dave Hansen, Huang Ying, Brendan Gregg,
	kvm, LKML

[-- Attachment #1: 0003-kvm-ept-idle-HVA-indexed-EPT-read.patch --]
[-- Type: text/plain, Size: 4587 bytes --]

For virtual machines, "accessed" bits will be set in the guest page tables
and in EPT/NPT. So for a qemu-kvm process, we convert HVA to GFN to GPA, then
do EPT/NPT walks. Thanks to the linear in-memslot HVA-GPA mapping, the
conversion can be done efficiently, outside of the page table walk loops.
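
For reference, the in-memslot mapping being linear means the HVA-to-GFN step
reduces to simple arithmetic (this mirrors what hva_to_gfn_memslot() already
does):

	gfn_t gfn = memslot->base_gfn +
		    ((hva - memslot->userspace_addr) >> PAGE_SHIFT);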

In this manner, we provide a uniform interface for both virtual machines and
normal processes.

The use scenario would be per-task/VM working set tracking and migration,
which is very convenient for applying task/vma and VM granularity policies.

Signed-off-by: Peng DongX <dongx.peng@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/kvm/ept_idle.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/ept_idle.h |  24 ++++++++++
 2 files changed, 142 insertions(+)
 create mode 100644 arch/x86/kvm/ept_idle.c
 create mode 100644 arch/x86/kvm/ept_idle.h

diff --git a/arch/x86/kvm/ept_idle.c b/arch/x86/kvm/ept_idle.c
new file mode 100644
index 000000000000..5b97dd01011b
--- /dev/null
+++ b/arch/x86/kvm/ept_idle.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/proc_fs.h>
+#include <linux/uaccess.h>
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/bitmap.h>
+
+#include "ept_idle.h"
+
+
+// mindless copy from kvm_handle_hva_range().
+// TODO: handle order and hole.
+static int ept_idle_walk_hva_range(struct ept_idle_ctrl *eic,
+				   unsigned long start,
+				   unsigned long end)
+{
+	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
+	int ret = 0;
+
+	slots = kvm_memslots(eic->kvm);
+	kvm_for_each_memslot(memslot, slots) {
+		unsigned long hva_start, hva_end;
+		gfn_t gfn_start, gfn_end;
+
+		hva_start = max(start, memslot->userspace_addr);
+		hva_end = min(end, memslot->userspace_addr +
+			      (memslot->npages << PAGE_SHIFT));
+		if (hva_start >= hva_end)
+			continue;
+		/*
+		 * {gfn(page) | page intersects with [hva_start, hva_end)} =
+		 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
+		 */
+		gfn_start = hva_to_gfn_memslot(hva_start, memslot);
+		gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
+
+		ret = ept_idle_walk_gfn_range(eic, gfn_start, gfn_end);
+		if (ret)
+			return ret;
+	}
+
+	return ret;
+}
+
+static ssize_t ept_idle_read(struct file *file, char *buf,
+			     size_t count, loff_t *ppos)
+{
+	struct task_struct *task = file->private_data;
+	struct ept_idle_ctrl *eic;
+	unsigned long hva_start = *ppos << BITMAP_BYTE2PVA_SHIFT;
+	unsigned long hva_end = hva_start + (count << BITMAP_BYTE2PVA_SHIFT);
+	int ret;
+
+	if (*ppos % IDLE_BITMAP_CHUNK_SIZE ||
+	    count % IDLE_BITMAP_CHUNK_SIZE)
+		return -EINVAL;
+
+	eic = kzalloc(sizeof(*eic), GFP_KERNEL);
+	if (!eic)
+		return -EBUSY;
+
+	eic->buf = buf;
+	eic->buf_size = count;
+	eic->kvm = task_kvm(task);
+	if (!eic->kvm) {
+		ret = -EINVAL;
+		goto out_free;
+	}
+
+	ret = ept_idle_walk_hva_range(eic, hva_start, hva_end);
+	if (ret)
+		goto out_free;
+
+	ret = eic->bytes_copied;
+	*ppos += ret;
+out_free:
+	kfree(eic);
+
+	return ret;
+}
+
+static int ept_idle_open(struct inode *inode, struct file *file)
+{
+	if (!try_module_get(THIS_MODULE))
+		return -EBUSY;
+
+	return 0;
+}
+
+static int ept_idle_release(struct inode *inode, struct file *file)
+{
+	module_put(THIS_MODULE);
+	return 0;
+}
+
+extern struct file_operations proc_ept_idle_operations;
+
+static int ept_idle_entry(void)
+{
+	proc_ept_idle_operations.owner = THIS_MODULE;
+	proc_ept_idle_operations.read = ept_idle_read;
+	proc_ept_idle_operations.open = ept_idle_open;
+	proc_ept_idle_operations.release = ept_idle_release;
+
+	return 0;
+}
+
+static void ept_idle_exit(void)
+{
+	memset(&proc_ept_idle_operations, 0, sizeof(proc_ept_idle_operations));
+}
+
+MODULE_LICENSE("GPL");
+module_init(ept_idle_entry);
+module_exit(ept_idle_exit);
diff --git a/arch/x86/kvm/ept_idle.h b/arch/x86/kvm/ept_idle.h
new file mode 100644
index 000000000000..e0b9dcecf50b
--- /dev/null
+++ b/arch/x86/kvm/ept_idle.h
@@ -0,0 +1,24 @@
+#ifndef _EPT_IDLE_H
+#define _EPT_IDLE_H
+
+#define IDLE_BITMAP_CHUNK_SIZE	sizeof(u64)
+#define IDLE_BITMAP_CHUNK_BITS	(IDLE_BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
+
+#define BITMAP_BYTE2PVA_SHIFT  (3 + PAGE_SHIFT)
+
+#define EPT_IDLE_KBUF_FULL 1
+#define EPT_IDLE_KBUF_BYTES 8000
+#define EPT_IDLE_KBUF_BITS  (EPT_IDLE_KBUF_BYTES * 8)
+
+struct ept_idle_ctrl {
+	struct kvm *kvm;
+
+	u64 kbuf[EPT_IDLE_KBUF_BITS / IDLE_BITMAP_CHUNK_BITS];
+	int bits_read;
+
+	void __user *buf;
+	int buf_size;
+	int bytes_copied;
+};
+
+#endif
-- 
2.15.0


* [RFC][PATCH 4/5] [PATCH 4/5] kvm-ept-idle: EPT page table walk for A bits
  2018-09-01 11:28 ` Fengguang Wu
@ 2018-09-01 11:28   ` Fengguang Wu
  -1 siblings, 0 replies; 20+ messages in thread
From: Fengguang Wu @ 2018-09-01 11:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Dave Hansen, Fengguang Wu,
	Peng DongX, Liu Jingqi, Dong Eddie, Huang Ying, Brendan Gregg,
	kvm, LKML

[-- Attachment #1: 0004-kvm-ept-idle-EPT-page-table-walk-for-A-bits.patch --]
[-- Type: text/plain, Size: 9003 bytes --]

This borrows the host page table walk macros/functions to do the EPT walk,
so it depends on both using the same levels.

Dave Hansen raised the concern that the hottest pages may be cached in the TLB
and thus may not frequently get their accessed bits set. The solution would be
to invalidate the TLB for the mm being walked whenever one round of scanning
finishes.
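
(A sketch of that solution, assuming kvm_flush_remote_tlbs() is the right
hammer for the EPT case -- it is not part of this patch:)

	/* hypothetical: run after each full scan round over the VM,
	 * so hot pages must set their A bits again on next access */
	kvm_flush_remote_tlbs(eic->kvm);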

Warning: read() also clears the accessed bit, in order to avoid one more
page table walk for write(). That may not be desirable for some use cases, so
we could avoid clearing the accessed bit when the file is opened read-only.
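
(A sketch of that gating, with a hypothetical eic->clear_accessed field that
would be set from file->f_mode & FMODE_WRITE at open time:)

	/* in ept_pte_range(): only scrub the A bit with write intent */
	if (!ept_pte_present(*pte) || !ept_pte_accessed(*pte)) {
		idle = 1;
	} else {
		idle = 0;
		if (eic->clear_accessed)
			*pte = pte_clear_flags(*pte, _PAGE_EPT_ACCESSED);
	}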

The interface should be further improved to

1) report holes and huge pages in one go
2) represent huge pages and sparse page tables efficiently

(1) can be trivially fixed by extending the bitmap to more bits per PAGE_SIZE.

(2) would need fundamental changes to the interface. Existing solutions
for sparse files like SEEK_HOLE/SEEK_DATA and the FIEMAP ioctl may not serve
this situation well. The most efficient way could be to fill the user space
read() buffer with an array of small extents:

	struct idle_extent {
		unsigned type :  4;
		unsigned nr   :  4;
	};

where type can be one of

	4K_HOLE
	4K_IDLE
	4K_ACCESSED
	2M_HOLE
	2M_IDLE
	2M_ACCESSED
	1G_OR_LARGER_PAGE
	...

There can be up to 16 types, so more page sizes can be defined. The above names
are just to make the typical case easy to understand. It's also possible that
PAGE_SIZE is not 4K, or that a PMD represents 4M pages, in which case we would
change the type names to more suitable ones like PTE_HOLE and PMD_ACCESSED.
Since it's page table walking, user space had better know the exact page sizes:
both the accessed bits and page migration are tied to the real page size.

Anyone interested in adding PTE_DIRTY or more types?

The main problem with such an extent reporting interface is that the number of
bytes returned by read() (variable extents) will mismatch the advanced file
position (fixed VA indexes), which is not POSIX compliant. Simple cp/cat may
still work, as they don't lseek based on the read return value. If that's
really a concern, we may use ioctl() instead.
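
To make the extent format concrete, user space could decode the stream
roughly as below. This is a sketch only: the 1G type is omitted, and the
assumption that `nr` counts pages of the type's size is mine, since the
format above is just a proposal.

	enum ext_type { EXT_4K_HOLE, EXT_4K_IDLE, EXT_4K_ACCESSED,
			EXT_2M_HOLE, EXT_2M_IDLE, EXT_2M_ACCESSED };

	/* Walk n extents starting at virtual address va; returns the VA
	 * reached, so the caller can track the position/byte mismatch
	 * discussed above. */
	static unsigned long decode_extents(const struct idle_extent *ext,
					    int n, unsigned long va)
	{
		int i;

		for (i = 0; i < n; i++) {
			unsigned long size = ext[i].type >= EXT_2M_HOLE ?
					     (2UL << 20) : (4UL << 10);

			/* record [va, va + nr * size) as ext[i].type here */
			va += ext[i].nr * size;
		}
		return va;
	}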

CC: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/kvm/ept_idle.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/ept_idle.h |  55 +++++++++++++
 2 files changed, 266 insertions(+)

diff --git a/arch/x86/kvm/ept_idle.c b/arch/x86/kvm/ept_idle.c
index 5b97dd01011b..8a233ab8656d 100644
--- a/arch/x86/kvm/ept_idle.c
+++ b/arch/x86/kvm/ept_idle.c
@@ -9,6 +9,217 @@
 
 #include "ept_idle.h"
 
+static int add_to_idle_bitmap(struct ept_idle_ctrl *eic,
+			      int idle, unsigned long addr_range)
+{
+	int nbits = addr_range >> PAGE_SHIFT;
+	int bits_left = EPT_IDLE_KBUF_BITS - eic->bits_read;
+	int ret = 0;
+
+	if (nbits >= bits_left) {
+		ret = EPT_IDLE_KBUF_FULL;
+		nbits = bits_left;
+	}
+
+	// TODO: this assumes u64 == unsigned long
+	if (!idle)
+		__bitmap_clear((unsigned long *)eic->kbuf, eic->bits_read, nbits);
+	eic->bits_read += nbits;
+
+	return ret;
+}
+
+static int ept_pte_range(struct ept_idle_ctrl *eic,
+			 pmd_t *pmd, unsigned long addr, unsigned long end)
+{
+	pte_t *pte;
+	int err = 0;
+	int idle;
+
+	pte = pte_offset_kernel(pmd, addr);
+	do {
+		if (!ept_pte_present(*pte) ||
+		    !ept_pte_accessed(*pte))
+			idle = 1;
+		else {
+			idle = 0;
+			pte_clear_flags(*pte, _PAGE_EPT_ACCESSED);
+		}
+
+		err = add_to_idle_bitmap(eic, idle, PAGE_SIZE);
+		if (err)
+			break;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	return err;
+}
+
+static int ept_pmd_range(struct ept_idle_ctrl *eic,
+			 pud_t *pud, unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+	int err = 0;
+	int idle;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		idle = -1;
+		if (!ept_pmd_present(*pmd) ||
+		    !ept_pmd_accessed(*pmd)) {
+			idle = 1;
+		} else if (pmd_large(*pmd)) {
+			idle = 0;
+			pmd_clear_flags(*pmd, _PAGE_EPT_ACCESSED);
+		}
+		if (idle >= 0)
+			err = add_to_idle_bitmap(eic, idle, next - addr);
+		else
+			err = ept_pte_range(eic, pmd, addr, next);
+		if (err)
+			break;
+	} while (pmd++, addr = next, addr != end);
+
+	return err;
+}
+
+static int ept_pud_range(struct ept_idle_ctrl *eic,
+			 p4d_t *p4d, unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+	int err = 0;
+	int idle;
+
+	pud = pud_offset(p4d, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		idle = -1;
+		if (!ept_pud_present(*pud) ||
+		    !ept_pud_accessed(*pud)) {
+			idle = 1;
+		} else if (pud_large(*pud)) {
+			idle = 0;
+			pud_clear_flags(*pud, _PAGE_EPT_ACCESSED);
+		}
+		if (idle >= 0)
+			err = add_to_idle_bitmap(eic, idle, next - addr);
+		else
+			err = ept_pmd_range(eic, pud, addr, next);
+		if (err)
+			break;
+	} while (pud++, addr = next, addr != end);
+
+	return err;
+}
+
+static int ept_p4d_range(struct ept_idle_ctrl *eic,
+			 pgd_t *pgd, unsigned long addr, unsigned long end)
+{
+	p4d_t *p4d;
+	unsigned long next;
+	int err = 0;
+
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (!ept_p4d_present(*p4d))
+			err = add_to_idle_bitmap(eic, 1, next - addr);
+		else
+			err = ept_pud_range(eic, p4d, addr, next);
+		if (err)
+			break;
+	} while (p4d++, addr = next, addr != end);
+
+	return err;
+}
+
+static int ept_page_range(struct ept_idle_ctrl *eic,
+			  pgd_t *ept_root, unsigned long addr, unsigned long end)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	int err = 0;
+
+	BUG_ON(addr >= end);
+	pgd = pgd_offset_pgd(ept_root, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (!ept_pgd_present(*pgd))
+			err = add_to_idle_bitmap(eic, 1, next - addr);
+		else
+			err = ept_p4d_range(eic, pgd, addr, next);
+		if (err)
+			break;
+	} while (pgd++, addr = next, addr != end);
+
+	return err;
+}
+
+static void init_ept_idle_ctrl_buffer(struct ept_idle_ctrl *eic)
+{
+	eic->bits_read = 0;
+	memset(eic->kbuf, 0xff, EPT_IDLE_KBUF_BYTES);
+}
+
+static int ept_idle_walk_gfn_range(struct ept_idle_ctrl *eic,
+				   unsigned long gfn_start,
+				   unsigned long gfn_end)
+{
+	struct kvm_vcpu *vcpu = kvm_get_vcpu(eic->kvm, 0);
+	struct kvm_mmu *mmu = &vcpu->arch.mmu;
+	pgd_t *ept_root;
+	unsigned long gpa_start = gfn_to_gpa(gfn_start);
+	unsigned long gpa_end = gfn_to_gpa(gfn_end);
+	unsigned long gpa_addr;
+	int bytes_read;
+	int ret = -EINVAL;
+
+	init_ept_idle_ctrl_buffer(eic);
+
+	spin_lock(&eic->kvm->mmu_lock);
+	if (mmu->base_role.ad_disabled) {
+		printk(KERN_NOTICE "CPU does not support EPT A/D bits tracking\n");
+		goto out_unlock;
+	}
+
+	if (mmu->shadow_root_level != 4 + (!!pgtable_l5_enabled())) {
+		printk(KERN_NOTICE "Unsupported EPT level %d\n", mmu->shadow_root_level);
+		goto out_unlock;
+	}
+
+	if (!VALID_PAGE(mmu->root_hpa)) {
+		goto out_unlock;
+	}
+
+	ept_root = __va(mmu->root_hpa);
+
+	for (gpa_addr = gpa_start;
+	     gpa_addr < gpa_end;
+	     gpa_addr += EPT_IDLE_KBUF_BITS << PAGE_SHIFT) {
+		ept_page_range(eic, ept_root, gpa_addr, gpa_end);
+		spin_unlock(&eic->kvm->mmu_lock);
+
+		bytes_read = eic->bits_read / 8;
+
+		ret = copy_to_user(eic->buf, eic->kbuf, bytes_read);
+		if (ret)
+			return -EFAULT;
+
+		eic->buf += bytes_read;
+		eic->bytes_copied += bytes_read;
+		if (eic->bytes_copied >= eic->buf_size)
+			return 0;
+
+		init_ept_idle_ctrl_buffer(eic);
+		cond_resched();
+		spin_lock(&eic->kvm->mmu_lock);
+	}
+out_unlock:
+	spin_unlock(&eic->kvm->mmu_lock);
+	return ret;
+}
 
 // mindless copy from kvm_handle_hva_range().
 // TODO: handle order and hole.
diff --git a/arch/x86/kvm/ept_idle.h b/arch/x86/kvm/ept_idle.h
index e0b9dcecf50b..b715ec7b7513 100644
--- a/arch/x86/kvm/ept_idle.h
+++ b/arch/x86/kvm/ept_idle.h
@@ -1,6 +1,61 @@
 #ifndef _EPT_IDLE_H
 #define _EPT_IDLE_H
 
+#define _PAGE_BIT_EPT_ACCESSED 8
+#define _PAGE_EPT_ACCESSED	(_AT(pteval_t, 1) << _PAGE_BIT_EPT_ACCESSED)
+
+#define _PAGE_EPT_PRESENT	(_AT(pteval_t, 7))
+
+static inline int ept_pte_present(pte_t a)
+{
+	return pte_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pmd_present(pmd_t a)
+{
+	return pmd_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pud_present(pud_t a)
+{
+	return pud_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_p4d_present(p4d_t a)
+{
+	return p4d_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pgd_present(pgd_t a)
+{
+	return pgd_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pte_accessed(pte_t a)
+{
+	return pte_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_pmd_accessed(pmd_t a)
+{
+	return pmd_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_pud_accessed(pud_t a)
+{
+	return pud_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_p4d_accessed(p4d_t a)
+{
+	return p4d_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_pgd_accessed(pgd_t a)
+{
+	return pgd_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
 #define IDLE_BITMAP_CHUNK_SIZE	sizeof(u64)
 #define IDLE_BITMAP_CHUNK_BITS	(IDLE_BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
 
-- 
2.15.0


* [RFC][PATCH 5/5] [PATCH 5/5] kvm-ept-idle: enable module
  2018-09-01 11:28 ` Fengguang Wu
@ 2018-09-01 11:28   ` Fengguang Wu
  -1 siblings, 0 replies; 20+ messages in thread
From: Fengguang Wu @ 2018-09-01 11:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Fengguang Wu, Peng DongX,
	Liu Jingqi, Dong Eddie, Dave Hansen, Huang Ying, Brendan Gregg,
	kvm, LKML

[-- Attachment #1: 0005-kvm-ept-idle-enable-module.patch --]
[-- Type: text/plain, Size: 1445 bytes --]

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/kvm/Kconfig  | 11 +++++++++++
 arch/x86/kvm/Makefile |  4 ++++
 2 files changed, 15 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 1bbec387d289..4c6dec47fac6 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,17 @@ config KVM_MMU_AUDIT
 	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
 	 auditing of KVM MMU events at runtime.
 
+config KVM_EPT_IDLE
+	tristate "KVM EPT idle page tracking"
+	depends on KVM_INTEL
+	depends on PROC_PAGE_MONITOR
+	---help---
+	  Provides support for walking EPT to get the A bits on Intel
+	  processors equipped with the VT extensions.
+
+	  To compile this as a module, choose M here: the module
+	  will be called kvm-ept-idle.
+
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
 source drivers/vhost/Kconfig
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index dc4f2fdf5e57..5cad0590205d 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -19,6 +19,10 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 kvm-intel-y		+= vmx.o pmu_intel.o
 kvm-amd-y		+= svm.o pmu_amd.o
 
+kvm-ept-idle-y		+= ept_idle.o
+
 obj-$(CONFIG_KVM)	+= kvm.o
 obj-$(CONFIG_KVM_INTEL)	+= kvm-intel.o
 obj-$(CONFIG_KVM_AMD)	+= kvm-amd.o
+
+obj-$(CONFIG_KVM_EPT_IDLE)	+= kvm-ept-idle.o
-- 
2.15.0


* Re: [RFC][PATCH 0/5] introduce /proc/PID/idle_bitmap
  2018-09-01 11:28 ` Fengguang Wu
                   ` (5 preceding siblings ...)
  (?)
@ 2018-09-02  8:24 ` Fengguang Wu
  -1 siblings, 0 replies; 20+ messages in thread
From: Fengguang Wu @ 2018-09-02  8:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, kvm, Peng DongX, Liu Jingqi,
	Dong Eddie, Dave Hansen, Huang Ying, Brendan Gregg, LKML

Here is the diffstat:

 arch/x86/kvm/Kconfig    |   11 +
 arch/x86/kvm/Makefile   |    4
 arch/x86/kvm/ept_idle.c |  329 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/ept_idle.h |   79 +++++++++
 fs/proc/base.c          |    2
 fs/proc/internal.h      |    1
 fs/proc/task_mmu.c      |   63 +++++++
 include/linux/sched.h   |   10 +
 virt/kvm/kvm_main.c     |    1
 9 files changed, 500 insertions(+)

Regards,
Fengguang



* Re: [RFC][PATCH 3/5] [PATCH 3/5] kvm-ept-idle: HVA indexed EPT read
  2018-09-01 11:28   ` Fengguang Wu
  (?)
@ 2018-09-04  7:57   ` Nikita Leshenko
  2018-09-04  8:12     ` Peng, DongX
  -1 siblings, 1 reply; 20+ messages in thread
From: Nikita Leshenko @ 2018-09-04  7:57 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, Peng DongX,
	Liu Jingqi, Dong Eddie, Dave Hansen, Huang Ying, Brendan Gregg,
	kvm, LKML

On 1 Sep 2018, at 13:28, Fengguang Wu <fengguang.wu@intel.com> wrote:
> +static ssize_t ept_idle_read(struct file *file, char *buf,
> +			     size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = file->private_data;
> +	struct ept_idle_ctrl *eic;
> +	unsigned long hva_start = *ppos << BITMAP_BYTE2PVA_SHIFT;
> +	unsigned long hva_end = hva_start + (count << BITMAP_BYTE2PVA_SHIFT);
> +	int ret;
> +
> +	if (*ppos % IDLE_BITMAP_CHUNK_SIZE ||
> +	    count % IDLE_BITMAP_CHUNK_SIZE)
> +		return -EINVAL;
> +
> +	eic = kzalloc(sizeof(*eic), GFP_KERNEL);
> +	if (!eic)
> +		return -EBUSY;
> +
> +	eic->buf = buf;
> +	eic->buf_size = count;
> +	eic->kvm = task_kvm(task);
> +	if (!eic->kvm) {
> +		ret = -EINVAL;
> +		goto out_free;
> +	}
I think you need to increment the refcount while using kvm,
otherwise kvm can be destroyed from another thread while you're
walking it.

-Nikita
> +
> +	ret = ept_idle_walk_hva_range(eic, hva_start, hva_end);
> +	if (ret)
> +		goto out_free;
> +
> +	ret = eic->bytes_copied;
> +	*ppos += ret;
> +out_free:
> +	kfree(eic);
> +
> +	return ret;
> +}


* RE: [RFC][PATCH 3/5] [PATCH 3/5] kvm-ept-idle: HVA indexed EPT read
  2018-09-04  7:57   ` Nikita Leshenko
@ 2018-09-04  8:12     ` Peng, DongX
  2018-09-04  8:15       ` Fengguang Wu
  0 siblings, 1 reply; 20+ messages in thread
From: Peng, DongX @ 2018-09-04  8:12 UTC (permalink / raw)
  To: Nikita Leshenko, Wu, Fengguang
  Cc: Andrew Morton, Linux Memory Management List, Liu, Jingqi, Dong,
	Eddie, Hansen, Dave, Huang, Ying, Brendan Gregg, kvm, LKML

kvm_get_kvm() kvm_put_kvm()

-----Original Message-----
From: Nikita Leshenko [mailto:nikita.leshchenko@oracle.com]
Sent: Tuesday, September 4, 2018 3:57 PM
Subject: Re: [RFC][PATCH 3/5] [PATCH 3/5] kvm-ept-idle: HVA indexed EPT read

[...]
I think you need to increment the refcount while using kvm, otherwise kvm can be destroyed from another thread while you're walking it.

-Nikita

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 3/5] [PATCH 3/5] kvm-ept-idle: HVA indexed EPT read
  2018-09-04  8:12     ` Peng, DongX
@ 2018-09-04  8:15       ` Fengguang Wu
  0 siblings, 0 replies; 20+ messages in thread
From: Fengguang Wu @ 2018-09-04  8:15 UTC (permalink / raw)
  To: Peng, DongX
  Cc: Nikita Leshenko, Andrew Morton, Linux Memory Management List,
	Liu, Jingqi, Dong, Eddie, Hansen, Dave, Huang, Ying,
	Brendan Gregg, kvm, LKML

Yeah, thanks! We are currently restructuring the related functions and
will add these calls once the walk-order and hole issues are sorted out.

Thanks,
Fengguang

On Tue, Sep 04, 2018 at 04:12:00PM +0800, Peng Dong wrote:
>kvm_get_kvm() kvm_put_kvm()

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 2/5] [PATCH 2/5] proc: introduce /proc/PID/idle_bitmap
  2018-09-01 11:28   ` Fengguang Wu
  (?)
@ 2018-09-04 19:02   ` Sean Christopherson
  -1 siblings, 0 replies; 20+ messages in thread
From: Sean Christopherson @ 2018-09-04 19:02 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, Huang Ying,
	Brendan Gregg, Peng DongX, Liu Jingqi, Dong Eddie, Dave Hansen,
	kvm, LKML

On Sat, Sep 01, 2018 at 07:28:20PM +0800, Fengguang Wu wrote:
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index da3dbfa09e79..732a502acc27 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -305,6 +305,7 @@ extern const struct file_operations proc_pid_smaps_rollup_operations;
>  extern const struct file_operations proc_tid_smaps_operations;
>  extern const struct file_operations proc_clear_refs_operations;
>  extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_mm_idle_operations;
>  
>  extern unsigned long task_vsize(struct mm_struct *);
>  extern unsigned long task_statm(struct mm_struct *,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index dfd73a4616ce..376406a9cf45 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1564,6 +1564,69 @@ const struct file_operations proc_pagemap_operations = {
>  	.open		= pagemap_open,
>  	.release	= pagemap_release,
>  };
> +
> +/* will be filled when kvm_ept_idle module loads */
> +struct file_operations proc_ept_idle_operations = {
> +};
> +EXPORT_SYMBOL_GPL(proc_ept_idle_operations);

Exposing EPT outside of VMX-specific code is wrong; e.g., this should
be something like proc_kvm_idle_operations.  This is a common theme
for all of the patches.  Only the low-level bits that are EPT-specific
should be named as such; everything else should be encapsulated via
KVM or some other appropriate name.

> +static ssize_t mm_idle_read(struct file *file, char __user *buf,
> +			    size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = file->private_data;
> +	ssize_t ret = -ESRCH;

No need for @ret, just return the error directly at the end.  And
-ESRCH isn't appropriate for a task that exists but doesn't have an
associated KVM object.

> +
> +	// TODO: implement mm_walk for normal tasks
> +
> +	if (task_kvm(task)) {
> +		if (proc_ept_idle_operations.read)
> +			return proc_ept_idle_operations.read(file, buf, count, ppos);
> +	}

Condensing the task_kvm and ops check into a single if saves two lines
per instance, e.g.:

	if (task_kvm(task) && proc_ept_idle_operations.read)
		return proc_ept_idle_operations.read(file, buf, count, ppos);
> +
> +	return ret;
> +}
> +
> +
> +static int mm_idle_open(struct inode *inode, struct file *file)
> +{
> +	struct task_struct *task = get_proc_task(inode);
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	file->private_data = task;
> +
> +	if (task_kvm(task)) {
> +		if (proc_ept_idle_operations.open)
> +			return proc_ept_idle_operations.open(inode, file);
> +	}
> +
> +	return 0;
> +}
> +
> +static int mm_idle_release(struct inode *inode, struct file *file)
> +{
> +	struct task_struct *task = file->private_data;
> +
> +	if (!task)
> +		return 0;
> +
> +	if (task_kvm(task)) {
> +		if (proc_ept_idle_operations.release)
> +			return proc_ept_idle_operations.release(inode, file);
> +	}
> +
> +	put_task_struct(task);
> +	return 0;
> +}
> +
> +const struct file_operations proc_mm_idle_operations = {
> +	.llseek		= mem_lseek, /* borrow this */
> +	.read		= mm_idle_read,
> +	.open		= mm_idle_open,
> +	.release	= mm_idle_release,
> +};
> +
>  #endif /* CONFIG_PROC_PAGE_MONITOR */
>  
>  #ifdef CONFIG_NUMA
> -- 
> 2.15.0
> 
> 
> 
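
Folding both of Sean's points into the handler gives a sketch like the
following (the -ENOENT return is a placeholder; Sean only says -ESRCH
is inappropriate here, not which errno to use instead):

static ssize_t mm_idle_read(struct file *file, char __user *buf,
			    size_t count, loff_t *ppos)
{
	struct task_struct *task = file->private_data;

	/* TODO: implement mm_walk for normal tasks */

	if (task_kvm(task) && proc_ept_idle_operations.read)
		return proc_ept_idle_operations.read(file, buf, count, ppos);

	return -ENOENT;	/* task exists, but has no associated KVM object */
}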

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 5/5] [PATCH 5/5] kvm-ept-idle: enable module
  2018-09-01 11:28   ` Fengguang Wu
  (?)
@ 2018-09-04 19:14   ` Sean Christopherson
  -1 siblings, 0 replies; 20+ messages in thread
From: Sean Christopherson @ 2018-09-04 19:14 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Linux Memory Management List, Peng DongX,
	Liu Jingqi, Dong Eddie, Dave Hansen, Huang Ying, Brendan Gregg,
	kvm, LKML

On Sat, Sep 01, 2018 at 07:28:23PM +0800, Fengguang Wu wrote:
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  arch/x86/kvm/Kconfig  | 11 +++++++++++
>  arch/x86/kvm/Makefile |  4 ++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 1bbec387d289..4c6dec47fac6 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -96,6 +96,17 @@ config KVM_MMU_AUDIT
>  	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
>  	 auditing of KVM MMU events at runtime.
>  
> +config KVM_EPT_IDLE
> +	tristate "KVM EPT idle page tracking"

KVM_MMU_IDLE_INTEL might be a more user-friendly name; I doubt that
all Kconfig users would immediately associate EPT with Intel's
two-dimensional paging.  And it meshes nicely with KVM_MMU_IDLE as a
base name if we ever want to move common functionality to its own
module, as well as with all of the other KVM_MMU_* nomenclature.

> +	depends on KVM_INTEL
> +	depends on PROC_PAGE_MONITOR
> +	---help---
> +	  Provides support for walking EPT to get the A bits on Intel
> +	  processors equipped with the VT extensions.
> +
> +	  To compile this as a module, choose M here: the module
> +	  will be called kvm-ept-idle.
> +
>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>  # the virtualization menu.
>  source drivers/vhost/Kconfig
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index dc4f2fdf5e57..5cad0590205d 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -19,6 +19,10 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
>  kvm-intel-y		+= vmx.o pmu_intel.o
>  kvm-amd-y		+= svm.o pmu_amd.o
>  
> +kvm-ept-idle-y		+= ept_idle.o
> +
>  obj-$(CONFIG_KVM)	+= kvm.o
>  obj-$(CONFIG_KVM_INTEL)	+= kvm-intel.o
>  obj-$(CONFIG_KVM_AMD)	+= kvm-amd.o
> +
> +obj-$(CONFIG_KVM_EPT_IDLE)	+= kvm-ept-idle.o
> -- 
> 2.15.0
> 
> 
> 
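
If the rename were adopted, the entry might read as follows
(KVM_MMU_IDLE_INTEL is Sean's suggestion; the prompt string here is a
guess, and whether the module name changes too is left open):

config KVM_MMU_IDLE_INTEL
	tristate "KVM idle page tracking (Intel)"
	depends on KVM_INTEL
	depends on PROC_PAGE_MONITOR
	---help---
	  Provides support for walking EPT to get the A bits on Intel
	  processors equipped with the VT extensions.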

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 2/5] [PATCH 2/5] proc: introduce /proc/PID/idle_bitmap
  2018-09-01 11:28   ` Fengguang Wu
  (?)
  (?)
@ 2018-09-06 14:12   ` Dave Hansen
  -1 siblings, 0 replies; 20+ messages in thread
From: Dave Hansen @ 2018-09-06 14:12 UTC (permalink / raw)
  To: Fengguang Wu, Andrew Morton
  Cc: Linux Memory Management List, Huang Ying, Brendan Gregg,
	Peng DongX, Liu Jingqi, Dong Eddie, kvm, LKML

On 09/01/2018 04:28 AM, Fengguang Wu wrote:
> To walk 1TB memory of 4k active pages, it costs 2s vs 15s system
> time to scan the per-task/global idle bitmaps.

To me, that says this interface simply won't work on large systems.  2s
and 15s are both unacceptably long.
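
For scale (a back-of-envelope check): 1TB of 4KB pages is 2^28, about
268M pages, i.e. a 32MB one-bit-per-page bitmap, so the quoted 2s of
system time works out to roughly 134M pages (16MB of bitmap) scanned
per second.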

> OTOH, the per-task idle bitmap is not suitable in some situations:
> 
> - not accurate for shared pages
> - don't work with non-mapped file pages
> - don't perform well for sparse page tables (pointed out by Huang Ying)

OK, so we have a new ABI that doesn't work on large systems, consumes
lots of time and resources to query and isn't suitable in quite a few
situations.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 4/5] [PATCH 4/5] kvm-ept-idle: EPT page table walk for A bits
  2018-09-01 11:28   ` Fengguang Wu
  (?)
@ 2018-09-06 14:35   ` Dave Hansen
  -1 siblings, 0 replies; 20+ messages in thread
From: Dave Hansen @ 2018-09-06 14:35 UTC (permalink / raw)
  To: Fengguang Wu, Andrew Morton
  Cc: Linux Memory Management List, Peng DongX, Liu Jingqi, Dong Eddie,
	Huang Ying, Brendan Gregg, kvm, LKML

On 09/01/2018 04:28 AM, Fengguang Wu wrote:
> (2) would need fundamental changes to the interface. It seems existing solutions
> for sparse files like SEEK_HOLE/SEEK_DATA and FIEMAP ioctl may not serve this
> situation well. The most efficient way could be to fill user space read()
> buffer with an array of small extents:

I've only been doing kernel development a few short years, but I've
learned that designing user/kernel interfaces is hard.

A comment in an RFC saying that we need "fundamental changes to the
interface" seems to be more of a cry for help than a request for
comment.  This basically says to me: ignore the interface, it's broken.
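
Purely to illustrate the idea being quoted (the extent format itself is
cut off in the quote above, so this struct is hypothetical rather than
what the patch proposes):

struct idle_extent {
	__u32	type : 4;	/* e.g. idle / accessed / hole; hypothetical */
	__u32	nr   : 28;	/* run length, in pages */
};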

> This borrows the host page table walk macros/functions to do the EPT walk.
> So it depends on both table formats using the same number of levels.

Have you actually run this code?

How does this work?  It's walking the 'ept_root' that appears to be a
guest CR3 register value.  It doesn't appear to be the host CR3 value of
the qemu (or other hypervisor).

I'm also woefully confused why you are calling these EPT walks and then
walking the x86-style page tables.  EPT tables don't have the same
format as x86 page tables, plus they don't start at a CR3-provided value.

I'm also rather unsure that when running a VM, *any* host-format page
tables get updated A/D bits.  You need a host vaddr to do a host-format
page table walk in the host page tables, and the EPT tables do direct
guest physical to host physical translation.  There's no host vaddr
involved at all in those translations.
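
The format difference shows up, for instance, in the accessed-bit
positions (values from the kernel headers and the Intel SDM; shown only
for illustration):

#define _PAGE_BIT_ACCESSED	5		/* host x86 PTE: Accessed is bit 5
						   (a bit index, per pgtable_types.h) */
#define VMX_EPT_ACCESS_BIT	(1ull << 8)	/* EPT: Accessed is bit 8 (a mask, per
						   vmx.h), and only valid when A/D bits
						   are enabled in the EPTP */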

> +		if (!ept_pte_present(*pte) ||
> +		    !ept_pte_accessed(*pte))
> +			idle = 1;

Huh?  So, !Present and idle are treated the same?  If you had large
swaths of !Present memory, you would see that in this interface and say,
"gee, I've got a lot of idle memory to migrate" and then go do a bunch
of calls to migrate it?  That seems terminally wasteful.

Who is going to use this?

Do you have an example, at least dummy app?

Can more than one app read this data at the same time?  Who manages it?
Who owns it?  Are multiple reads destructive?

This entire set is almost entirely comment-free except for the
occasional C++-style comment.  That doesn't inspire confidence.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-09-06 14:35 UTC | newest]

Thread overview: 20+ messages
2018-09-01 11:28 [RFC][PATCH 0/5] introduce /proc/PID/idle_bitmap Fengguang Wu
2018-09-01 11:28 ` Fengguang Wu
2018-09-01 11:28 ` [RFC][PATCH 1/5] [PATCH 1/5] kvm: register in task_struct Fengguang Wu
2018-09-01 11:28   ` Fengguang Wu
2018-09-01 11:28 ` [RFC][PATCH 2/5] [PATCH 2/5] proc: introduce /proc/PID/idle_bitmap Fengguang Wu
2018-09-01 11:28   ` Fengguang Wu
2018-09-04 19:02   ` Sean Christopherson
2018-09-06 14:12   ` Dave Hansen
2018-09-01 11:28 ` [RFC][PATCH 3/5] [PATCH 3/5] kvm-ept-idle: HVA indexed EPT read Fengguang Wu
2018-09-01 11:28   ` Fengguang Wu
2018-09-04  7:57   ` Nikita Leshenko
2018-09-04  8:12     ` Peng, DongX
2018-09-04  8:15       ` Fengguang Wu
2018-09-01 11:28 ` [RFC][PATCH 4/5] [PATCH 4/5] kvm-ept-idle: EPT page table walk for A bits Fengguang Wu
2018-09-01 11:28   ` Fengguang Wu
2018-09-06 14:35   ` Dave Hansen
2018-09-01 11:28 ` [RFC][PATCH 5/5] [PATCH 5/5] kvm-ept-idle: enable module Fengguang Wu
2018-09-01 11:28   ` Fengguang Wu
2018-09-04 19:14   ` Sean Christopherson
2018-09-02  8:24 ` [RFC][PATCH 0/5] introduce /proc/PID/idle_bitmap Fengguang Wu
