linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
@ 2019-02-04 20:18 Nitesh Narayan Lal
  2019-02-04 20:18 ` [RFC][Patch v8 1/7] KVM: Support for guest free page hinting Nitesh Narayan Lal
                   ` (11 more replies)
  0 siblings, 12 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-04 20:18 UTC (permalink / raw)
  To: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk, dhildenb,
	aarcange

The following patch set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables a guest with little or no page cache to rapidly free memory to the host and to reclaim it back when needed.

Benefit:
With this patch series, in our test case, executed on a single system and single NUMA node with 15GB of memory, we were able to successfully launch at least 5 guests
when page hinting was enabled, compared to 3 without it. (A detailed explanation of the test procedure is provided at the bottom.)

Changelog in V8:
In this patch series, the earlier approach [1], which was used to capture and scan the pages freed by the guest, has been changed. The new approach is briefly described below:

The patch set still leverages the existing arch_free_page() to add this functionality. It maintains a per-CPU array which is used to store the pages freed by the guest. The maximum number of entries the array can hold is defined by MAX_FGPT_ENTRIES (1000). When the array is completely filled, it is scanned and only the pages which are available in the buddy are retained. This process continues until the array is filled with pages which are part of the buddy free list, at which point a per-CPU kernel thread is woken up.
This kernel thread rescans the per-CPU array for any re-allocation, and if a page has not been reallocated and is still present in the buddy, it attempts to isolate the page from the buddy. If the page is successfully isolated, it is added to another per-CPU array. Once the entire scanning process is complete, all the isolated pages are reported to the host through the existing virtio-balloon driver.

Known Issues:
	* Fixed array size: The problem with having a fixed/hardcoded array size arises when the size of the guest varies. For example, when the guest size increases and it starts making large allocations, the fixed size limits this solution's ability to capture all the freed pages. This results in less guest free memory being reported to the host.

Known code re-work:
	* Plan to re-use Wei's work, which communicates the poison value to the host.
	* The nomenclature used in virtio-balloon needs to be changed so that the code can easily be distinguished from Wei's Free Page Hint code.
	* Sorting based on zonenum, to avoid repetitive zone locks for the same zone.

Other required work:
	* Run other benchmarks to evaluate the performance/impact of this approach.

Test case:
Setup:
Memory-15837 MB
Guest Memory Size-5 GB
Swap-Disabled
Test Program-A simple program which allocates 4GB of memory via malloc, touches it via memset, and exits.
Use case-Number of guests that can be launched completely including the successful execution of the test program.
Procedure: 
The first guest is launched and, once its console is up, the test allocation program is executed with a 4 GB memory request (due to this, the guest occupies almost 4-5 GB of memory on the host in a system without page hinting). Once this program exits, another guest is launched on the host and the same process is followed. We continue launching guests until a guest gets killed due to a low-memory condition on the host.

Result:
Without Hinting-3 Guests
With Hinting-5 to 7 Guests (based on the amount of memory freed/captured).

[1] https://www.spinics.net/lists/kvm/msg170113.html 



^ permalink raw reply	[flat|nested] 116+ messages in thread

* [RFC][Patch v8 1/7] KVM: Support for guest free page hinting
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
@ 2019-02-04 20:18 ` Nitesh Narayan Lal
  2019-02-05  4:14   ` Michael S. Tsirkin
  2019-02-04 20:18 ` [RFC][Patch v8 2/7] KVM: Enabling guest free page hinting via static key Nitesh Narayan Lal
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-04 20:18 UTC (permalink / raw)
  To: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk, dhildenb,
	aarcange

This patch includes the following:
1. Basic skeleton for the support
2. Enablement of the x86 platform to use it

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 arch/x86/Kbuild              |  2 +-
 arch/x86/kvm/Kconfig         |  8 ++++++++
 arch/x86/kvm/Makefile        |  2 ++
 include/linux/gfp.h          |  9 +++++++++
 include/linux/page_hinting.h | 17 +++++++++++++++++
 virt/kvm/page_hinting.c      | 36 ++++++++++++++++++++++++++++++++++++
 6 files changed, 73 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/page_hinting.h
 create mode 100644 virt/kvm/page_hinting.c

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index c625f57472f7..3244df4ee311 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -2,7 +2,7 @@ obj-y += entry/
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
-obj-$(CONFIG_KVM) += kvm/
+obj-$(subst m,y,$(CONFIG_KVM)) += kvm/
 
 # Xen paravirtualization support
 obj-$(CONFIG_XEN) += xen/
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..2fae31459706 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,14 @@ config KVM_MMU_AUDIT
 	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
 	 auditing of KVM MMU events at runtime.
 
+# KVM_FREE_PAGE_HINTING will allow the guest to report the free pages to the
+# host in regular interval of time.
+config KVM_FREE_PAGE_HINTING
+       def_bool y
+       depends on KVM
+       select VIRTIO
+       select VIRTIO_BALLOON
+
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
 source "drivers/vhost/Kconfig"
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 69b3a7c30013..78640a80501e 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -16,6 +16,8 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
 			   hyperv.o page_track.o debugfs.o
 
+obj-$(CONFIG_KVM_FREE_PAGE_HINTING)    += $(KVM)/page_hinting.o
+
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
 kvm-amd-y		+= svm.o pmu_amd.o
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5f5e25fd6149..e596527284ba 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -7,6 +7,7 @@
 #include <linux/stddef.h>
 #include <linux/linkage.h>
 #include <linux/topology.h>
+#include <linux/page_hinting.h>
 
 struct vm_area_struct;
 
@@ -456,6 +457,14 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
 }
 
+#ifdef	CONFIG_KVM_FREE_PAGE_HINTING
+#define HAVE_ARCH_FREE_PAGE
+static inline void arch_free_page(struct page *page, int order)
+{
+	guest_free_page(page, order);
+}
+#endif
+
 #ifndef HAVE_ARCH_FREE_PAGE
 static inline void arch_free_page(struct page *page, int order) { }
 #endif
diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
new file mode 100644
index 000000000000..b54f7428f348
--- /dev/null
+++ b/include/linux/page_hinting.h
@@ -0,0 +1,17 @@
+/*
+ * Size of the array which is used to store the freed pages is defined by
+ * MAX_FGPT_ENTRIES. If possible, we have to find a better way using which
+ * we can get rid of the hardcoded array size.
+ */
+#define MAX_FGPT_ENTRIES	1000
+/*
+ * hypervisor_pages - It is a dummy structure passed with the hypercall.
+ * @pfn: page frame number for the page which needs to be sent to the host.
+ * @order: order of the page needs to be reported to the host.
+ */
+struct hypervisor_pages {
+	unsigned long pfn;
+	unsigned int order;
+};
+
+void guest_free_page(struct page *page, int order);
diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
new file mode 100644
index 000000000000..818bd6b84e0c
--- /dev/null
+++ b/virt/kvm/page_hinting.c
@@ -0,0 +1,36 @@
+#include <linux/gfp.h>
+#include <linux/mm.h>
+#include <linux/kernel.h>
+
+/*
+ * struct kvm_free_pages - Tracks the pages which are freed by the guest.
+ * @pfn: page frame number for the page which is freed.
+ * @order: order corresponding to the page freed.
+ * @zonenum: zone number to which the freed page belongs.
+ */
+struct kvm_free_pages {
+	unsigned long pfn;
+	unsigned int order;
+	int zonenum;
+};
+
+/*
+ * struct page_hinting - holds array objects for the structures used to track
+ * guest free pages, along with an index variable for each of them.
+ * @kvm_pt: array object for the structure kvm_free_pages.
+ * @kvm_pt_idx: index for kvm_free_pages object.
+ * @hypervisor_pagelist: array object for the structure hypervisor_pages.
+ * @hyp_idx: index for hypervisor_pages object.
+ */
+struct page_hinting {
+	struct kvm_free_pages kvm_pt[MAX_FGPT_ENTRIES];
+	int kvm_pt_idx;
+	struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
+	int hyp_idx;
+};
+
+DEFINE_PER_CPU(struct page_hinting, hinting_obj);
+
+void guest_free_page(struct page *page, int order)
+{
+}
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [RFC][Patch v8 2/7] KVM: Enabling guest free page hinting via static key
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
  2019-02-04 20:18 ` [RFC][Patch v8 1/7] KVM: Support for guest free page hinting Nitesh Narayan Lal
@ 2019-02-04 20:18 ` Nitesh Narayan Lal
  2019-02-08 18:07   ` Alexander Duyck
  2019-02-04 20:18 ` [RFC][Patch v8 3/7] KVM: Guest free page hinting functional skeleton Nitesh Narayan Lal
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-04 20:18 UTC (permalink / raw)
  To: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk, dhildenb,
	aarcange

This patch allows guest free page hinting support to be
enabled or disabled at runtime via a static key which
can be set through sysctl.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 include/linux/gfp.h          |  2 ++
 include/linux/page_hinting.h |  5 +++++
 kernel/sysctl.c              |  9 +++++++++
 virt/kvm/page_hinting.c      | 23 +++++++++++++++++++++++
 4 files changed, 39 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index e596527284ba..8389219a076a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -461,6 +461,8 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 #define HAVE_ARCH_FREE_PAGE
 static inline void arch_free_page(struct page *page, int order)
 {
+	if (!static_branch_unlikely(&guest_page_hinting_key))
+		return;
 	guest_free_page(page, order);
 }
 #endif
diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
index b54f7428f348..9bdcf63e1306 100644
--- a/include/linux/page_hinting.h
+++ b/include/linux/page_hinting.h
@@ -14,4 +14,9 @@ struct hypervisor_pages {
 	unsigned int order;
 };
 
+extern int guest_page_hinting_flag;
+extern struct static_key_false guest_page_hinting_key;
+
+int guest_page_hinting_sysctl(struct ctl_table *table, int write,
+			      void __user *buffer, size_t *lenp, loff_t *ppos);
 void guest_free_page(struct page *page, int order);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ba4d9e85feb8..5d53629c9bfb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1690,6 +1690,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= (void *)&mmap_rnd_compat_bits_min,
 		.extra2		= (void *)&mmap_rnd_compat_bits_max,
 	},
+#endif
+#ifdef CONFIG_KVM_FREE_PAGE_HINTING
+	{
+		.procname	= "guest-page-hinting",
+		.data		= &guest_page_hinting_flag,
+		.maxlen		= sizeof(guest_page_hinting_flag),
+		.mode		= 0644,
+		.proc_handler   = guest_page_hinting_sysctl,
+	},
 #endif
 	{ }
 };
diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
index 818bd6b84e0c..4a34ea8db0c8 100644
--- a/virt/kvm/page_hinting.c
+++ b/virt/kvm/page_hinting.c
@@ -1,6 +1,7 @@
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/kernel.h>
+#include <linux/kvm_host.h>
 
 /*
  * struct kvm_free_pages - Tracks the pages which are freed by the guest.
@@ -31,6 +32,28 @@ struct page_hinting {
 
 DEFINE_PER_CPU(struct page_hinting, hinting_obj);
 
+struct static_key_false guest_page_hinting_key  = STATIC_KEY_FALSE_INIT;
+EXPORT_SYMBOL(guest_page_hinting_key);
+static DEFINE_MUTEX(hinting_mutex);
+int guest_page_hinting_flag;
+EXPORT_SYMBOL(guest_page_hinting_flag);
+
+int guest_page_hinting_sysctl(struct ctl_table *table, int write,
+			      void __user *buffer, size_t *lenp,
+			      loff_t *ppos)
+{
+	int ret;
+
+	mutex_lock(&hinting_mutex);
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (guest_page_hinting_flag)
+		static_key_enable(&guest_page_hinting_key.key);
+	else
+		static_key_disable(&guest_page_hinting_key.key);
+	mutex_unlock(&hinting_mutex);
+	return ret;
+}
+
 void guest_free_page(struct page *page, int order)
 {
 }
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [RFC][Patch v8 3/7] KVM: Guest free page hinting functional skeleton
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
  2019-02-04 20:18 ` [RFC][Patch v8 1/7] KVM: Support for guest free page hinting Nitesh Narayan Lal
  2019-02-04 20:18 ` [RFC][Patch v8 2/7] KVM: Enabling guest free page hinting via static key Nitesh Narayan Lal
@ 2019-02-04 20:18 ` Nitesh Narayan Lal
  2019-02-04 20:18 ` [RFC][Patch v8 4/7] KVM: Disabling page poisoning to prevent corruption Nitesh Narayan Lal
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-04 20:18 UTC (permalink / raw)
  To: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk, dhildenb,
	aarcange

This patch adds the functional skeleton for the guest implementation.
It also enables the guest to maintain the list of pages which are
freed by the guest. Once the list is full, guest_free_page() invokes
scan_array(), which wakes up the kernel thread responsible for further
processing.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 include/linux/page_hinting.h |  3 ++
 virt/kvm/page_hinting.c      | 60 +++++++++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
index 9bdcf63e1306..2d7ff59f3f6a 100644
--- a/include/linux/page_hinting.h
+++ b/include/linux/page_hinting.h
@@ -1,3 +1,5 @@
+#include <linux/smpboot.h>
+
 /*
  * Size of the array which is used to store the freed pages is defined by
  * MAX_FGPT_ENTRIES. If possible, we have to find a better way using which
@@ -16,6 +18,7 @@ struct hypervisor_pages {
 
 extern int guest_page_hinting_flag;
 extern struct static_key_false guest_page_hinting_key;
+extern struct smp_hotplug_thread hinting_threads;
 
 int guest_page_hinting_sysctl(struct ctl_table *table, int write,
 			      void __user *buffer, size_t *lenp, loff_t *ppos);
diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
index 4a34ea8db0c8..636990e7fbb3 100644
--- a/virt/kvm/page_hinting.c
+++ b/virt/kvm/page_hinting.c
@@ -1,7 +1,7 @@
 #include <linux/gfp.h>
 #include <linux/mm.h>
-#include <linux/kernel.h>
 #include <linux/kvm_host.h>
+#include <linux/kernel.h>
 
 /*
  * struct kvm_free_pages - Tracks the pages which are freed by the guest.
@@ -37,6 +37,7 @@ EXPORT_SYMBOL(guest_page_hinting_key);
 static DEFINE_MUTEX(hinting_mutex);
 int guest_page_hinting_flag;
 EXPORT_SYMBOL(guest_page_hinting_flag);
+static DEFINE_PER_CPU(struct task_struct *, hinting_task);
 
 int guest_page_hinting_sysctl(struct ctl_table *table, int write,
 			      void __user *buffer, size_t *lenp,
@@ -54,6 +55,63 @@ int guest_page_hinting_sysctl(struct ctl_table *table, int write,
 	return ret;
 }
 
+static void hinting_fn(unsigned int cpu)
+{
+	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
+
+	page_hinting_obj->kvm_pt_idx = 0;
+	put_cpu_var(hinting_obj);
+}
+
+void scan_array(void)
+{
+	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
+
+	if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES)
+		wake_up_process(__this_cpu_read(hinting_task));
+}
+
+static int hinting_should_run(unsigned int cpu)
+{
+	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
+	int free_page_idx = page_hinting_obj->kvm_pt_idx;
+
+	if (free_page_idx == MAX_FGPT_ENTRIES)
+		return 1;
+	else
+		return 0;
+}
+
+struct smp_hotplug_thread hinting_threads = {
+	.store			= &hinting_task,
+	.thread_should_run	= hinting_should_run,
+	.thread_fn		= hinting_fn,
+	.thread_comm		= "hinting/%u",
+	.selfparking		= false,
+};
+EXPORT_SYMBOL(hinting_threads);
+
 void guest_free_page(struct page *page, int order)
 {
+	unsigned long flags;
+	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
+	/*
+	 * use of global variables may trigger a race condition between irq and
+	 * process context causing unwanted overwrites. This will be replaced
+	 * with a better solution to prevent such race conditions.
+	 */
+
+	local_irq_save(flags);
+	if (page_hinting_obj->kvm_pt_idx != MAX_FGPT_ENTRIES) {
+		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].pfn =
+							page_to_pfn(page);
+		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].zonenum =
+							page_zonenum(page);
+		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].order =
+							order;
+		page_hinting_obj->kvm_pt_idx += 1;
+		if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES)
+			scan_array();
+	}
+	local_irq_restore(flags);
 }
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [RFC][Patch v8 4/7] KVM: Disabling page poisoning to prevent corruption
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
                   ` (2 preceding siblings ...)
  2019-02-04 20:18 ` [RFC][Patch v8 3/7] KVM: Guest free page hinting functional skeleton Nitesh Narayan Lal
@ 2019-02-04 20:18 ` Nitesh Narayan Lal
  2019-02-07 17:23   ` Alexander Duyck
  2019-02-07 21:08   ` Michael S. Tsirkin
  2019-02-04 20:18 ` [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host Nitesh Narayan Lal
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-04 20:18 UTC (permalink / raw)
  To: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk, dhildenb,
	aarcange

This patch disables page poisoning if guest page hinting is enabled.
This is required to avoid possible guest memory corruption errors.
Page poisoning is a feature in which the page is filled with a specific
pattern (0x00 or 0xaa) after arch_free_page(), and the same is verified
before arch_alloc_page(), to prevent the following issues:
    * information leak from the freed data
    * use-after-free bugs
    * memory corruption
Selection of the pattern depends on CONFIG_PAGE_POISONING_ZERO.
Once the guest pages which are supposed to be freed are reported to the
hypervisor, it frees them. After freeing the pages in the global list,
either of the following may happen:
    * the hypervisor reallocates the freed memory back to the guest
    * the hypervisor frees the memory and maps different physical memory
In order to prevent any information leak, the hypervisor fills the
memory with zeroes before allocating it to the guest.
The issue arises when the pattern used for page poisoning is 0xaa while
the newly allocated page received from the hypervisor by the guest is
filled with the pattern 0x00. This results in memory corruption errors.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 include/linux/page_hinting.h | 8 ++++++++
 mm/page_poison.c             | 2 +-
 virt/kvm/page_hinting.c      | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
index 2d7ff59f3f6a..e800c6b07561 100644
--- a/include/linux/page_hinting.h
+++ b/include/linux/page_hinting.h
@@ -19,7 +19,15 @@ struct hypervisor_pages {
 extern int guest_page_hinting_flag;
 extern struct static_key_false guest_page_hinting_key;
 extern struct smp_hotplug_thread hinting_threads;
+extern bool want_page_poisoning;
 
 int guest_page_hinting_sysctl(struct ctl_table *table, int write,
 			      void __user *buffer, size_t *lenp, loff_t *ppos);
 void guest_free_page(struct page *page, int order);
+
+static inline void disable_page_poisoning(void)
+{
+#ifdef CONFIG_PAGE_POISONING
+	want_page_poisoning = 0;
+#endif
+}
diff --git a/mm/page_poison.c b/mm/page_poison.c
index f0c15e9017c0..9af96021133b 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -7,7 +7,7 @@
 #include <linux/poison.h>
 #include <linux/ratelimit.h>
 
-static bool want_page_poisoning __read_mostly;
+bool want_page_poisoning __read_mostly;
 
 static int __init early_page_poison_param(char *buf)
 {
diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
index 636990e7fbb3..be529f6f2bc0 100644
--- a/virt/kvm/page_hinting.c
+++ b/virt/kvm/page_hinting.c
@@ -103,6 +103,7 @@ void guest_free_page(struct page *page, int order)
 
 	local_irq_save(flags);
 	if (page_hinting_obj->kvm_pt_idx != MAX_FGPT_ENTRIES) {
+		disable_page_poisoning();
 		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].pfn =
 							page_to_pfn(page);
 		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].zonenum =
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
                   ` (3 preceding siblings ...)
  2019-02-04 20:18 ` [RFC][Patch v8 4/7] KVM: Disabling page poisoning to prevent corruption Nitesh Narayan Lal
@ 2019-02-04 20:18 ` Nitesh Narayan Lal
  2019-02-05 20:49   ` Michael S. Tsirkin
  2019-02-04 20:18 ` [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages Nitesh Narayan Lal
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-04 20:18 UTC (permalink / raw)
  To: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk, dhildenb,
	aarcange

This patch enables the caller to expose a single buffer to the
other end using a vring descriptor. It also allows the caller to
perform this action in a synchronous manner by using virtqueue_kick_sync.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/virtio/virtio_ring.c | 72 ++++++++++++++++++++++++++++++++++++
 include/linux/virtio.h       |  4 ++
 2 files changed, 76 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index cd7e755484e3..93c161ac6a28 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1695,6 +1695,52 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 					out_sgs, in_sgs, data, ctx, gfp);
 }
 
+/**
+ * virtqueue_add_desc - add a buffer to a chain using a vring desc
+ * @vq: the struct virtqueue we're talking about.
+ * @addr: address of the buffer to add.
+ * @len: length of the buffer.
+ * @in: set if the buffer is for the device to write.
+ *
+ * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
+ */
+int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_desc *desc = vq->split.vring.desc;
+	u16 flags = in ? VRING_DESC_F_WRITE : 0;
+	unsigned int i;
+	void *data = (void *)addr;
+	int avail_idx;
+
+	/* Sanity check */
+	if (!_vq)
+		return -EINVAL;
+
+	START_USE(vq);
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return -EIO;
+	}
+
+	i = vq->free_head;
+	flags &= ~VRING_DESC_F_NEXT;
+	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
+	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
+	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
+
+	vq->vq.num_free--;
+	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
+	vq->split.desc_state[i].data = data;
+	vq->split.avail_idx_shadow = 1;
+	avail_idx = vq->split.avail_idx_shadow;
+	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, avail_idx);
+	vq->num_added = 1;
+	END_USE(vq);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtqueue_add_desc);
+
 /**
  * virtqueue_add_sgs - expose buffers to other end
  * @vq: the struct virtqueue we're talking about.
@@ -1842,6 +1888,32 @@ bool virtqueue_notify(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_notify);
 
+/**
+ * virtqueue_kick_sync - update after add_buf and busy wait till update is done
+ * @vq: the struct virtqueue
+ *
+ * After one or more virtqueue_add_* calls, invoke this to kick
+ * the other side. Busy wait till the other side is done with the update.
+ *
+ * Caller must ensure we don't call this with other virtqueue
+ * operations at the same time (except where noted).
+ *
+ * Returns false if kick failed, otherwise true.
+ */
+bool virtqueue_kick_sync(struct virtqueue *vq)
+{
+	u32 len;
+
+	if (likely(virtqueue_kick(vq))) {
+		while (!virtqueue_get_buf(vq, &len) &&
+		       !virtqueue_is_broken(vq))
+			cpu_relax();
+		return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
+
 /**
  * virtqueue_kick - update after add_buf
  * @vq: the struct virtqueue
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index fa1b5da2804e..58943a3a0e8d 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -57,6 +57,10 @@ int virtqueue_add_sgs(struct virtqueue *vq,
 		      unsigned int in_sgs,
 		      void *data,
 		      gfp_t gfp);
+/* A desc with this init id is treated as an invalid desc */
+int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in);
+
+bool virtqueue_kick_sync(struct virtqueue *vq);
 
 bool virtqueue_kick(struct virtqueue *vq);
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
                   ` (4 preceding siblings ...)
  2019-02-04 20:18 ` [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host Nitesh Narayan Lal
@ 2019-02-04 20:18 ` Nitesh Narayan Lal
  2019-02-05 20:45   ` Michael S. Tsirkin
  2019-02-04 20:18 ` [RFC][Patch v8 7/7] KVM: Adding tracepoints for guest page hinting Nitesh Narayan Lal
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-04 20:18 UTC (permalink / raw)
  To: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk, dhildenb,
	aarcange

This patch enables the kernel to scan the per-CPU array and
compress it by removing the repetitive/re-allocated pages.
Once the per-CPU array is completely filled with pages in the
buddy, it wakes up the per-CPU kernel thread, which re-scans the
entire array while acquiring the zone lock corresponding to
the page being scanned. If the page is still free and
present in the buddy, the thread tries to isolate it and adds it
to another per-CPU array.

Once this scanning process is complete, if any isolated pages
have been added to the new per-CPU array, the kernel thread
invokes hyperlist_ready().

In hyperlist_ready() a hypercall is made to report these pages to
the host using the virtio-balloon framework. In order to do so,
another virtqueue, 'hinting_vq', is added to the balloon framework.
Once the host has freed all the reported pages, the kernel thread
returns them to the buddy.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/virtio/virtio_balloon.c     |  56 +++++++-
 include/linux/page_hinting.h        |  18 ++-
 include/uapi/linux/virtio_balloon.h |   1 +
 mm/page_alloc.c                     |   2 +-
 virt/kvm/page_hinting.c             | 202 +++++++++++++++++++++++++++-
 5 files changed, 269 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 728ecd1eea30..8af34e0b9a32 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -57,13 +57,15 @@ enum virtio_balloon_vq {
 	VIRTIO_BALLOON_VQ_INFLATE,
 	VIRTIO_BALLOON_VQ_DEFLATE,
 	VIRTIO_BALLOON_VQ_STATS,
+	VIRTIO_BALLOON_VQ_HINTING,
 	VIRTIO_BALLOON_VQ_FREE_PAGE,
 	VIRTIO_BALLOON_VQ_MAX
 };
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
+								*hinting_vq;
 
 	/* Balloon's own wq for cpu-intensive work items */
 	struct workqueue_struct *balloon_wq;
@@ -122,6 +124,40 @@ static struct virtio_device_id id_table[] = {
 	{ 0 },
 };
 
+#ifdef CONFIG_KVM_FREE_PAGE_HINTING
+void virtballoon_page_hinting(struct virtio_balloon *vb, u64 gvaddr,
+			      int hyper_entries)
+{
+	u64 gpaddr = virt_to_phys((void *)gvaddr);
+
+	virtqueue_add_desc(vb->hinting_vq, gpaddr, hyper_entries, 0);
+	virtqueue_kick_sync(vb->hinting_vq);
+}
+
+static void hinting_ack(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	wake_up(&vb->acked);
+}
+
+static void enable_hinting(struct virtio_balloon *vb)
+{
+	guest_page_hinting_flag = 1;
+	static_branch_enable(&guest_page_hinting_key);
+	request_hypercall = (void *)&virtballoon_page_hinting;
+	balloon_ptr = vb;
+	WARN_ON(smpboot_register_percpu_thread(&hinting_threads));
+}
+
+static void disable_hinting(void)
+{
+	guest_page_hinting_flag = 0;
+	static_branch_enable(&guest_page_hinting_key);
+	balloon_ptr = NULL;
+}
+#endif
+
 static u32 page_to_balloon_pfn(struct page *page)
 {
 	unsigned long pfn = page_to_pfn(page);
@@ -481,6 +517,7 @@ static int init_vqs(struct virtio_balloon *vb)
 	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
 	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
 	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+	names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
 
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -492,11 +529,18 @@ static int init_vqs(struct virtio_balloon *vb)
 		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	}
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
+		names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
+		callbacks[VIRTIO_BALLOON_VQ_HINTING] = hinting_ack;
+	}
 	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
 					 vqs, callbacks, names, NULL, NULL);
 	if (err)
 		return err;
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
+		vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
+
 	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
 	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
@@ -908,6 +952,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		if (err)
 			goto out_del_balloon_wq;
 	}
+
+#ifdef CONFIG_KVM_FREE_PAGE_HINTING
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
+		enable_hinting(vb);
+#endif
 	virtio_device_ready(vdev);
 
 	if (towards_target(vb))
@@ -950,6 +999,10 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
+#ifdef CONFIG_KVM_FREE_PAGE_HINTING
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
+		disable_hinting();
+#endif
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
 		cancel_work_sync(&vb->report_free_page_work);
 		destroy_workqueue(vb->balloon_wq);
@@ -1009,6 +1062,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_HINTING,
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 	VIRTIO_BALLOON_F_PAGE_POISON,
 };
diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
index e800c6b07561..3ba8c1f3b4a4 100644
--- a/include/linux/page_hinting.h
+++ b/include/linux/page_hinting.h
@@ -1,15 +1,12 @@
 #include <linux/smpboot.h>
 
-/*
- * Size of the array which is used to store the freed pages is defined by
- * MAX_FGPT_ENTRIES. If possible, we have to find a better way using which
- * we can get rid of the hardcoded array size.
- */
 #define MAX_FGPT_ENTRIES	1000
 /*
  * hypervisor_pages - It is a dummy structure passed with the hypercall.
- * @pfn: page frame number for the page which needs to be sent to the host.
- * @order: order of the page needs to be reported to the host.
+ * @pfn - page frame number of the first page which is to be freed.
+ * @order - order of the allocation which is to be freed.
+ * A global array object is used to hold the list of pfn/order entries and is
+ * passed as part of the hypercall.
  */
 struct hypervisor_pages {
 	unsigned long pfn;
@@ -19,11 +16,18 @@ struct hypervisor_pages {
 extern int guest_page_hinting_flag;
 extern struct static_key_false guest_page_hinting_key;
 extern struct smp_hotplug_thread hinting_threads;
+extern void (*request_hypercall)(void *, u64, int);
+extern void *balloon_ptr;
 extern bool want_page_poisoning;
 
 int guest_page_hinting_sysctl(struct ctl_table *table, int write,
 			      void __user *buffer, size_t *lenp, loff_t *ppos);
 void guest_free_page(struct page *page, int order);
+extern int __isolate_free_page(struct page *page, unsigned int order);
+extern void free_one_page(struct zone *zone,
+			  struct page *page, unsigned long pfn,
+			  unsigned int order,
+			  int migratetype);
 
 static inline void disable_page_poisoning(void)
 {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index a1966cd7b677..2b0f62814e22 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d295c9bc01a8..93224cba9243 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1199,7 +1199,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock(&zone->lock);
 }
 
-static void free_one_page(struct zone *zone,
+void free_one_page(struct zone *zone,
 				struct page *page, unsigned long pfn,
 				unsigned int order,
 				int migratetype)
diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
index be529f6f2bc0..315099fcda43 100644
--- a/virt/kvm/page_hinting.c
+++ b/virt/kvm/page_hinting.c
@@ -1,6 +1,8 @@
 #include <linux/gfp.h>
 #include <linux/mm.h>
+#include <linux/page_ref.h>
 #include <linux/kvm_host.h>
+#include <linux/sort.h>
 #include <linux/kernel.h>
 
 /*
@@ -39,6 +41,11 @@ int guest_page_hinting_flag;
 EXPORT_SYMBOL(guest_page_hinting_flag);
 static DEFINE_PER_CPU(struct task_struct *, hinting_task);
 
+void (*request_hypercall)(void *, u64, int);
+EXPORT_SYMBOL(request_hypercall);
+void *balloon_ptr;
+EXPORT_SYMBOL(balloon_ptr);
+
 int guest_page_hinting_sysctl(struct ctl_table *table, int write,
 			      void __user *buffer, size_t *lenp,
 			      loff_t *ppos)
@@ -55,18 +62,201 @@ int guest_page_hinting_sysctl(struct ctl_table *table, int write,
 	return ret;
 }
 
+void hyperlist_ready(struct hypervisor_pages *guest_isolated_pages, int entries)
+{
+	int i = 0;
+	int mt = 0;
+
+	if (balloon_ptr)
+		request_hypercall(balloon_ptr, (u64)&guest_isolated_pages[0],
+				  entries);
+
+	while (i < entries) {
+		struct page *page = pfn_to_page(guest_isolated_pages[i].pfn);
+
+		mt = get_pageblock_migratetype(page);
+		free_one_page(page_zone(page), page, page_to_pfn(page),
+			      guest_isolated_pages[i].order, mt);
+		i++;
+	}
+}
+
+struct page *get_buddy_page(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned int order;
+
+	for (order = 0; order < MAX_ORDER; order++) {
+		struct page *page_head = page - (pfn & ((1 << order) - 1));
+
+		if (PageBuddy(page_head) && page_private(page_head) >= order)
+			return page_head;
+	}
+	return NULL;
+}
+
 static void hinting_fn(unsigned int cpu)
 {
 	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
+	int idx = 0, ret = 0;
+	struct zone *zone_cur;
+	unsigned long flags = 0;
+
+	while (idx < MAX_FGPT_ENTRIES) {
+		unsigned long pfn = page_hinting_obj->kvm_pt[idx].pfn;
+		unsigned long pfn_end = page_hinting_obj->kvm_pt[idx].pfn +
+			(1 << page_hinting_obj->kvm_pt[idx].order) - 1;
+
+		while (pfn <= pfn_end) {
+			struct page *page = pfn_to_page(pfn);
+			struct page *buddy_page = NULL;
+
+			zone_cur = page_zone(page);
+			spin_lock_irqsave(&zone_cur->lock, flags);
+
+			if (PageCompound(page)) {
+				struct page *head_page = compound_head(page);
+				unsigned long head_pfn = page_to_pfn(head_page);
+				unsigned int alloc_pages =
+					1 << compound_order(head_page);
+
+				pfn = head_pfn + alloc_pages;
+				spin_unlock_irqrestore(&zone_cur->lock, flags);
+				continue;
+			}
+
+			if (page_ref_count(page)) {
+				pfn++;
+				spin_unlock_irqrestore(&zone_cur->lock, flags);
+				continue;
+			}
+
+			if (PageBuddy(page)) {
+				int buddy_order = page_private(page);
 
+				ret = __isolate_free_page(page, buddy_order);
+				if (!ret) {
+				} else {
+					int l_idx = page_hinting_obj->hyp_idx;
+					struct hypervisor_pages *l_obj =
+					page_hinting_obj->hypervisor_pagelist;
+
+					l_obj[l_idx].pfn = pfn;
+					l_obj[l_idx].order = buddy_order;
+					page_hinting_obj->hyp_idx += 1;
+				}
+				pfn = pfn + (1 << buddy_order);
+				spin_unlock_irqrestore(&zone_cur->lock, flags);
+				continue;
+			}
+
+			buddy_page = get_buddy_page(page);
+			if (buddy_page) {
+				int buddy_order = page_private(buddy_page);
+
+				ret = __isolate_free_page(buddy_page,
+							  buddy_order);
+				if (!ret) {
+				} else {
+					int l_idx = page_hinting_obj->hyp_idx;
+					struct hypervisor_pages *l_obj =
+					page_hinting_obj->hypervisor_pagelist;
+					unsigned long buddy_pfn =
+						page_to_pfn(buddy_page);
+
+					l_obj[l_idx].pfn = buddy_pfn;
+					l_obj[l_idx].order = buddy_order;
+					page_hinting_obj->hyp_idx += 1;
+				}
+				pfn = page_to_pfn(buddy_page) +
+					(1 << buddy_order);
+				spin_unlock_irqrestore(&zone_cur->lock, flags);
+				continue;
+			}
+			spin_unlock_irqrestore(&zone_cur->lock, flags);
+			pfn++;
+		}
+		page_hinting_obj->kvm_pt[idx].pfn = 0;
+		page_hinting_obj->kvm_pt[idx].order = -1;
+		page_hinting_obj->kvm_pt[idx].zonenum = -1;
+		idx++;
+	}
+	if (page_hinting_obj->hyp_idx > 0) {
+		hyperlist_ready(page_hinting_obj->hypervisor_pagelist,
+				page_hinting_obj->hyp_idx);
+		page_hinting_obj->hyp_idx = 0;
+	}
 	page_hinting_obj->kvm_pt_idx = 0;
 	put_cpu_var(hinting_obj);
 }
 
+int if_exist(struct page *page)
+{
+	int i = 0;
+	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
+
+	while (i < MAX_FGPT_ENTRIES) {
+		if (page_to_pfn(page) == page_hinting_obj->kvm_pt[i].pfn)
+			return 1;
+		i++;
+	}
+	return 0;
+}
+
+void pack_array(void)
+{
+	int i = 0, j = 0;
+	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
+
+	while (i < MAX_FGPT_ENTRIES) {
+		if (page_hinting_obj->kvm_pt[i].pfn != 0) {
+			if (i != j) {
+				page_hinting_obj->kvm_pt[j].pfn =
+					page_hinting_obj->kvm_pt[i].pfn;
+				page_hinting_obj->kvm_pt[j].order =
+					page_hinting_obj->kvm_pt[i].order;
+				page_hinting_obj->kvm_pt[j].zonenum =
+					page_hinting_obj->kvm_pt[i].zonenum;
+			}
+			j++;
+		}
+		i++;
+	}
+	i = j;
+	page_hinting_obj->kvm_pt_idx = j;
+	while (j < MAX_FGPT_ENTRIES) {
+		page_hinting_obj->kvm_pt[j].pfn = 0;
+		page_hinting_obj->kvm_pt[j].order = -1;
+		page_hinting_obj->kvm_pt[j].zonenum = -1;
+		j++;
+	}
+}
+
 void scan_array(void)
 {
 	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
+	int i = 0;
 
+	while (i < MAX_FGPT_ENTRIES) {
+		struct page *page =
+			pfn_to_page(page_hinting_obj->kvm_pt[i].pfn);
+		struct page *buddy_page = get_buddy_page(page);
+
+		if (!PageBuddy(page) && buddy_page) {
+			if (if_exist(buddy_page)) {
+				page_hinting_obj->kvm_pt[i].pfn = 0;
+				page_hinting_obj->kvm_pt[i].order = -1;
+				page_hinting_obj->kvm_pt[i].zonenum = -1;
+			} else {
+				page_hinting_obj->kvm_pt[i].pfn =
+					page_to_pfn(buddy_page);
+				page_hinting_obj->kvm_pt[i].order =
+					page_private(buddy_page);
+			}
+		}
+		i++;
+	}
+	pack_array();
 	if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES)
 		wake_up_process(__this_cpu_read(hinting_task));
 }
@@ -111,8 +301,18 @@ void guest_free_page(struct page *page, int order)
 		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].order =
 							order;
 		page_hinting_obj->kvm_pt_idx += 1;
-		if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES)
+		if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES) {
+			/*
+			 * We are depending on the buddy free-list to identify
+			 * if a page is free or not. Hence, we are dumping all
+			 * the per-cpu pages back into the buddy allocator. This
+			 * will ensure less failures when we try to isolate free
+			 * captured pages and hence more memory reporting to the
+			 * host.
+			 */
+			drain_local_pages(NULL);
 			scan_array();
+		}
 	}
 	local_irq_restore(flags);
 }
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [RFC][Patch v8 7/7] KVM: Adding tracepoints for guest page hinting
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
                   ` (5 preceding siblings ...)
  2019-02-04 20:18 ` [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages Nitesh Narayan Lal
@ 2019-02-04 20:18 ` Nitesh Narayan Lal
  2019-02-04 20:20 ` [RFC][QEMU PATCH] KVM: Support for guest free " Nitesh Narayan Lal
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-04 20:18 UTC (permalink / raw)
  To: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk, dhildenb,
	aarcange

This patch enables tracking of the pages freed by the guest and
of the pages isolated by the page hinting code, through kernel
tracepoints.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 include/trace/events/kmem.h | 40 +++++++++++++++++++++++++++++++++++++
 virt/kvm/page_hinting.c     | 10 ++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index eb57e3037deb..69f6da9ff939 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -315,6 +315,46 @@ TRACE_EVENT(mm_page_alloc_extfrag,
 		__entry->change_ownership)
 );
 
+TRACE_EVENT(guest_free_page,
+	    TP_PROTO(struct page *page, unsigned int order),
+
+	TP_ARGS(page, order),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(unsigned int, order)
+	),
+
+	TP_fast_assign(
+		__entry->pfn            = page_to_pfn(page);
+		__entry->order          = order;
+	),
+
+	TP_printk("page=%p pfn=%lu number of pages=%d",
+		  pfn_to_page(__entry->pfn),
+		  __entry->pfn,
+		  (1 << __entry->order))
+);
+
+TRACE_EVENT(guest_isolated_pfn,
+	    TP_PROTO(unsigned long pfn, unsigned int pages),
+
+	TP_ARGS(pfn, pages),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(unsigned int, pages)
+	),
+
+	TP_fast_assign(
+		__entry->pfn            = pfn;
+		__entry->pages          = pages;
+	),
+
+	TP_printk("pfn=%lu number of pages=%u",
+		  __entry->pfn,
+		  __entry->pages)
+);
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
index 315099fcda43..395d94e52c74 100644
--- a/virt/kvm/page_hinting.c
+++ b/virt/kvm/page_hinting.c
@@ -4,6 +4,7 @@
 #include <linux/kvm_host.h>
 #include <linux/sort.h>
 #include <linux/kernel.h>
+#include <trace/events/kmem.h>
 
 /*
  * struct kvm_free_pages - Tracks the pages which are freed by the guest.
@@ -140,7 +141,11 @@ static void hinting_fn(unsigned int cpu)
 					int l_idx = page_hinting_obj->hyp_idx;
 					struct hypervisor_pages *l_obj =
 					page_hinting_obj->hypervisor_pagelist;
+					unsigned int buddy_pages =
+						1 << buddy_order;
 
+					trace_guest_isolated_pfn(pfn,
+								 buddy_pages);
 					l_obj[l_idx].pfn = pfn;
 					l_obj[l_idx].order = buddy_order;
 					page_hinting_obj->hyp_idx += 1;
@@ -163,7 +168,11 @@ static void hinting_fn(unsigned int cpu)
 					page_hinting_obj->hypervisor_pagelist;
 					unsigned long buddy_pfn =
 						page_to_pfn(buddy_page);
+					unsigned int buddy_pages =
+						1 << buddy_order;
 
+					trace_guest_isolated_pfn(buddy_pfn,
+								 buddy_pages);
 					l_obj[l_idx].pfn = buddy_pfn;
 					l_obj[l_idx].order = buddy_order;
 					page_hinting_obj->hyp_idx += 1;
@@ -294,6 +303,7 @@ void guest_free_page(struct page *page, int order)
 	local_irq_save(flags);
 	if (page_hinting_obj->kvm_pt_idx != MAX_FGPT_ENTRIES) {
 		disable_page_poisoning();
+		trace_guest_free_page(page, order);
 		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].pfn =
 							page_to_pfn(page);
 		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].zonenum =
-- 
2.17.2



* [RFC][QEMU PATCH] KVM: Support for guest free page hinting
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
                   ` (6 preceding siblings ...)
  2019-02-04 20:18 ` [RFC][Patch v8 7/7] KVM: Adding tracepoints for guest page hinting Nitesh Narayan Lal
@ 2019-02-04 20:20 ` Nitesh Narayan Lal
  2019-02-12  9:03 ` [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Wang, Wei W
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-04 20:20 UTC (permalink / raw)
  To: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk, dhildenb,
	aarcange
  Cc: Nitesh Narayan Lal

This patch enables QEMU to receive free page addresses from the guest
and to use madvise to release the corresponding memory back to the host.
---
 hw/virtio/trace-events                        |  1 +
 hw/virtio/virtio-balloon.c                    | 82 +++++++++++++++++++
 hw/virtio/virtio.c                            | 25 ++++++
 include/hw/virtio/virtio-access.h             |  1 +
 include/hw/virtio/virtio-balloon.h            |  2 +-
 .../standard-headers/linux/virtio_balloon.h   |  1 +
 6 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 07bcbe9e85..e3ab66f126 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -46,3 +46,4 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
 virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
 virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
 virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
+virtio_balloon_hinting_request(unsigned long pfn, unsigned int num_pages) "Guest page hinting request: %lu num_pages: %d"
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index a12677d4d5..464d7d0d82 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -33,6 +33,13 @@
 
 #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
 
+struct guest_pages {
+	unsigned long pfn;
+	unsigned int order;
+};
+
+void page_hinting_request(uint64_t addr, uint32_t len);
+
 static void balloon_page(void *addr, int deflate)
 {
     if (!qemu_balloon_is_inhibited()) {
@@ -207,6 +214,77 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
     balloon_stats_change_timer(s, 0);
 }
 
+static void *gpa2hva(MemoryRegion **p_mr, hwaddr addr, Error **errp)
+{
+    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
+                                                 addr, 1);
+
+    if (!mrs.mr) {
+        error_setg(errp, "No memory is mapped at address 0x%" HWADDR_PRIx, addr);
+        return NULL;
+    }
+
+    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
+        error_setg(errp, "Memory at address 0x%" HWADDR_PRIx "is not RAM", addr);
+        memory_region_unref(mrs.mr);
+        return NULL;
+    }
+
+    *p_mr = mrs.mr;
+    return qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
+}
+
+void page_hinting_request(uint64_t addr, uint32_t len)
+{
+    Error *local_err = NULL;
+    MemoryRegion *mr = NULL;
+    void *hvaddr;
+    int ret = 0;
+    struct guest_pages *guest_obj;
+    int i = 0;
+    void *hvaddr_to_free;
+    unsigned long pfn, pfn_end;
+    uint64_t gpaddr_to_free;
+
+    hvaddr = gpa2hva(&mr, addr, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return;
+    }
+    guest_obj = hvaddr;
+
+    while (i < len) {
+        pfn = guest_obj[i].pfn;
+	pfn_end = guest_obj[i].pfn + (1 << guest_obj[i].order) - 1;
+	trace_virtio_balloon_hinting_request(pfn,(1 << guest_obj[i].order));
+	while (pfn <= pfn_end) {
+	        gpaddr_to_free = pfn << VIRTIO_BALLOON_PFN_SHIFT;
+	        hvaddr_to_free = gpa2hva(&mr, gpaddr_to_free, &local_err);
+	        if (local_err) {
+			error_report_err(local_err);
+		        return;
+		}
+		ret = qemu_madvise((void *)hvaddr_to_free, 4096, QEMU_MADV_DONTNEED);
+		if (ret == -1)
+		    printf("\n%d:%s Error: Madvise failed with error:%d\n", __LINE__, __func__, ret);
+		pfn++;
+	}
+	i++;
+    }
+}
+
+static void virtio_balloon_page_hinting(VirtIODevice *vdev, VirtQueue *vq)
+{
+    uint64_t addr;
+    uint32_t len;
+    VirtQueueElement elem = {};
+
+    pop_hinting_addr(vq, &addr, &len);
+    page_hinting_request(addr, len);
+    virtqueue_push(vq, &elem, 0);
+    virtio_notify(vdev, vq);
+}
+
 static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
@@ -376,6 +454,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
     VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
     f |= dev->host_features;
     virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
+    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
     return f;
 }
 
@@ -445,6 +524,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
+    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_page_hinting);
 
     reset_stats(s);
 }
@@ -488,6 +568,8 @@ static void virtio_balloon_instance_init(Object *obj)
 
     object_property_add(obj, "guest-stats", "guest statistics",
                         balloon_stats_get_all, NULL, NULL, s, NULL);
+    object_property_add(obj, "guest-page-hinting", "guest page hinting",
+                        NULL, NULL, NULL, s, NULL);
 
     object_property_add(obj, "guest-stats-polling-interval", "int",
                         balloon_stats_get_poll_interval,
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 22bd1ac34e..a0cdb232f0 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -847,6 +847,31 @@ static void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_nu
     return elem;
 }
 
+void pop_hinting_addr(VirtQueue *vq, uint64_t *addr, uint32_t *len)
+{
+   VRingMemoryRegionCaches *caches;
+   VRingDesc desc;
+   MemoryRegionCache *desc_cache;
+   VirtIODevice *vdev = vq->vdev;
+   unsigned int head, max;
+
+   max = vq->vring.num;
+   if (!virtqueue_get_head(vq, vq->last_avail_idx++, &head)) {
+   	virtio_error(vdev, "Unable to read head");
+  	return;
+   }
+
+   caches = vring_get_region_caches(vq);
+   if (caches->desc.len < max * sizeof(VRingDesc)) {
+       virtio_error(vdev, "Cannot map descriptor ring");
+       return;
+   }
+   desc_cache = &caches->desc;
+   vring_desc_read(vdev, &desc, desc_cache, head);
+   *addr = desc.addr;
+   *len = desc.len;
+}
+
 void *virtqueue_pop(VirtQueue *vq, size_t sz)
 {
     unsigned int i, head, max;
diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
index bdf58f3119..0a55f2626f 100644
--- a/include/hw/virtio/virtio-access.h
+++ b/include/hw/virtio/virtio-access.h
@@ -23,6 +23,7 @@
 #define LEGACY_VIRTIO_IS_BIENDIAN 1
 #endif
 
+void pop_hinting_addr(VirtQueue *vq, uint64_t *addr, uint32_t *len);
 static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
 {
 #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index e0df3528c8..774498a6ca 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -32,7 +32,7 @@ typedef struct virtio_balloon_stat_modern {
 
 typedef struct VirtIOBalloon {
     VirtIODevice parent_obj;
-    VirtQueue *ivq, *dvq, *svq;
+    VirtQueue *ivq, *dvq, *svq, *hvq;
     uint32_t num_pages;
     uint32_t actual;
     uint64_t stats[VIRTIO_BALLOON_S_NR];
diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
index 4dbb7dc6c0..f50c0d95ea 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.17.2



* Re: [RFC][Patch v8 1/7] KVM: Support for guest free page hinting
  2019-02-04 20:18 ` [RFC][Patch v8 1/7] KVM: Support for guest free page hinting Nitesh Narayan Lal
@ 2019-02-05  4:14   ` Michael S. Tsirkin
  2019-02-05 13:06     ` Nitesh Narayan Lal
  0 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-05  4:14 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange

On Mon, Feb 04, 2019 at 03:18:48PM -0500, Nitesh Narayan Lal wrote:
> This patch includes the following:
> 1. Basic skeleton for the support
> 2. Enablement of x86 platform to use the same
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  arch/x86/Kbuild              |  2 +-
>  arch/x86/kvm/Kconfig         |  8 ++++++++
>  arch/x86/kvm/Makefile        |  2 ++
>  include/linux/gfp.h          |  9 +++++++++
>  include/linux/page_hinting.h | 17 +++++++++++++++++
>  virt/kvm/page_hinting.c      | 36 ++++++++++++++++++++++++++++++++++++
>  6 files changed, 73 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/page_hinting.h
>  create mode 100644 virt/kvm/page_hinting.c
> 
> diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
> index c625f57472f7..3244df4ee311 100644
> --- a/arch/x86/Kbuild
> +++ b/arch/x86/Kbuild
> @@ -2,7 +2,7 @@ obj-y += entry/
>  
>  obj-$(CONFIG_PERF_EVENTS) += events/
>  
> -obj-$(CONFIG_KVM) += kvm/
> +obj-$(subst m,y,$(CONFIG_KVM)) += kvm/
>  
>  # Xen paravirtualization support
>  obj-$(CONFIG_XEN) += xen/
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 72fa955f4a15..2fae31459706 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -96,6 +96,14 @@ config KVM_MMU_AUDIT
>  	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
>  	 auditing of KVM MMU events at runtime.
>  
> +# KVM_FREE_PAGE_HINTING will allow the guest to report the free pages to the
> +# host in regular interval of time.
> +config KVM_FREE_PAGE_HINTING
> +       def_bool y
> +       depends on KVM
> +       select VIRTIO
> +       select VIRTIO_BALLOON
> +
>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>  # the virtualization menu.
>  source "drivers/vhost/Kconfig"
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 69b3a7c30013..78640a80501e 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -16,6 +16,8 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
>  			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
>  			   hyperv.o page_track.o debugfs.o
>  
> +obj-$(CONFIG_KVM_FREE_PAGE_HINTING)    += $(KVM)/page_hinting.o
> +
>  kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
>  kvm-amd-y		+= svm.o pmu_amd.o
>  
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 5f5e25fd6149..e596527284ba 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -7,6 +7,7 @@
>  #include <linux/stddef.h>
>  #include <linux/linkage.h>
>  #include <linux/topology.h>
> +#include <linux/page_hinting.h>
>  
>  struct vm_area_struct;
>  
> @@ -456,6 +457,14 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>  	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>  }
>  
> +#ifdef	CONFIG_KVM_FREE_PAGE_HINTING
> +#define HAVE_ARCH_FREE_PAGE
> +static inline void arch_free_page(struct page *page, int order)
> +{
> +	guest_free_page(page, order);
> +}
> +#endif
> +
>  #ifndef HAVE_ARCH_FREE_PAGE
>  static inline void arch_free_page(struct page *page, int order) { }
>  #endif

OK so arch_free_page hook is used to tie into mm code,
with follow-up patches the pages get queued in a list
and then sent to hypervisor so it can free them.
Fair enough but how do we know the page is
not reused by the time it's received by the hypervisor?
If it's reused then isn't it a problem that
hypervisor calls MADV_DONTNEED on them?


> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> new file mode 100644
> index 000000000000..b54f7428f348
> --- /dev/null
> +++ b/include/linux/page_hinting.h
> @@ -0,0 +1,17 @@
> +/*
> + * Size of the array which is used to store the freed pages is defined by
> + * MAX_FGPT_ENTRIES. If possible, we have to find a better way using which
> + * we can get rid of the hardcoded array size.
> + */
> +#define MAX_FGPT_ENTRIES	1000
> +/*
> + * hypervisor_pages - It is a dummy structure passed with the hypercall.
> + * @pfn: page frame number for the page which needs to be sent to the host.
> + * @order: order of the page needs to be reported to the host.
> + */
> +struct hypervisor_pages {
> +	unsigned long pfn;
> +	unsigned int order;
> +};
> +
> +void guest_free_page(struct page *page, int order);
> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
> new file mode 100644
> index 000000000000..818bd6b84e0c
> --- /dev/null
> +++ b/virt/kvm/page_hinting.c
> @@ -0,0 +1,36 @@
> +#include <linux/gfp.h>
> +#include <linux/mm.h>
> +#include <linux/kernel.h>
> +
> +/*
> + * struct kvm_free_pages - Tracks the pages which are freed by the guest.
> + * @pfn: page frame number for the page which is freed.
> + * @order: order corresponding to the page freed.
> + * @zonenum: zone number to which the freed page belongs.
> + */
> +struct kvm_free_pages {
> +	unsigned long pfn;
> +	unsigned int order;
> +	int zonenum;
> +};
> +
> +/*
> + * struct page_hinting - holds array objects for the structures used to track
> + * guest free pages, along with an index variable for each of them.
> + * @kvm_pt: array object for the structure kvm_free_pages.
> + * @kvm_pt_idx: index for kvm_free_pages object.
> + * @hypervisor_pagelist: array object for the structure hypervisor_pages.
> + * @hyp_idx: index for hypervisor_pages object.
> + */
> +struct page_hinting {
> +	struct kvm_free_pages kvm_pt[MAX_FGPT_ENTRIES];
> +	int kvm_pt_idx;
> +	struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
> +	int hyp_idx;
> +};
> +
> +DEFINE_PER_CPU(struct page_hinting, hinting_obj);
> +
> +void guest_free_page(struct page *page, int order)
> +{
> +}
> -- 
> 2.17.2


* Re: [RFC][Patch v8 1/7] KVM: Support for guest free page hinting
  2019-02-05  4:14   ` Michael S. Tsirkin
@ 2019-02-05 13:06     ` Nitesh Narayan Lal
  2019-02-05 16:27       ` Michael S. Tsirkin
  0 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-05 13:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange



On 2/4/19 11:14 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 04, 2019 at 03:18:48PM -0500, Nitesh Narayan Lal wrote:
>> This patch includes the following:
>> 1. Basic skeleton for the support
>> 2. Enablement of x86 platform to use the same
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  arch/x86/Kbuild              |  2 +-
>>  arch/x86/kvm/Kconfig         |  8 ++++++++
>>  arch/x86/kvm/Makefile        |  2 ++
>>  include/linux/gfp.h          |  9 +++++++++
>>  include/linux/page_hinting.h | 17 +++++++++++++++++
>>  virt/kvm/page_hinting.c      | 36 ++++++++++++++++++++++++++++++++++++
>>  6 files changed, 73 insertions(+), 1 deletion(-)
>>  create mode 100644 include/linux/page_hinting.h
>>  create mode 100644 virt/kvm/page_hinting.c
>>
>> diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
>> index c625f57472f7..3244df4ee311 100644
>> --- a/arch/x86/Kbuild
>> +++ b/arch/x86/Kbuild
>> @@ -2,7 +2,7 @@ obj-y += entry/
>>  
>>  obj-$(CONFIG_PERF_EVENTS) += events/
>>  
>> -obj-$(CONFIG_KVM) += kvm/
>> +obj-$(subst m,y,$(CONFIG_KVM)) += kvm/
>>  
>>  # Xen paravirtualization support
>>  obj-$(CONFIG_XEN) += xen/
>> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
>> index 72fa955f4a15..2fae31459706 100644
>> --- a/arch/x86/kvm/Kconfig
>> +++ b/arch/x86/kvm/Kconfig
>> @@ -96,6 +96,14 @@ config KVM_MMU_AUDIT
>>  	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
>>  	 auditing of KVM MMU events at runtime.
>>  
>> +# KVM_FREE_PAGE_HINTING will allow the guest to report the free pages to the
>> +# host in regular interval of time.
>> +config KVM_FREE_PAGE_HINTING
>> +       def_bool y
>> +       depends on KVM
>> +       select VIRTIO
>> +       select VIRTIO_BALLOON
>> +
>>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>>  # the virtualization menu.
>>  source "drivers/vhost/Kconfig"
>> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
>> index 69b3a7c30013..78640a80501e 100644
>> --- a/arch/x86/kvm/Makefile
>> +++ b/arch/x86/kvm/Makefile
>> @@ -16,6 +16,8 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
>>  			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
>>  			   hyperv.o page_track.o debugfs.o
>>  
>> +obj-$(CONFIG_KVM_FREE_PAGE_HINTING)    += $(KVM)/page_hinting.o
>> +
>>  kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
>>  kvm-amd-y		+= svm.o pmu_amd.o
>>  
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 5f5e25fd6149..e596527284ba 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -7,6 +7,7 @@
>>  #include <linux/stddef.h>
>>  #include <linux/linkage.h>
>>  #include <linux/topology.h>
>> +#include <linux/page_hinting.h>
>>  
>>  struct vm_area_struct;
>>  
>> @@ -456,6 +457,14 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>>  	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>>  }
>>  
>> +#ifdef	CONFIG_KVM_FREE_PAGE_HINTING
>> +#define HAVE_ARCH_FREE_PAGE
>> +static inline void arch_free_page(struct page *page, int order)
>> +{
>> +	guest_free_page(page, order);
>> +}
>> +#endif
>> +
>>  #ifndef HAVE_ARCH_FREE_PAGE
>>  static inline void arch_free_page(struct page *page, int order) { }
>>  #endif
> OK so arch_free_page hook is used to tie into mm code,
> with follow-up patches the pages get queued in a list
> and then sent to hypervisor so it can free them.
> Fair enough but how do we know the page is
> not reused by the time it's received by the hypervisor?
> If it's reused then isn't it a problem that
> hypervisor calls MADV_DONTNEED on them?
Hi Michael,

In order to ensure that the page is not reused, we remove it from the
buddy free list while holding the zone lock. After the page has been
freed by the hypervisor, it is returned to the buddy free list.
>
>
>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>> new file mode 100644
>> index 000000000000..b54f7428f348
>> --- /dev/null
>> +++ b/include/linux/page_hinting.h
>> @@ -0,0 +1,17 @@
>> +/*
>> + * Size of the array which is used to store the freed pages is defined by
>> + * MAX_FGPT_ENTRIES. If possible, we have to find a better way using which
>> + * we can get rid of the hardcoded array size.
>> + */
>> +#define MAX_FGPT_ENTRIES	1000
>> +/*
>> + * hypervisor_pages - It is a dummy structure passed with the hypercall.
>> + * @pfn: page frame number for the page which needs to be sent to the host.
>> + * @order: order of the page needs to be reported to the host.
>> + */
>> +struct hypervisor_pages {
>> +	unsigned long pfn;
>> +	unsigned int order;
>> +};
>> +
>> +void guest_free_page(struct page *page, int order);
>> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
>> new file mode 100644
>> index 000000000000..818bd6b84e0c
>> --- /dev/null
>> +++ b/virt/kvm/page_hinting.c
>> @@ -0,0 +1,36 @@
>> +#include <linux/gfp.h>
>> +#include <linux/mm.h>
>> +#include <linux/kernel.h>
>> +
>> +/*
>> + * struct kvm_free_pages - Tracks the pages which are freed by the guest.
>> + * @pfn: page frame number for the page which is freed.
>> + * @order: order corresponding to the page freed.
>> + * @zonenum: zone number to which the freed page belongs.
>> + */
>> +struct kvm_free_pages {
>> +	unsigned long pfn;
>> +	unsigned int order;
>> +	int zonenum;
>> +};
>> +
>> +/*
>> + * struct page_hinting - holds array objects for the structures used to track
>> + * guest free pages, along with an index variable for each of them.
>> + * @kvm_pt: array object for the structure kvm_free_pages.
>> + * @kvm_pt_idx: index for kvm_free_pages object.
>> + * @hypervisor_pagelist: array object for the structure hypervisor_pages.
>> + * @hyp_idx: index for hypervisor_pages object.
>> + */
>> +struct page_hinting {
>> +	struct kvm_free_pages kvm_pt[MAX_FGPT_ENTRIES];
>> +	int kvm_pt_idx;
>> +	struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
>> +	int hyp_idx;
>> +};
>> +
>> +DEFINE_PER_CPU(struct page_hinting, hinting_obj);
>> +
>> +void guest_free_page(struct page *page, int order)
>> +{
>> +}
>> -- 
>> 2.17.2
-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 1/7] KVM: Support for guest free page hinting
  2019-02-05 13:06     ` Nitesh Narayan Lal
@ 2019-02-05 16:27       ` Michael S. Tsirkin
  2019-02-05 16:34         ` Nitesh Narayan Lal
  0 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-05 16:27 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange

On Tue, Feb 05, 2019 at 08:06:33AM -0500, Nitesh Narayan Lal wrote:
> On 2/4/19 11:14 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 03:18:48PM -0500, Nitesh Narayan Lal wrote:
> >> This patch includes the following:
> >> 1. Basic skeleton for the support
> >> 2. Enablement of x86 platform to use the same
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >> ---
> >>  arch/x86/Kbuild              |  2 +-
> >>  arch/x86/kvm/Kconfig         |  8 ++++++++
> >>  arch/x86/kvm/Makefile        |  2 ++
> >>  include/linux/gfp.h          |  9 +++++++++
> >>  include/linux/page_hinting.h | 17 +++++++++++++++++
> >>  virt/kvm/page_hinting.c      | 36 ++++++++++++++++++++++++++++++++++++
> >>  6 files changed, 73 insertions(+), 1 deletion(-)
> >>  create mode 100644 include/linux/page_hinting.h
> >>  create mode 100644 virt/kvm/page_hinting.c
> >>
> >> diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
> >> index c625f57472f7..3244df4ee311 100644
> >> --- a/arch/x86/Kbuild
> >> +++ b/arch/x86/Kbuild
> >> @@ -2,7 +2,7 @@ obj-y += entry/
> >>  
> >>  obj-$(CONFIG_PERF_EVENTS) += events/
> >>  
> >> -obj-$(CONFIG_KVM) += kvm/
> >> +obj-$(subst m,y,$(CONFIG_KVM)) += kvm/
> >>  
> >>  # Xen paravirtualization support
> >>  obj-$(CONFIG_XEN) += xen/
> >> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> >> index 72fa955f4a15..2fae31459706 100644
> >> --- a/arch/x86/kvm/Kconfig
> >> +++ b/arch/x86/kvm/Kconfig
> >> @@ -96,6 +96,14 @@ config KVM_MMU_AUDIT
> >>  	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
> >>  	 auditing of KVM MMU events at runtime.
> >>  
> >> +# KVM_FREE_PAGE_HINTING will allow the guest to report the free pages to the
> >> +# host in regular interval of time.
> >> +config KVM_FREE_PAGE_HINTING
> >> +       def_bool y
> >> +       depends on KVM
> >> +       select VIRTIO
> >> +       select VIRTIO_BALLOON
> >> +
> >>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
> >>  # the virtualization menu.
> >>  source "drivers/vhost/Kconfig"
> >> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> >> index 69b3a7c30013..78640a80501e 100644
> >> --- a/arch/x86/kvm/Makefile
> >> +++ b/arch/x86/kvm/Makefile
> >> @@ -16,6 +16,8 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
> >>  			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
> >>  			   hyperv.o page_track.o debugfs.o
> >>  
> >> +obj-$(CONFIG_KVM_FREE_PAGE_HINTING)    += $(KVM)/page_hinting.o
> >> +
> >>  kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
> >>  kvm-amd-y		+= svm.o pmu_amd.o
> >>  
> >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >> index 5f5e25fd6149..e596527284ba 100644
> >> --- a/include/linux/gfp.h
> >> +++ b/include/linux/gfp.h
> >> @@ -7,6 +7,7 @@
> >>  #include <linux/stddef.h>
> >>  #include <linux/linkage.h>
> >>  #include <linux/topology.h>
> >> +#include <linux/page_hinting.h>
> >>  
> >>  struct vm_area_struct;
> >>  
> >> @@ -456,6 +457,14 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> >>  	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> >>  }
> >>  
> >> +#ifdef	CONFIG_KVM_FREE_PAGE_HINTING
> >> +#define HAVE_ARCH_FREE_PAGE
> >> +static inline void arch_free_page(struct page *page, int order)
> >> +{
> >> +	guest_free_page(page, order);
> >> +}
> >> +#endif
> >> +
> >>  #ifndef HAVE_ARCH_FREE_PAGE
> >>  static inline void arch_free_page(struct page *page, int order) { }
> >>  #endif
> > OK so arch_free_page hook is used to tie into mm code,
> > with follow-up patches the pages get queued in a list
> > and then sent to hypervisor so it can free them.
> > Fair enough but how do we know the page is
> > not reused by the time it's received by the hypervisor?
> > If it's reused then isn't it a problem that
> > hypervisor calls MADV_DONTNEED on them?
> Hi Michael,
> 
> In order to ensure that the page is not reused, we remove it from the
> buddy free list by acquiring the zone lock. After the page is freed by
> the hypervisor it is returned to the buddy free list again.

Thanks that's good to know. Could you point me to code that does this?

> >
> >
> >> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> >> new file mode 100644
> >> index 000000000000..b54f7428f348
> >> --- /dev/null
> >> +++ b/include/linux/page_hinting.h
> >> @@ -0,0 +1,17 @@
> >> +/*
> >> + * Size of the array which is used to store the freed pages is defined by
> >> + * MAX_FGPT_ENTRIES. If possible, we have to find a better way using which
> >> + * we can get rid of the hardcoded array size.
> >> + */
> >> +#define MAX_FGPT_ENTRIES	1000
> >> +/*
> >> + * hypervisor_pages - It is a dummy structure passed with the hypercall.
> >> + * @pfn: page frame number for the page which needs to be sent to the host.
> >> + * @order: order of the page needs to be reported to the host.
> >> + */
> >> +struct hypervisor_pages {
> >> +	unsigned long pfn;
> >> +	unsigned int order;
> >> +};
> >> +
> >> +void guest_free_page(struct page *page, int order);
> >> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
> >> new file mode 100644
> >> index 000000000000..818bd6b84e0c
> >> --- /dev/null
> >> +++ b/virt/kvm/page_hinting.c
> >> @@ -0,0 +1,36 @@
> >> +#include <linux/gfp.h>
> >> +#include <linux/mm.h>
> >> +#include <linux/kernel.h>
> >> +
> >> +/*
> >> + * struct kvm_free_pages - Tracks the pages which are freed by the guest.
> >> + * @pfn: page frame number for the page which is freed.
> >> + * @order: order corresponding to the page freed.
> >> + * @zonenum: zone number to which the freed page belongs.
> >> + */
> >> +struct kvm_free_pages {
> >> +	unsigned long pfn;
> >> +	unsigned int order;
> >> +	int zonenum;
> >> +};
> >> +
> >> +/*
> >> + * struct page_hinting - holds array objects for the structures used to track
> >> + * guest free pages, along with an index variable for each of them.
> >> + * @kvm_pt: array object for the structure kvm_free_pages.
> >> + * @kvm_pt_idx: index for kvm_free_pages object.
> >> + * @hypervisor_pagelist: array object for the structure hypervisor_pages.
> >> + * @hyp_idx: index for hypervisor_pages object.
> >> + */
> >> +struct page_hinting {
> >> +	struct kvm_free_pages kvm_pt[MAX_FGPT_ENTRIES];
> >> +	int kvm_pt_idx;
> >> +	struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
> >> +	int hyp_idx;
> >> +};
> >> +
> >> +DEFINE_PER_CPU(struct page_hinting, hinting_obj);
> >> +
> >> +void guest_free_page(struct page *page, int order)
> >> +{
> >> +}
> >> -- 
> >> 2.17.2
> -- 
> Regards
> Nitesh
> 





* Re: [RFC][Patch v8 1/7] KVM: Support for guest free page hinting
  2019-02-05 16:27       ` Michael S. Tsirkin
@ 2019-02-05 16:34         ` Nitesh Narayan Lal
  0 siblings, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-05 16:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange



On 2/5/19 11:27 AM, Michael S. Tsirkin wrote:
> On Tue, Feb 05, 2019 at 08:06:33AM -0500, Nitesh Narayan Lal wrote:
>> On 2/4/19 11:14 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 04, 2019 at 03:18:48PM -0500, Nitesh Narayan Lal wrote:
>>>> This patch includes the following:
>>>> 1. Basic skeleton for the support
>>>> 2. Enablement of x86 platform to use the same
>>>>
>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>>> ---
>>>>  arch/x86/Kbuild              |  2 +-
>>>>  arch/x86/kvm/Kconfig         |  8 ++++++++
>>>>  arch/x86/kvm/Makefile        |  2 ++
>>>>  include/linux/gfp.h          |  9 +++++++++
>>>>  include/linux/page_hinting.h | 17 +++++++++++++++++
>>>>  virt/kvm/page_hinting.c      | 36 ++++++++++++++++++++++++++++++++++++
>>>>  6 files changed, 73 insertions(+), 1 deletion(-)
>>>>  create mode 100644 include/linux/page_hinting.h
>>>>  create mode 100644 virt/kvm/page_hinting.c
>>>>
>>>> diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
>>>> index c625f57472f7..3244df4ee311 100644
>>>> --- a/arch/x86/Kbuild
>>>> +++ b/arch/x86/Kbuild
>>>> @@ -2,7 +2,7 @@ obj-y += entry/
>>>>  
>>>>  obj-$(CONFIG_PERF_EVENTS) += events/
>>>>  
>>>> -obj-$(CONFIG_KVM) += kvm/
>>>> +obj-$(subst m,y,$(CONFIG_KVM)) += kvm/
>>>>  
>>>>  # Xen paravirtualization support
>>>>  obj-$(CONFIG_XEN) += xen/
>>>> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
>>>> index 72fa955f4a15..2fae31459706 100644
>>>> --- a/arch/x86/kvm/Kconfig
>>>> +++ b/arch/x86/kvm/Kconfig
>>>> @@ -96,6 +96,14 @@ config KVM_MMU_AUDIT
>>>>  	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
>>>>  	 auditing of KVM MMU events at runtime.
>>>>  
>>>> +# KVM_FREE_PAGE_HINTING will allow the guest to report the free pages to the
>>>> +# host in regular interval of time.
>>>> +config KVM_FREE_PAGE_HINTING
>>>> +       def_bool y
>>>> +       depends on KVM
>>>> +       select VIRTIO
>>>> +       select VIRTIO_BALLOON
>>>> +
>>>>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>>>>  # the virtualization menu.
>>>>  source "drivers/vhost/Kconfig"
>>>> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
>>>> index 69b3a7c30013..78640a80501e 100644
>>>> --- a/arch/x86/kvm/Makefile
>>>> +++ b/arch/x86/kvm/Makefile
>>>> @@ -16,6 +16,8 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
>>>>  			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
>>>>  			   hyperv.o page_track.o debugfs.o
>>>>  
>>>> +obj-$(CONFIG_KVM_FREE_PAGE_HINTING)    += $(KVM)/page_hinting.o
>>>> +
>>>>  kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
>>>>  kvm-amd-y		+= svm.o pmu_amd.o
>>>>  
>>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>>> index 5f5e25fd6149..e596527284ba 100644
>>>> --- a/include/linux/gfp.h
>>>> +++ b/include/linux/gfp.h
>>>> @@ -7,6 +7,7 @@
>>>>  #include <linux/stddef.h>
>>>>  #include <linux/linkage.h>
>>>>  #include <linux/topology.h>
>>>> +#include <linux/page_hinting.h>
>>>>  
>>>>  struct vm_area_struct;
>>>>  
>>>> @@ -456,6 +457,14 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>>>>  	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>>>>  }
>>>>  
>>>> +#ifdef	CONFIG_KVM_FREE_PAGE_HINTING
>>>> +#define HAVE_ARCH_FREE_PAGE
>>>> +static inline void arch_free_page(struct page *page, int order)
>>>> +{
>>>> +	guest_free_page(page, order);
>>>> +}
>>>> +#endif
>>>> +
>>>>  #ifndef HAVE_ARCH_FREE_PAGE
>>>>  static inline void arch_free_page(struct page *page, int order) { }
>>>>  #endif
>>> OK so arch_free_page hook is used to tie into mm code,
>>> with follow-up patches the pages get queued in a list
>>> and then sent to hypervisor so it can free them.
>>> Fair enough but how do we know the page is
>>> not reused by the time it's received by the hypervisor?
>>> If it's reused then isn't it a problem that
>>> hypervisor calls MADV_DONTNEED on them?
>> Hi Michael,
>>
>> In order to ensure that the page is not reused, we remove it from the
>> buddy free list by acquiring the zone lock. After the page is freed by
>> the hypervisor it is returned to the buddy free list again.
> Thanks that's good to know. Could you point me to code that does this?
In patch 6 ("KVM: Enables the kernel to isolate and report free pages"),
hinting_fn() is responsible for scanning the per-CPU array, acquiring
the zone lock, isolating the pages, and invoking hyperlist_ready().
hyperlist_ready() makes the hypercall that reports the free pages, and
only once that hypercall completes does it return the reported pages to
the buddy free list.
>
>>>
>>>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>>>> new file mode 100644
>>>> index 000000000000..b54f7428f348
>>>> --- /dev/null
>>>> +++ b/include/linux/page_hinting.h
>>>> @@ -0,0 +1,17 @@
>>>> +/*
>>>> + * Size of the array which is used to store the freed pages is defined by
>>>> + * MAX_FGPT_ENTRIES. If possible, we have to find a better way using which
>>>> + * we can get rid of the hardcoded array size.
>>>> + */
>>>> +#define MAX_FGPT_ENTRIES	1000
>>>> +/*
>>>> + * hypervisor_pages - It is a dummy structure passed with the hypercall.
>>>> + * @pfn: page frame number for the page which needs to be sent to the host.
>>>> + * @order: order of the page needs to be reported to the host.
>>>> + */
>>>> +struct hypervisor_pages {
>>>> +	unsigned long pfn;
>>>> +	unsigned int order;
>>>> +};
>>>> +
>>>> +void guest_free_page(struct page *page, int order);
>>>> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
>>>> new file mode 100644
>>>> index 000000000000..818bd6b84e0c
>>>> --- /dev/null
>>>> +++ b/virt/kvm/page_hinting.c
>>>> @@ -0,0 +1,36 @@
>>>> +#include <linux/gfp.h>
>>>> +#include <linux/mm.h>
>>>> +#include <linux/kernel.h>
>>>> +
>>>> +/*
>>>> + * struct kvm_free_pages - Tracks the pages which are freed by the guest.
>>>> + * @pfn: page frame number for the page which is freed.
>>>> + * @order: order corresponding to the page freed.
>>>> + * @zonenum: zone number to which the freed page belongs.
>>>> + */
>>>> +struct kvm_free_pages {
>>>> +	unsigned long pfn;
>>>> +	unsigned int order;
>>>> +	int zonenum;
>>>> +};
>>>> +
>>>> +/*
>>>> + * struct page_hinting - holds array objects for the structures used to track
>>>> + * guest free pages, along with an index variable for each of them.
>>>> + * @kvm_pt: array object for the structure kvm_free_pages.
>>>> + * @kvm_pt_idx: index for kvm_free_pages object.
>>>> + * @hypervisor_pagelist: array object for the structure hypervisor_pages.
>>>> + * @hyp_idx: index for hypervisor_pages object.
>>>> + */
>>>> +struct page_hinting {
>>>> +	struct kvm_free_pages kvm_pt[MAX_FGPT_ENTRIES];
>>>> +	int kvm_pt_idx;
>>>> +	struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
>>>> +	int hyp_idx;
>>>> +};
>>>> +
>>>> +DEFINE_PER_CPU(struct page_hinting, hinting_obj);
>>>> +
>>>> +void guest_free_page(struct page *page, int order)
>>>> +{
>>>> +}
>>>> -- 
>>>> 2.17.2
>> -- 
>> Regards
>> Nitesh
>>
>
>
-- 
Regards
Nitesh




* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-04 20:18 ` [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages Nitesh Narayan Lal
@ 2019-02-05 20:45   ` Michael S. Tsirkin
  2019-02-05 21:54     ` Nitesh Narayan Lal
  0 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-05 20:45 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange

On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
> This patch enables the kernel to scan the per cpu array and
> compress it by removing the repetitive/re-allocated pages.
> Once the per cpu array is completely filled with pages in the
> buddy it wakes up the kernel per cpu thread which re-scans the
> entire per cpu array by acquiring a zone lock corresponding to
> the page which is being scanned. If the page is still free and
> present in the buddy it tries to isolate the page and adds it
> to another per cpu array.
> 
> Once this scanning process is complete and if there are any
> isolated pages added to the new per cpu array kernel thread
> invokes hyperlist_ready().
> 
> In hyperlist_ready() a hypercall is made to report these pages to
> the host using the virtio-balloon framework. In order to do so
> another virtqueue 'hinting_vq' is added to the balloon framework.
> As the host frees all the reported pages, the kernel thread returns
> them back to the buddy.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>


This looks kind of like what early iterations of Wei's patches did.

But this has lots of issues; for example, you might end up with
a hypercall per 4K page.

So in the end, he switched over to just reporting only
MAX_ORDER - 1 pages.

Would that be a good idea for you too?

An alternative would be a different much lighter weight
way to report these pages and to free them on the host.

> ---
>  drivers/virtio/virtio_balloon.c     |  56 +++++++-
>  include/linux/page_hinting.h        |  18 ++-
>  include/uapi/linux/virtio_balloon.h |   1 +
>  mm/page_alloc.c                     |   2 +-
>  virt/kvm/page_hinting.c             | 202 +++++++++++++++++++++++++++-
>  5 files changed, 269 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 728ecd1eea30..8af34e0b9a32 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -57,13 +57,15 @@ enum virtio_balloon_vq {
>  	VIRTIO_BALLOON_VQ_INFLATE,
>  	VIRTIO_BALLOON_VQ_DEFLATE,
>  	VIRTIO_BALLOON_VQ_STATS,
> +	VIRTIO_BALLOON_VQ_HINTING,
>  	VIRTIO_BALLOON_VQ_FREE_PAGE,
>  	VIRTIO_BALLOON_VQ_MAX
>  };
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
> +								*hinting_vq;
>  
>  	/* Balloon's own wq for cpu-intensive work items */
>  	struct workqueue_struct *balloon_wq;
> @@ -122,6 +124,40 @@ static struct virtio_device_id id_table[] = {
>  	{ 0 },
>  };
>  
> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING
> +void virtballoon_page_hinting(struct virtio_balloon *vb, u64 gvaddr,
> +			      int hyper_entries)
> +{
> +	u64 gpaddr = virt_to_phys((void *)gvaddr);
> +
> +	virtqueue_add_desc(vb->hinting_vq, gpaddr, hyper_entries, 0);
> +	virtqueue_kick_sync(vb->hinting_vq);
> +}
> +
> +static void hinting_ack(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	wake_up(&vb->acked);
> +}
> +
> +static void enable_hinting(struct virtio_balloon *vb)
> +{
> +	guest_page_hinting_flag = 1;
> +	static_branch_enable(&guest_page_hinting_key);
> +	request_hypercall = (void *)&virtballoon_page_hinting;
> +	balloon_ptr = vb;
> +	WARN_ON(smpboot_register_percpu_thread(&hinting_threads));
> +}
> +
> +static void disable_hinting(void)
> +{
> +	guest_page_hinting_flag = 0;
> +	static_branch_enable(&guest_page_hinting_key);
> +	balloon_ptr = NULL;
> +}
> +#endif
> +
>  static u32 page_to_balloon_pfn(struct page *page)
>  {
>  	unsigned long pfn = page_to_pfn(page);
> @@ -481,6 +517,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>  	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>  	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> +	names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>  
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>  		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> @@ -492,11 +529,18 @@ static int init_vqs(struct virtio_balloon *vb)
>  		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>  	}
>  
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> +		names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
> +		callbacks[VIRTIO_BALLOON_VQ_HINTING] = hinting_ack;
> +	}
>  	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>  					 vqs, callbacks, names, NULL, NULL);
>  	if (err)
>  		return err;
>  
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> +		vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
> +
>  	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>  	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> @@ -908,6 +952,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		if (err)
>  			goto out_del_balloon_wq;
>  	}
> +
> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> +		enable_hinting(vb);
> +#endif
>  	virtio_device_ready(vdev);
>  
>  	if (towards_target(vb))
> @@ -950,6 +999,10 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
>  
> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> +		disable_hinting();
> +#endif
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>  		cancel_work_sync(&vb->report_free_page_work);
>  		destroy_workqueue(vb->balloon_wq);
> @@ -1009,6 +1062,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_HINTING,
>  	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
>  	VIRTIO_BALLOON_F_PAGE_POISON,
>  };
> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> index e800c6b07561..3ba8c1f3b4a4 100644
> --- a/include/linux/page_hinting.h
> +++ b/include/linux/page_hinting.h
> @@ -1,15 +1,12 @@
>  #include <linux/smpboot.h>
>  
> -/*
> - * Size of the array which is used to store the freed pages is defined by
> - * MAX_FGPT_ENTRIES. If possible, we have to find a better way using which
> - * we can get rid of the hardcoded array size.
> - */
>  #define MAX_FGPT_ENTRIES	1000
>  /*
>   * hypervisor_pages - It is a dummy structure passed with the hypercall.
> - * @pfn: page frame number for the page which needs to be sent to the host.
> - * @order: order of the page needs to be reported to the host.
> + * @pfn - page frame number for the page which is to be freed.
> + * @pages - number of pages which are supposed to be freed.
> + * A global array object is used to to hold the list of pfn and pages and is
> + * passed as part of the hypercall.
>   */
>  struct hypervisor_pages {
>  	unsigned long pfn;
> @@ -19,11 +16,18 @@ struct hypervisor_pages {
>  extern int guest_page_hinting_flag;
>  extern struct static_key_false guest_page_hinting_key;
>  extern struct smp_hotplug_thread hinting_threads;
> +extern void (*request_hypercall)(void *, u64, int);
> +extern void *balloon_ptr;
>  extern bool want_page_poisoning;
>  
>  int guest_page_hinting_sysctl(struct ctl_table *table, int write,
>  			      void __user *buffer, size_t *lenp, loff_t *ppos);
>  void guest_free_page(struct page *page, int order);
> +extern int __isolate_free_page(struct page *page, unsigned int order);
> +extern void free_one_page(struct zone *zone,
> +			  struct page *page, unsigned long pfn,
> +			  unsigned int order,
> +			  int migratetype);
>  
>  static inline void disable_page_poisoning(void)
>  {

I guess you will want to put this in some other header.  Function
declarations belong close to where they are implemented, not used.

> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index a1966cd7b677..2b0f62814e22 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -36,6 +36,7 @@
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
>  #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
> +#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d295c9bc01a8..93224cba9243 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1199,7 +1199,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	spin_unlock(&zone->lock);
>  }
>  
> -static void free_one_page(struct zone *zone,
> +void free_one_page(struct zone *zone,
>  				struct page *page, unsigned long pfn,
>  				unsigned int order,
>  				int migratetype)
> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
> index be529f6f2bc0..315099fcda43 100644
> --- a/virt/kvm/page_hinting.c
> +++ b/virt/kvm/page_hinting.c
> @@ -1,6 +1,8 @@
>  #include <linux/gfp.h>
>  #include <linux/mm.h>
> +#include <linux/page_ref.h>
>  #include <linux/kvm_host.h>
> +#include <linux/sort.h>
>  #include <linux/kernel.h>
>  
>  /*
> @@ -39,6 +41,11 @@ int guest_page_hinting_flag;
>  EXPORT_SYMBOL(guest_page_hinting_flag);
>  static DEFINE_PER_CPU(struct task_struct *, hinting_task);
>  
> +void (*request_hypercall)(void *, u64, int);
> +EXPORT_SYMBOL(request_hypercall);
> +void *balloon_ptr;
> +EXPORT_SYMBOL(balloon_ptr);
> +
>  int guest_page_hinting_sysctl(struct ctl_table *table, int write,
>  			      void __user *buffer, size_t *lenp,
>  			      loff_t *ppos)
> @@ -55,18 +62,201 @@ int guest_page_hinting_sysctl(struct ctl_table *table, int write,
>  	return ret;
>  }
>  
> +void hyperlist_ready(struct hypervisor_pages *guest_isolated_pages, int entries)
> +{
> +	int i = 0;
> +	int mt = 0;
> +
> +	if (balloon_ptr)
> +		request_hypercall(balloon_ptr, (u64)&guest_isolated_pages[0],
> +				  entries);
> +
> +	while (i < entries) {
> +		struct page *page = pfn_to_page(guest_isolated_pages[i].pfn);
> +
> +		mt = get_pageblock_migratetype(page);
> +		free_one_page(page_zone(page), page, page_to_pfn(page),
> +			      guest_isolated_pages[i].order, mt);
> +		i++;
> +	}
> +}
> +
> +struct page *get_buddy_page(struct page *page)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +	unsigned int order;
> +
> +	for (order = 0; order < MAX_ORDER; order++) {
> +		struct page *page_head = page - (pfn & ((1 << order) - 1));
> +
> +		if (PageBuddy(page_head) && page_private(page_head) >= order)
> +			return page_head;
> +	}
> +	return NULL;
> +}
> +
>  static void hinting_fn(unsigned int cpu)
>  {
>  	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
> +	int idx = 0, ret = 0;
> +	struct zone *zone_cur;
> +	unsigned long flags = 0;
> +
> +	while (idx < MAX_FGPT_ENTRIES) {
> +		unsigned long pfn = page_hinting_obj->kvm_pt[idx].pfn;
> +		unsigned long pfn_end = page_hinting_obj->kvm_pt[idx].pfn +
> +			(1 << page_hinting_obj->kvm_pt[idx].order) - 1;
> +
> +		while (pfn <= pfn_end) {
> +			struct page *page = pfn_to_page(pfn);
> +			struct page *buddy_page = NULL;
> +
> +			zone_cur = page_zone(page);
> +			spin_lock_irqsave(&zone_cur->lock, flags);
> +
> +			if (PageCompound(page)) {
> +				struct page *head_page = compound_head(page);
> +				unsigned long head_pfn = page_to_pfn(head_page);
> +				unsigned int alloc_pages =
> +					1 << compound_order(head_page);
> +
> +				pfn = head_pfn + alloc_pages;
> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
> +				continue;
> +			}
> +
> +			if (page_ref_count(page)) {
> +				pfn++;
> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
> +				continue;
> +			}
> +
> +			if (PageBuddy(page)) {
> +				int buddy_order = page_private(page);
>  
> +				ret = __isolate_free_page(page, buddy_order);
> +				if (!ret) {
> +				} else {
> +					int l_idx = page_hinting_obj->hyp_idx;
> +					struct hypervisor_pages *l_obj =
> +					page_hinting_obj->hypervisor_pagelist;
> +
> +					l_obj[l_idx].pfn = pfn;
> +					l_obj[l_idx].order = buddy_order;
> +					page_hinting_obj->hyp_idx += 1;
> +				}
> +				pfn = pfn + (1 << buddy_order);
> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
> +				continue;
> +			}
> +
> +			buddy_page = get_buddy_page(page);
> +			if (buddy_page) {
> +				int buddy_order = page_private(buddy_page);
> +
> +				ret = __isolate_free_page(buddy_page,
> +							  buddy_order);
> +				if (!ret) {
> +				} else {
> +					int l_idx = page_hinting_obj->hyp_idx;
> +					struct hypervisor_pages *l_obj =
> +					page_hinting_obj->hypervisor_pagelist;
> +					unsigned long buddy_pfn =
> +						page_to_pfn(buddy_page);
> +
> +					l_obj[l_idx].pfn = buddy_pfn;
> +					l_obj[l_idx].order = buddy_order;
> +					page_hinting_obj->hyp_idx += 1;
> +				}
> +				pfn = page_to_pfn(buddy_page) +
> +					(1 << buddy_order);
> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
> +				continue;
> +			}
> +			spin_unlock_irqrestore(&zone_cur->lock, flags);
> +			pfn++;
> +		}
> +		page_hinting_obj->kvm_pt[idx].pfn = 0;
> +		page_hinting_obj->kvm_pt[idx].order = -1;
> +		page_hinting_obj->kvm_pt[idx].zonenum = -1;
> +		idx++;
> +	}
> +	if (page_hinting_obj->hyp_idx > 0) {
> +		hyperlist_ready(page_hinting_obj->hypervisor_pagelist,
> +				page_hinting_obj->hyp_idx);
> +		page_hinting_obj->hyp_idx = 0;
> +	}
>  	page_hinting_obj->kvm_pt_idx = 0;
>  	put_cpu_var(hinting_obj);
>  }
>  
> +int if_exist(struct page *page)
> +{
> +	int i = 0;
> +	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
> +
> +	while (i < MAX_FGPT_ENTRIES) {
> +		if (page_to_pfn(page) == page_hinting_obj->kvm_pt[i].pfn)
> +			return 1;
> +		i++;
> +	}
> +	return 0;
> +}
> +
> +void pack_array(void)
> +{
> +	int i = 0, j = 0;
> +	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
> +
> +	while (i < MAX_FGPT_ENTRIES) {
> +		if (page_hinting_obj->kvm_pt[i].pfn != 0) {
> +			if (i != j) {
> +				page_hinting_obj->kvm_pt[j].pfn =
> +					page_hinting_obj->kvm_pt[i].pfn;
> +				page_hinting_obj->kvm_pt[j].order =
> +					page_hinting_obj->kvm_pt[i].order;
> +				page_hinting_obj->kvm_pt[j].zonenum =
> +					page_hinting_obj->kvm_pt[i].zonenum;
> +			}
> +			j++;
> +		}
> +		i++;
> +	}
> +	i = j;
> +	page_hinting_obj->kvm_pt_idx = j;
> +	while (j < MAX_FGPT_ENTRIES) {
> +		page_hinting_obj->kvm_pt[j].pfn = 0;
> +		page_hinting_obj->kvm_pt[j].order = -1;
> +		page_hinting_obj->kvm_pt[j].zonenum = -1;
> +		j++;
> +	}
> +}
> +
>  void scan_array(void)
>  {
>  	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
> +	int i = 0;
>  
> +	while (i < MAX_FGPT_ENTRIES) {
> +		struct page *page =
> +			pfn_to_page(page_hinting_obj->kvm_pt[i].pfn);
> +		struct page *buddy_page = get_buddy_page(page);
> +
> +		if (!PageBuddy(page) && buddy_page) {
> +			if (if_exist(buddy_page)) {
> +				page_hinting_obj->kvm_pt[i].pfn = 0;
> +				page_hinting_obj->kvm_pt[i].order = -1;
> +				page_hinting_obj->kvm_pt[i].zonenum = -1;
> +			} else {
> +				page_hinting_obj->kvm_pt[i].pfn =
> +					page_to_pfn(buddy_page);
> +				page_hinting_obj->kvm_pt[i].order =
> +					page_private(buddy_page);
> +			}
> +		}
> +		i++;
> +	}
> +	pack_array();
>  	if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES)
>  		wake_up_process(__this_cpu_read(hinting_task));
>  }
> @@ -111,8 +301,18 @@ void guest_free_page(struct page *page, int order)
>  		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].order =
>  							order;
>  		page_hinting_obj->kvm_pt_idx += 1;
> -		if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES)
> +		if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES) {
> +			/*
> +			 * We depend on the buddy free list to identify
> +			 * whether a page is free. Hence, drain all the
> +			 * per-cpu pages back into the buddy allocator.
> +			 * This results in fewer failures when we try to
> +			 * isolate captured free pages, and hence in more
> +			 * memory being reported to the host.
> +			 */
> +			drain_local_pages(NULL);
>  			scan_array();
> +		}
>  	}
>  	local_irq_restore(flags);
>  }
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host
  2019-02-04 20:18 ` [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host Nitesh Narayan Lal
@ 2019-02-05 20:49   ` Michael S. Tsirkin
  2019-02-06 12:56     ` Nitesh Narayan Lal
  0 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-05 20:49 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange

On Mon, Feb 04, 2019 at 03:18:52PM -0500, Nitesh Narayan Lal wrote:
> This patch enables the caller to expose a single buffer to the
> other end using a vring descriptor. It also allows the caller to
> perform this action in a synchronous manner by using virtqueue_kick_sync.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

I am not sure why we need this API. Polling in the guest
until the host runs isn't great either, since both
might be running on the same host CPU.



> ---
>  drivers/virtio/virtio_ring.c | 72 ++++++++++++++++++++++++++++++++++++
>  include/linux/virtio.h       |  4 ++
>  2 files changed, 76 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index cd7e755484e3..93c161ac6a28 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -1695,6 +1695,52 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  					out_sgs, in_sgs, data, ctx, gfp);
>  }
>  
> +/**
> + * virtqueue_add_desc - add a buffer to a chain using a vring desc
> + * @vq: the struct virtqueue we're talking about.
> + * @addr: address of the buffer to add.
> + * @len: length of the buffer.
> + * @in: set if the buffer is for the device to write.
> + *
> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> + */
> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	struct vring_desc *desc = vq->split.vring.desc;
> +	u16 flags = in ? VRING_DESC_F_WRITE : 0;
> +	unsigned int i;
> +	void *data = (void *)addr;
> +	int avail_idx;
> +
> +	/* Sanity check */
> +	if (!_vq)
> +		return -EINVAL;
> +
> +	START_USE(vq);
> +	if (unlikely(vq->broken)) {
> +		END_USE(vq);
> +		return -EIO;
> +	}
> +
> +	i = vq->free_head;
> +	flags &= ~VRING_DESC_F_NEXT;
> +	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
> +	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> +	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
> +
> +	vq->vq.num_free--;
> +	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
> +	vq->split.desc_state[i].data = data;
> +	vq->split.avail_idx_shadow = 1;
> +	avail_idx = vq->split.avail_idx_shadow;
> +	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, avail_idx);
> +	vq->num_added = 1;
> +	END_USE(vq);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_add_desc);
> +
>  /**
>   * virtqueue_add_sgs - expose buffers to other end
>   * @vq: the struct virtqueue we're talking about.
> @@ -1842,6 +1888,32 @@ bool virtqueue_notify(struct virtqueue *_vq)
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_notify);
>  
> +/**
> + * virtqueue_kick_sync - update after add_buf and busy wait till update is done
> + * @vq: the struct virtqueue
> + *
> + * After one or more virtqueue_add_* calls, invoke this to kick
> + * the other side. Busy wait till the other side is done with the update.
> + *
> + * Caller must ensure we don't call this with other virtqueue
> + * operations at the same time (except where noted).
> + *
> + * Returns false if kick failed, otherwise true.
> + */
> +bool virtqueue_kick_sync(struct virtqueue *vq)
> +{
> +	u32 len;
> +
> +	if (likely(virtqueue_kick(vq))) {
> +		while (!virtqueue_get_buf(vq, &len) &&
> +		       !virtqueue_is_broken(vq))
> +			cpu_relax();
> +		return true;
> +	}
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
> +
>  /**
>   * virtqueue_kick - update after add_buf
>   * @vq: the struct virtqueue
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index fa1b5da2804e..58943a3a0e8d 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -57,6 +57,10 @@ int virtqueue_add_sgs(struct virtqueue *vq,
>  		      unsigned int in_sgs,
>  		      void *data,
>  		      gfp_t gfp);
> +/* A desc with this init id is treated as an invalid desc */
> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in);
> +
> +bool virtqueue_kick_sync(struct virtqueue *vq);
>  
>  bool virtqueue_kick(struct virtqueue *vq);
>  
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-05 20:45   ` Michael S. Tsirkin
@ 2019-02-05 21:54     ` Nitesh Narayan Lal
  2019-02-05 21:55       ` Michael S. Tsirkin
  0 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-05 21:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange


[-- Attachment #1.1: Type: text/plain, Size: 17546 bytes --]


On 2/5/19 3:45 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
>> This patch enables the kernel to scan the per-CPU array and
>> compress it by removing duplicate and re-allocated pages.
>> Once the per-CPU array is completely filled with pages that are
>> in the buddy, it wakes up the per-CPU kernel thread, which
>> re-scans the entire array while holding the zone lock
>> corresponding to the page being scanned. If a page is still free
>> and present in the buddy, the thread tries to isolate it and
>> adds it to another per-CPU array.
>>
>> Once this scanning process is complete, and if any isolated
>> pages were added to the new per-CPU array, the kernel thread
>> invokes hyperlist_ready().
>>
>> In hyperlist_ready() a hypercall is made to report these pages
>> to the host using the virtio-balloon framework. In order to do
>> so, another virtqueue 'hinting_vq' is added to the balloon
>> framework. Once the host has freed all the reported pages, the
>> kernel thread returns them to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>
> This looks kind of like what early iterations of Wei's patches did.
>
> But this has lots of issues, for example you might end up with
> a hypercall per 4K page.
> So in the end, he switched over to just reporting only
> MAX_ORDER - 1 pages.
You mean that I should only capture/attempt to isolate pages with order
MAX_ORDER - 1?
>
> Would that be a good idea for you too?
Will it help if we have a threshold value based on the amount of memory
captured instead of the number of entries/pages in the array?
>
> An alternative would be a different much lighter weight
> way to report these pages and to free them on the host.
>
>> ---
>>  drivers/virtio/virtio_balloon.c     |  56 +++++++-
>>  include/linux/page_hinting.h        |  18 ++-
>>  include/uapi/linux/virtio_balloon.h |   1 +
>>  mm/page_alloc.c                     |   2 +-
>>  virt/kvm/page_hinting.c             | 202 +++++++++++++++++++++++++++-
>>  5 files changed, 269 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 728ecd1eea30..8af34e0b9a32 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -57,13 +57,15 @@ enum virtio_balloon_vq {
>>  	VIRTIO_BALLOON_VQ_INFLATE,
>>  	VIRTIO_BALLOON_VQ_DEFLATE,
>>  	VIRTIO_BALLOON_VQ_STATS,
>> +	VIRTIO_BALLOON_VQ_HINTING,
>>  	VIRTIO_BALLOON_VQ_FREE_PAGE,
>>  	VIRTIO_BALLOON_VQ_MAX
>>  };
>>  
>>  struct virtio_balloon {
>>  	struct virtio_device *vdev;
>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
>> +								*hinting_vq;
>>  
>>  	/* Balloon's own wq for cpu-intensive work items */
>>  	struct workqueue_struct *balloon_wq;
>> @@ -122,6 +124,40 @@ static struct virtio_device_id id_table[] = {
>>  	{ 0 },
>>  };
>>  
>> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING
>> +void virtballoon_page_hinting(struct virtio_balloon *vb, u64 gvaddr,
>> +			      int hyper_entries)
>> +{
>> +	u64 gpaddr = virt_to_phys((void *)gvaddr);
>> +
>> +	virtqueue_add_desc(vb->hinting_vq, gpaddr, hyper_entries, 0);
>> +	virtqueue_kick_sync(vb->hinting_vq);
>> +}
>> +
>> +static void hinting_ack(struct virtqueue *vq)
>> +{
>> +	struct virtio_balloon *vb = vq->vdev->priv;
>> +
>> +	wake_up(&vb->acked);
>> +}
>> +
>> +static void enable_hinting(struct virtio_balloon *vb)
>> +{
>> +	guest_page_hinting_flag = 1;
>> +	static_branch_enable(&guest_page_hinting_key);
>> +	request_hypercall = (void *)&virtballoon_page_hinting;
>> +	balloon_ptr = vb;
>> +	WARN_ON(smpboot_register_percpu_thread(&hinting_threads));
>> +}
>> +
>> +static void disable_hinting(void)
>> +{
>> +	guest_page_hinting_flag = 0;
>> +	static_branch_enable(&guest_page_hinting_key);
>> +	balloon_ptr = NULL;
>> +}
>> +#endif
>> +
>>  static u32 page_to_balloon_pfn(struct page *page)
>>  {
>>  	unsigned long pfn = page_to_pfn(page);
>> @@ -481,6 +517,7 @@ static int init_vqs(struct virtio_balloon *vb)
>>  	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>>  	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>>  	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>> +	names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>>  
>>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>>  		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
>> @@ -492,11 +529,18 @@ static int init_vqs(struct virtio_balloon *vb)
>>  		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>>  	}
>>  
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
>> +		names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
>> +		callbacks[VIRTIO_BALLOON_VQ_HINTING] = hinting_ack;
>> +	}
>>  	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>>  					 vqs, callbacks, names, NULL, NULL);
>>  	if (err)
>>  		return err;
>>  
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
>> +		vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
>> +
>>  	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>>  	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>> @@ -908,6 +952,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
>>  		if (err)
>>  			goto out_del_balloon_wq;
>>  	}
>> +
>> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
>> +		enable_hinting(vb);
>> +#endif
>>  	virtio_device_ready(vdev);
>>  
>>  	if (towards_target(vb))
>> @@ -950,6 +999,10 @@ static void virtballoon_remove(struct virtio_device *vdev)
>>  	cancel_work_sync(&vb->update_balloon_size_work);
>>  	cancel_work_sync(&vb->update_balloon_stats_work);
>>  
>> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
>> +		disable_hinting();
>> +#endif
>>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>>  		cancel_work_sync(&vb->report_free_page_work);
>>  		destroy_workqueue(vb->balloon_wq);
>> @@ -1009,6 +1062,7 @@ static unsigned int features[] = {
>>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>>  	VIRTIO_BALLOON_F_STATS_VQ,
>>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>> +	VIRTIO_BALLOON_F_HINTING,
>>  	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
>>  	VIRTIO_BALLOON_F_PAGE_POISON,
>>  };
>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>> index e800c6b07561..3ba8c1f3b4a4 100644
>> --- a/include/linux/page_hinting.h
>> +++ b/include/linux/page_hinting.h
>> @@ -1,15 +1,12 @@
>>  #include <linux/smpboot.h>
>>  
>> -/*
>> - * Size of the array which is used to store the freed pages is defined by
>> - * MAX_FGPT_ENTRIES. If possible, we have to find a better way using which
>> - * we can get rid of the hardcoded array size.
>> - */
>>  #define MAX_FGPT_ENTRIES	1000
>>  /*
>>   * hypervisor_pages - It is a dummy structure passed with the hypercall.
>> - * @pfn: page frame number for the page which needs to be sent to the host.
>> - * @order: order of the page needs to be reported to the host.
>> + * @pfn - page frame number for the page which is to be freed.
>> + * @pages - number of pages which are supposed to be freed.
>> + * A global array object is used to hold the list of pfn and pages and is
>> + * passed as part of the hypercall.
>>   */
>>  struct hypervisor_pages {
>>  	unsigned long pfn;
>> @@ -19,11 +16,18 @@ struct hypervisor_pages {
>>  extern int guest_page_hinting_flag;
>>  extern struct static_key_false guest_page_hinting_key;
>>  extern struct smp_hotplug_thread hinting_threads;
>> +extern void (*request_hypercall)(void *, u64, int);
>> +extern void *balloon_ptr;
>>  extern bool want_page_poisoning;
>>  
>>  int guest_page_hinting_sysctl(struct ctl_table *table, int write,
>>  			      void __user *buffer, size_t *lenp, loff_t *ppos);
>>  void guest_free_page(struct page *page, int order);
>> +extern int __isolate_free_page(struct page *page, unsigned int order);
>> +extern void free_one_page(struct zone *zone,
>> +			  struct page *page, unsigned long pfn,
>> +			  unsigned int order,
>> +			  int migratetype);
>>  
>>  static inline void disable_page_poisoning(void)
>>  {
> I guess you will want to put this in some other header.  Function
> declarations belong close to where they are implemented, not used.
I will find a better place.
>
>> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
>> index a1966cd7b677..2b0f62814e22 100644
>> --- a/include/uapi/linux/virtio_balloon.h
>> +++ b/include/uapi/linux/virtio_balloon.h
>> @@ -36,6 +36,7 @@
>>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
>>  #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
>> +#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
>>  
>>  /* Size of a PFN in the balloon interface. */
>>  #define VIRTIO_BALLOON_PFN_SHIFT 12
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d295c9bc01a8..93224cba9243 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1199,7 +1199,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  	spin_unlock(&zone->lock);
>>  }
>>  
>> -static void free_one_page(struct zone *zone,
>> +void free_one_page(struct zone *zone,
>>  				struct page *page, unsigned long pfn,
>>  				unsigned int order,
>>  				int migratetype)
>> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
>> index be529f6f2bc0..315099fcda43 100644
>> --- a/virt/kvm/page_hinting.c
>> +++ b/virt/kvm/page_hinting.c
>> @@ -1,6 +1,8 @@
>>  #include <linux/gfp.h>
>>  #include <linux/mm.h>
>> +#include <linux/page_ref.h>
>>  #include <linux/kvm_host.h>
>> +#include <linux/sort.h>
>>  #include <linux/kernel.h>
>>  
>>  /*
>> @@ -39,6 +41,11 @@ int guest_page_hinting_flag;
>>  EXPORT_SYMBOL(guest_page_hinting_flag);
>>  static DEFINE_PER_CPU(struct task_struct *, hinting_task);
>>  
>> +void (*request_hypercall)(void *, u64, int);
>> +EXPORT_SYMBOL(request_hypercall);
>> +void *balloon_ptr;
>> +EXPORT_SYMBOL(balloon_ptr);
>> +
>>  int guest_page_hinting_sysctl(struct ctl_table *table, int write,
>>  			      void __user *buffer, size_t *lenp,
>>  			      loff_t *ppos)
>> @@ -55,18 +62,201 @@ int guest_page_hinting_sysctl(struct ctl_table *table, int write,
>>  	return ret;
>>  }
>>  
>> +void hyperlist_ready(struct hypervisor_pages *guest_isolated_pages, int entries)
>> +{
>> +	int i = 0;
>> +	int mt = 0;
>> +
>> +	if (balloon_ptr)
>> +		request_hypercall(balloon_ptr, (u64)&guest_isolated_pages[0],
>> +				  entries);
>> +
>> +	while (i < entries) {
>> +		struct page *page = pfn_to_page(guest_isolated_pages[i].pfn);
>> +
>> +		mt = get_pageblock_migratetype(page);
>> +		free_one_page(page_zone(page), page, page_to_pfn(page),
>> +			      guest_isolated_pages[i].order, mt);
>> +		i++;
>> +	}
>> +}
>> +
>> +struct page *get_buddy_page(struct page *page)
>> +{
>> +	unsigned long pfn = page_to_pfn(page);
>> +	unsigned int order;
>> +
>> +	for (order = 0; order < MAX_ORDER; order++) {
>> +		struct page *page_head = page - (pfn & ((1 << order) - 1));
>> +
>> +		if (PageBuddy(page_head) && page_private(page_head) >= order)
>> +			return page_head;
>> +	}
>> +	return NULL;
>> +}
>> +
>>  static void hinting_fn(unsigned int cpu)
>>  {
>>  	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
>> +	int idx = 0, ret = 0;
>> +	struct zone *zone_cur;
>> +	unsigned long flags = 0;
>> +
>> +	while (idx < MAX_FGPT_ENTRIES) {
>> +		unsigned long pfn = page_hinting_obj->kvm_pt[idx].pfn;
>> +		unsigned long pfn_end = page_hinting_obj->kvm_pt[idx].pfn +
>> +			(1 << page_hinting_obj->kvm_pt[idx].order) - 1;
>> +
>> +		while (pfn <= pfn_end) {
>> +			struct page *page = pfn_to_page(pfn);
>> +			struct page *buddy_page = NULL;
>> +
>> +			zone_cur = page_zone(page);
>> +			spin_lock_irqsave(&zone_cur->lock, flags);
>> +
>> +			if (PageCompound(page)) {
>> +				struct page *head_page = compound_head(page);
>> +				unsigned long head_pfn = page_to_pfn(head_page);
>> +				unsigned int alloc_pages =
>> +					1 << compound_order(head_page);
>> +
>> +				pfn = head_pfn + alloc_pages;
>> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
>> +				continue;
>> +			}
>> +
>> +			if (page_ref_count(page)) {
>> +				pfn++;
>> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
>> +				continue;
>> +			}
>> +
>> +			if (PageBuddy(page)) {
>> +				int buddy_order = page_private(page);
>>  
>> +				ret = __isolate_free_page(page, buddy_order);
>> +				if (!ret) {
>> +				} else {
>> +					int l_idx = page_hinting_obj->hyp_idx;
>> +					struct hypervisor_pages *l_obj =
>> +					page_hinting_obj->hypervisor_pagelist;
>> +
>> +					l_obj[l_idx].pfn = pfn;
>> +					l_obj[l_idx].order = buddy_order;
>> +					page_hinting_obj->hyp_idx += 1;
>> +				}
>> +				pfn = pfn + (1 << buddy_order);
>> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
>> +				continue;
>> +			}
>> +
>> +			buddy_page = get_buddy_page(page);
>> +			if (buddy_page) {
>> +				int buddy_order = page_private(buddy_page);
>> +
>> +				ret = __isolate_free_page(buddy_page,
>> +							  buddy_order);
>> +				if (!ret) {
>> +				} else {
>> +					int l_idx = page_hinting_obj->hyp_idx;
>> +					struct hypervisor_pages *l_obj =
>> +					page_hinting_obj->hypervisor_pagelist;
>> +					unsigned long buddy_pfn =
>> +						page_to_pfn(buddy_page);
>> +
>> +					l_obj[l_idx].pfn = buddy_pfn;
>> +					l_obj[l_idx].order = buddy_order;
>> +					page_hinting_obj->hyp_idx += 1;
>> +				}
>> +				pfn = page_to_pfn(buddy_page) +
>> +					(1 << buddy_order);
>> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
>> +				continue;
>> +			}
>> +			spin_unlock_irqrestore(&zone_cur->lock, flags);
>> +			pfn++;
>> +		}
>> +		page_hinting_obj->kvm_pt[idx].pfn = 0;
>> +		page_hinting_obj->kvm_pt[idx].order = -1;
>> +		page_hinting_obj->kvm_pt[idx].zonenum = -1;
>> +		idx++;
>> +	}
>> +	if (page_hinting_obj->hyp_idx > 0) {
>> +		hyperlist_ready(page_hinting_obj->hypervisor_pagelist,
>> +				page_hinting_obj->hyp_idx);
>> +		page_hinting_obj->hyp_idx = 0;
>> +	}
>>  	page_hinting_obj->kvm_pt_idx = 0;
>>  	put_cpu_var(hinting_obj);
>>  }
>>  
>> +int if_exist(struct page *page)
>> +{
>> +	int i = 0;
>> +	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
>> +
>> +	while (i < MAX_FGPT_ENTRIES) {
>> +		if (page_to_pfn(page) == page_hinting_obj->kvm_pt[i].pfn)
>> +			return 1;
>> +		i++;
>> +	}
>> +	return 0;
>> +}
>> +
>> +void pack_array(void)
>> +{
>> +	int i = 0, j = 0;
>> +	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
>> +
>> +	while (i < MAX_FGPT_ENTRIES) {
>> +		if (page_hinting_obj->kvm_pt[i].pfn != 0) {
>> +			if (i != j) {
>> +				page_hinting_obj->kvm_pt[j].pfn =
>> +					page_hinting_obj->kvm_pt[i].pfn;
>> +				page_hinting_obj->kvm_pt[j].order =
>> +					page_hinting_obj->kvm_pt[i].order;
>> +				page_hinting_obj->kvm_pt[j].zonenum =
>> +					page_hinting_obj->kvm_pt[i].zonenum;
>> +			}
>> +			j++;
>> +		}
>> +		i++;
>> +	}
>> +	i = j;
>> +	page_hinting_obj->kvm_pt_idx = j;
>> +	while (j < MAX_FGPT_ENTRIES) {
>> +		page_hinting_obj->kvm_pt[j].pfn = 0;
>> +		page_hinting_obj->kvm_pt[j].order = -1;
>> +		page_hinting_obj->kvm_pt[j].zonenum = -1;
>> +		j++;
>> +	}
>> +}
>> +
>>  void scan_array(void)
>>  {
>>  	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
>> +	int i = 0;
>>  
>> +	while (i < MAX_FGPT_ENTRIES) {
>> +		struct page *page =
>> +			pfn_to_page(page_hinting_obj->kvm_pt[i].pfn);
>> +		struct page *buddy_page = get_buddy_page(page);
>> +
>> +		if (!PageBuddy(page) && buddy_page) {
>> +			if (if_exist(buddy_page)) {
>> +				page_hinting_obj->kvm_pt[i].pfn = 0;
>> +				page_hinting_obj->kvm_pt[i].order = -1;
>> +				page_hinting_obj->kvm_pt[i].zonenum = -1;
>> +			} else {
>> +				page_hinting_obj->kvm_pt[i].pfn =
>> +					page_to_pfn(buddy_page);
>> +				page_hinting_obj->kvm_pt[i].order =
>> +					page_private(buddy_page);
>> +			}
>> +		}
>> +		i++;
>> +	}
>> +	pack_array();
>>  	if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES)
>>  		wake_up_process(__this_cpu_read(hinting_task));
>>  }
>> @@ -111,8 +301,18 @@ void guest_free_page(struct page *page, int order)
>>  		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].order =
>>  							order;
>>  		page_hinting_obj->kvm_pt_idx += 1;
>> -		if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES)
>> +		if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES) {
>> +			/*
>> +			 * We depend on the buddy free list to identify
>> +			 * whether a page is free. Hence, drain all the
>> +			 * per-cpu pages back into the buddy allocator.
>> +			 * This results in fewer failures when we try to
>> +			 * isolate captured free pages, and hence in more
>> +			 * memory being reported to the host.
>> +			 */
>> +			drain_local_pages(NULL);
>>  			scan_array();
>> +		}
>>  	}
>>  	local_irq_restore(flags);
>>  }
>> -- 
>> 2.17.2
-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-05 21:54     ` Nitesh Narayan Lal
@ 2019-02-05 21:55       ` Michael S. Tsirkin
  2019-02-07 17:43         ` Alexander Duyck
  0 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-05 21:55 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange

On Tue, Feb 05, 2019 at 04:54:03PM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/5/19 3:45 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
> >> This patch enables the kernel to scan the per-CPU array and
> >> compress it by removing duplicate and re-allocated pages.
> >> Once the per-CPU array is completely filled with pages that are
> >> in the buddy, it wakes up the per-CPU kernel thread, which
> >> re-scans the entire array while holding the zone lock
> >> corresponding to the page being scanned. If a page is still free
> >> and present in the buddy, the thread tries to isolate it and
> >> adds it to another per-CPU array.
> >>
> >> Once this scanning process is complete, and if any isolated
> >> pages were added to the new per-CPU array, the kernel thread
> >> invokes hyperlist_ready().
> >>
> >> In hyperlist_ready() a hypercall is made to report these pages
> >> to the host using the virtio-balloon framework. In order to do
> >> so, another virtqueue 'hinting_vq' is added to the balloon
> >> framework. Once the host has freed all the reported pages, the
> >> kernel thread returns them to the buddy.
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >
> > This looks kind of like what early iterations of Wei's patches did.
> >
> > But this has lots of issues, for example you might end up with
> > a hypercall per 4K page.
> > So in the end, he switched over to just reporting only
> > MAX_ORDER - 1 pages.
> You mean that I should only capture/attempt to isolate pages with order
> MAX_ORDER - 1?
> >
> > Would that be a good idea for you too?
> Will it help if we have a threshold value based on the amount of memory
> captured instead of the number of entries/pages in the array?

This is what Wei's patches do at least.

> >
> > An alternative would be a different much lighter weight
> > way to report these pages and to free them on the host.
> >
> >> ---
> >>  drivers/virtio/virtio_balloon.c     |  56 +++++++-
> >>  include/linux/page_hinting.h        |  18 ++-
> >>  include/uapi/linux/virtio_balloon.h |   1 +
> >>  mm/page_alloc.c                     |   2 +-
> >>  virt/kvm/page_hinting.c             | 202 +++++++++++++++++++++++++++-
> >>  5 files changed, 269 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> >> index 728ecd1eea30..8af34e0b9a32 100644
> >> --- a/drivers/virtio/virtio_balloon.c
> >> +++ b/drivers/virtio/virtio_balloon.c
> >> @@ -57,13 +57,15 @@ enum virtio_balloon_vq {
> >>  	VIRTIO_BALLOON_VQ_INFLATE,
> >>  	VIRTIO_BALLOON_VQ_DEFLATE,
> >>  	VIRTIO_BALLOON_VQ_STATS,
> >> +	VIRTIO_BALLOON_VQ_HINTING,
> >>  	VIRTIO_BALLOON_VQ_FREE_PAGE,
> >>  	VIRTIO_BALLOON_VQ_MAX
> >>  };
> >>  
> >>  struct virtio_balloon {
> >>  	struct virtio_device *vdev;
> >> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> >> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
> >> +								*hinting_vq;
> >>  
> >>  	/* Balloon's own wq for cpu-intensive work items */
> >>  	struct workqueue_struct *balloon_wq;
> >> @@ -122,6 +124,40 @@ static struct virtio_device_id id_table[] = {
> >>  	{ 0 },
> >>  };
> >>  
> >> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING
> >> +void virtballoon_page_hinting(struct virtio_balloon *vb, u64 gvaddr,
> >> +			      int hyper_entries)
> >> +{
> >> +	u64 gpaddr = virt_to_phys((void *)gvaddr);
> >> +
> >> +	virtqueue_add_desc(vb->hinting_vq, gpaddr, hyper_entries, 0);
> >> +	virtqueue_kick_sync(vb->hinting_vq);
> >> +}
> >> +
> >> +static void hinting_ack(struct virtqueue *vq)
> >> +{
> >> +	struct virtio_balloon *vb = vq->vdev->priv;
> >> +
> >> +	wake_up(&vb->acked);
> >> +}
> >> +
> >> +static void enable_hinting(struct virtio_balloon *vb)
> >> +{
> >> +	guest_page_hinting_flag = 1;
> >> +	static_branch_enable(&guest_page_hinting_key);
> >> +	request_hypercall = (void *)&virtballoon_page_hinting;
> >> +	balloon_ptr = vb;
> >> +	WARN_ON(smpboot_register_percpu_thread(&hinting_threads));
> >> +}
> >> +
> >> +static void disable_hinting(void)
> >> +{
> >> +	guest_page_hinting_flag = 0;
> >> +	static_branch_enable(&guest_page_hinting_key);
> >> +	balloon_ptr = NULL;
> >> +}
> >> +#endif
> >> +
> >>  static u32 page_to_balloon_pfn(struct page *page)
> >>  {
> >>  	unsigned long pfn = page_to_pfn(page);
> >> @@ -481,6 +517,7 @@ static int init_vqs(struct virtio_balloon *vb)
> >>  	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> >>  	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> >>  	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> >> +	names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
> >>  
> >>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> >>  		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> >> @@ -492,11 +529,18 @@ static int init_vqs(struct virtio_balloon *vb)
> >>  		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> >>  	}
> >>  
> >> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> >> +		names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
> >> +		callbacks[VIRTIO_BALLOON_VQ_HINTING] = hinting_ack;
> >> +	}
> >>  	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> >>  					 vqs, callbacks, names, NULL, NULL);
> >>  	if (err)
> >>  		return err;
> >>  
> >> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> >> +		vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
> >> +
> >>  	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> >>  	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> >>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> >> @@ -908,6 +952,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
> >>  		if (err)
> >>  			goto out_del_balloon_wq;
> >>  	}
> >> +
> >> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING
> >> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> >> +		enable_hinting(vb);
> >> +#endif
> >>  	virtio_device_ready(vdev);
> >>  
> >>  	if (towards_target(vb))
> >> @@ -950,6 +999,10 @@ static void virtballoon_remove(struct virtio_device *vdev)
> >>  	cancel_work_sync(&vb->update_balloon_size_work);
> >>  	cancel_work_sync(&vb->update_balloon_stats_work);
> >>  
> >> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING
> >> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> >> +		disable_hinting();
> >> +#endif
> >>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> >>  		cancel_work_sync(&vb->report_free_page_work);
> >>  		destroy_workqueue(vb->balloon_wq);
> >> @@ -1009,6 +1062,7 @@ static unsigned int features[] = {
> >>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
> >>  	VIRTIO_BALLOON_F_STATS_VQ,
> >>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> >> +	VIRTIO_BALLOON_F_HINTING,
> >>  	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> >>  	VIRTIO_BALLOON_F_PAGE_POISON,
> >>  };
> >> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> >> index e800c6b07561..3ba8c1f3b4a4 100644
> >> --- a/include/linux/page_hinting.h
> >> +++ b/include/linux/page_hinting.h
> >> @@ -1,15 +1,12 @@
> >>  #include <linux/smpboot.h>
> >>  
> >> -/*
> >> - * Size of the array which is used to store the freed pages is defined by
> >> - * MAX_FGPT_ENTRIES. If possible, we have to find a better way using which
> >> - * we can get rid of the hardcoded array size.
> >> - */
> >>  #define MAX_FGPT_ENTRIES	1000
> >>  /*
> >>   * hypervisor_pages - It is a dummy structure passed with the hypercall.
> >> - * @pfn: page frame number for the page which needs to be sent to the host.
> >> - * @order: order of the page needs to be reported to the host.
> >> + * @pfn - page frame number for the page which is to be freed.
> >> + * @pages - number of pages which are supposed to be freed.
> >> + * A global array object is used to to hold the list of pfn and pages and is
> >> + * passed as part of the hypercall.
> >>   */
> >>  struct hypervisor_pages {
> >>  	unsigned long pfn;
> >> @@ -19,11 +16,18 @@ struct hypervisor_pages {
> >>  extern int guest_page_hinting_flag;
> >>  extern struct static_key_false guest_page_hinting_key;
> >>  extern struct smp_hotplug_thread hinting_threads;
> >> +extern void (*request_hypercall)(void *, u64, int);
> >> +extern void *balloon_ptr;
> >>  extern bool want_page_poisoning;
> >>  
> >>  int guest_page_hinting_sysctl(struct ctl_table *table, int write,
> >>  			      void __user *buffer, size_t *lenp, loff_t *ppos);
> >>  void guest_free_page(struct page *page, int order);
> >> +extern int __isolate_free_page(struct page *page, unsigned int order);
> >> +extern void free_one_page(struct zone *zone,
> >> +			  struct page *page, unsigned long pfn,
> >> +			  unsigned int order,
> >> +			  int migratetype);
> >>  
> >>  static inline void disable_page_poisoning(void)
> >>  {
> > I guess you will want to put this in some other header.  Function
> > declarations belong close to where they are implemented, not used.
> I will find a better place.
> >
> >> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> >> index a1966cd7b677..2b0f62814e22 100644
> >> --- a/include/uapi/linux/virtio_balloon.h
> >> +++ b/include/uapi/linux/virtio_balloon.h
> >> @@ -36,6 +36,7 @@
> >>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> >>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
> >>  #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
> >> +#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
> >>  
> >>  /* Size of a PFN in the balloon interface. */
> >>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index d295c9bc01a8..93224cba9243 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -1199,7 +1199,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >>  	spin_unlock(&zone->lock);
> >>  }
> >>  
> >> -static void free_one_page(struct zone *zone,
> >> +void free_one_page(struct zone *zone,
> >>  				struct page *page, unsigned long pfn,
> >>  				unsigned int order,
> >>  				int migratetype)
> >> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
> >> index be529f6f2bc0..315099fcda43 100644
> >> --- a/virt/kvm/page_hinting.c
> >> +++ b/virt/kvm/page_hinting.c
> >> @@ -1,6 +1,8 @@
> >>  #include <linux/gfp.h>
> >>  #include <linux/mm.h>
> >> +#include <linux/page_ref.h>
> >>  #include <linux/kvm_host.h>
> >> +#include <linux/sort.h>
> >>  #include <linux/kernel.h>
> >>  
> >>  /*
> >> @@ -39,6 +41,11 @@ int guest_page_hinting_flag;
> >>  EXPORT_SYMBOL(guest_page_hinting_flag);
> >>  static DEFINE_PER_CPU(struct task_struct *, hinting_task);
> >>  
> >> +void (*request_hypercall)(void *, u64, int);
> >> +EXPORT_SYMBOL(request_hypercall);
> >> +void *balloon_ptr;
> >> +EXPORT_SYMBOL(balloon_ptr);
> >> +
> >>  int guest_page_hinting_sysctl(struct ctl_table *table, int write,
> >>  			      void __user *buffer, size_t *lenp,
> >>  			      loff_t *ppos)
> >> @@ -55,18 +62,201 @@ int guest_page_hinting_sysctl(struct ctl_table *table, int write,
> >>  	return ret;
> >>  }
> >>  
> >> +void hyperlist_ready(struct hypervisor_pages *guest_isolated_pages, int entries)
> >> +{
> >> +	int i = 0;
> >> +	int mt = 0;
> >> +
> >> +	if (balloon_ptr)
> >> +		request_hypercall(balloon_ptr, (u64)&guest_isolated_pages[0],
> >> +				  entries);
> >> +
> >> +	while (i < entries) {
> >> +		struct page *page = pfn_to_page(guest_isolated_pages[i].pfn);
> >> +
> >> +		mt = get_pageblock_migratetype(page);
> >> +		free_one_page(page_zone(page), page, page_to_pfn(page),
> >> +			      guest_isolated_pages[i].order, mt);
> >> +		i++;
> >> +	}
> >> +}
> >> +
> >> +struct page *get_buddy_page(struct page *page)
> >> +{
> >> +	unsigned long pfn = page_to_pfn(page);
> >> +	unsigned int order;
> >> +
> >> +	for (order = 0; order < MAX_ORDER; order++) {
> >> +		struct page *page_head = page - (pfn & ((1 << order) - 1));
> >> +
> >> +		if (PageBuddy(page_head) && page_private(page_head) >= order)
> >> +			return page_head;
> >> +	}
> >> +	return NULL;
> >> +}
> >> +
> >>  static void hinting_fn(unsigned int cpu)
> >>  {
> >>  	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
> >> +	int idx = 0, ret = 0;
> >> +	struct zone *zone_cur;
> >> +	unsigned long flags = 0;
> >> +
> >> +	while (idx < MAX_FGPT_ENTRIES) {
> >> +		unsigned long pfn = page_hinting_obj->kvm_pt[idx].pfn;
> >> +		unsigned long pfn_end = page_hinting_obj->kvm_pt[idx].pfn +
> >> +			(1 << page_hinting_obj->kvm_pt[idx].order) - 1;
> >> +
> >> +		while (pfn <= pfn_end) {
> >> +			struct page *page = pfn_to_page(pfn);
> >> +			struct page *buddy_page = NULL;
> >> +
> >> +			zone_cur = page_zone(page);
> >> +			spin_lock_irqsave(&zone_cur->lock, flags);
> >> +
> >> +			if (PageCompound(page)) {
> >> +				struct page *head_page = compound_head(page);
> >> +				unsigned long head_pfn = page_to_pfn(head_page);
> >> +				unsigned int alloc_pages =
> >> +					1 << compound_order(head_page);
> >> +
> >> +				pfn = head_pfn + alloc_pages;
> >> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
> >> +				continue;
> >> +			}
> >> +
> >> +			if (page_ref_count(page)) {
> >> +				pfn++;
> >> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
> >> +				continue;
> >> +			}
> >> +
> >> +			if (PageBuddy(page)) {
> >> +				int buddy_order = page_private(page);
> >>  
> >> +				ret = __isolate_free_page(page, buddy_order);
> >> +				if (!ret) {
> >> +				} else {
> >> +					int l_idx = page_hinting_obj->hyp_idx;
> >> +					struct hypervisor_pages *l_obj =
> >> +					page_hinting_obj->hypervisor_pagelist;
> >> +
> >> +					l_obj[l_idx].pfn = pfn;
> >> +					l_obj[l_idx].order = buddy_order;
> >> +					page_hinting_obj->hyp_idx += 1;
> >> +				}
> >> +				pfn = pfn + (1 << buddy_order);
> >> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
> >> +				continue;
> >> +			}
> >> +
> >> +			buddy_page = get_buddy_page(page);
> >> +			if (buddy_page) {
> >> +				int buddy_order = page_private(buddy_page);
> >> +
> >> +				ret = __isolate_free_page(buddy_page,
> >> +							  buddy_order);
> >> +				if (!ret) {
> >> +				} else {
> >> +					int l_idx = page_hinting_obj->hyp_idx;
> >> +					struct hypervisor_pages *l_obj =
> >> +					page_hinting_obj->hypervisor_pagelist;
> >> +					unsigned long buddy_pfn =
> >> +						page_to_pfn(buddy_page);
> >> +
> >> +					l_obj[l_idx].pfn = buddy_pfn;
> >> +					l_obj[l_idx].order = buddy_order;
> >> +					page_hinting_obj->hyp_idx += 1;
> >> +				}
> >> +				pfn = page_to_pfn(buddy_page) +
> >> +					(1 << buddy_order);
> >> +				spin_unlock_irqrestore(&zone_cur->lock, flags);
> >> +				continue;
> >> +			}
> >> +			spin_unlock_irqrestore(&zone_cur->lock, flags);
> >> +			pfn++;
> >> +		}
> >> +		page_hinting_obj->kvm_pt[idx].pfn = 0;
> >> +		page_hinting_obj->kvm_pt[idx].order = -1;
> >> +		page_hinting_obj->kvm_pt[idx].zonenum = -1;
> >> +		idx++;
> >> +	}
> >> +	if (page_hinting_obj->hyp_idx > 0) {
> >> +		hyperlist_ready(page_hinting_obj->hypervisor_pagelist,
> >> +				page_hinting_obj->hyp_idx);
> >> +		page_hinting_obj->hyp_idx = 0;
> >> +	}
> >>  	page_hinting_obj->kvm_pt_idx = 0;
> >>  	put_cpu_var(hinting_obj);
> >>  }
> >>  
> >> +int if_exist(struct page *page)
> >> +{
> >> +	int i = 0;
> >> +	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
> >> +
> >> +	while (i < MAX_FGPT_ENTRIES) {
> >> +		if (page_to_pfn(page) == page_hinting_obj->kvm_pt[i].pfn)
> >> +			return 1;
> >> +		i++;
> >> +	}
> >> +	return 0;
> >> +}
> >> +
> >> +void pack_array(void)
> >> +{
> >> +	int i = 0, j = 0;
> >> +	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
> >> +
> >> +	while (i < MAX_FGPT_ENTRIES) {
> >> +		if (page_hinting_obj->kvm_pt[i].pfn != 0) {
> >> +			if (i != j) {
> >> +				page_hinting_obj->kvm_pt[j].pfn =
> >> +					page_hinting_obj->kvm_pt[i].pfn;
> >> +				page_hinting_obj->kvm_pt[j].order =
> >> +					page_hinting_obj->kvm_pt[i].order;
> >> +				page_hinting_obj->kvm_pt[j].zonenum =
> >> +					page_hinting_obj->kvm_pt[i].zonenum;
> >> +			}
> >> +			j++;
> >> +		}
> >> +		i++;
> >> +	}
> >> +	i = j;
> >> +	page_hinting_obj->kvm_pt_idx = j;
> >> +	while (j < MAX_FGPT_ENTRIES) {
> >> +		page_hinting_obj->kvm_pt[j].pfn = 0;
> >> +		page_hinting_obj->kvm_pt[j].order = -1;
> >> +		page_hinting_obj->kvm_pt[j].zonenum = -1;
> >> +		j++;
> >> +	}
> >> +}
> >> +
> >>  void scan_array(void)
> >>  {
> >>  	struct page_hinting *page_hinting_obj = this_cpu_ptr(&hinting_obj);
> >> +	int i = 0;
> >>  
> >> +	while (i < MAX_FGPT_ENTRIES) {
> >> +		struct page *page =
> >> +			pfn_to_page(page_hinting_obj->kvm_pt[i].pfn);
> >> +		struct page *buddy_page = get_buddy_page(page);
> >> +
> >> +		if (!PageBuddy(page) && buddy_page) {
> >> +			if (if_exist(buddy_page)) {
> >> +				page_hinting_obj->kvm_pt[i].pfn = 0;
> >> +				page_hinting_obj->kvm_pt[i].order = -1;
> >> +				page_hinting_obj->kvm_pt[i].zonenum = -1;
> >> +			} else {
> >> +				page_hinting_obj->kvm_pt[i].pfn =
> >> +					page_to_pfn(buddy_page);
> >> +				page_hinting_obj->kvm_pt[i].order =
> >> +					page_private(buddy_page);
> >> +			}
> >> +		}
> >> +		i++;
> >> +	}
> >> +	pack_array();
> >>  	if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES)
> >>  		wake_up_process(__this_cpu_read(hinting_task));
> >>  }
> >> @@ -111,8 +301,18 @@ void guest_free_page(struct page *page, int order)
> >>  		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].order =
> >>  							order;
> >>  		page_hinting_obj->kvm_pt_idx += 1;
> >> -		if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES)
> >> +		if (page_hinting_obj->kvm_pt_idx == MAX_FGPT_ENTRIES) {
> >> +			/*
> >> +			 * We are depending on the buddy free-list to identify
> >> +			 * if a page is free or not. Hence, we are dumping all
> >> +			 * the per-cpu pages back into the buddy allocator. This
> >> +			 * will ensure less failures when we try to isolate free
> >> +			 * captured pages and hence more memory reporting to the
> >> +			 * host.
> >> +			 */
> >> +			drain_local_pages(NULL);
> >>  			scan_array();
> >> +		}
> >>  	}
> >>  	local_irq_restore(flags);
> >>  }
> >> -- 
> >> 2.17.2
> -- 
> Regards
> Nitesh
> 




^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host
  2019-02-05 20:49   ` Michael S. Tsirkin
@ 2019-02-06 12:56     ` Nitesh Narayan Lal
  2019-02-06 13:15       ` Luiz Capitulino
  2019-02-06 18:03       ` Michael S. Tsirkin
  0 siblings, 2 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-06 12:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange




On 2/5/19 3:49 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 04, 2019 at 03:18:52PM -0500, Nitesh Narayan Lal wrote:
>> This patch enables the caller to expose a single buffer to the
>> other end using a vring descriptor. It also allows the caller to
>> perform this action in a synchronous manner by using virtqueue_kick_sync.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> I am not sure why we need this API. Polling in the guest
> until the host runs isn't great either, since these
> might be running on the same host CPU.
True.

However, my understanding is that existing APIs such as
virtqueue_add_outbuf() require an allocation, which is problematic
for my implementation.
I am not blocking the allocation path during normal Linux kernel
usage, since even if one zone is locked another zone can be used to
get free pages.
But during initial boot (device initialization), in certain
situations the allocation can only come from a single zone, and
acquiring a lock on it may result in a deadlock.

>
>
>
>> ---
>>  drivers/virtio/virtio_ring.c | 72 ++++++++++++++++++++++++++++++++++++
>>  include/linux/virtio.h       |  4 ++
>>  2 files changed, 76 insertions(+)
>>
>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>> index cd7e755484e3..93c161ac6a28 100644
>> --- a/drivers/virtio/virtio_ring.c
>> +++ b/drivers/virtio/virtio_ring.c
>> @@ -1695,6 +1695,52 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>>  					out_sgs, in_sgs, data, ctx, gfp);
>>  }
>>  
>> +/**
>> + * virtqueue_add_desc - add a buffer to a chain using a vring desc
>> + * @vq: the struct virtqueue we're talking about.
>> + * @addr: address of the buffer to add.
>> + * @len: length of the buffer.
>> + * @in: set if the buffer is for the device to write.
>> + *
>> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
>> + */
>> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in)
>> +{
>> +	struct vring_virtqueue *vq = to_vvq(_vq);
>> +	struct vring_desc *desc = vq->split.vring.desc;
>> +	u16 flags = in ? VRING_DESC_F_WRITE : 0;
>> +	unsigned int i;
>> +	void *data = (void *)addr;
>> +	int avail_idx;
>> +
>> +	/* Sanity check */
>> +	if (!_vq)
>> +		return -EINVAL;
>> +
>> +	START_USE(vq);
>> +	if (unlikely(vq->broken)) {
>> +		END_USE(vq);
>> +		return -EIO;
>> +	}
>> +
>> +	i = vq->free_head;
>> +	flags &= ~VRING_DESC_F_NEXT;
>> +	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
>> +	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
>> +	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
>> +
>> +	vq->vq.num_free--;
>> +	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
>> +	vq->split.desc_state[i].data = data;
>> +	vq->split.avail_idx_shadow = 1;
>> +	avail_idx = vq->split.avail_idx_shadow;
>> +	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, avail_idx);
>> +	vq->num_added = 1;
>> +	END_USE(vq);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(virtqueue_add_desc);
>> +
>>  /**
>>   * virtqueue_add_sgs - expose buffers to other end
>>   * @vq: the struct virtqueue we're talking about.
>> @@ -1842,6 +1888,32 @@ bool virtqueue_notify(struct virtqueue *_vq)
>>  }
>>  EXPORT_SYMBOL_GPL(virtqueue_notify);
>>  
>> +/**
>> + * virtqueue_kick_sync - update after add_buf and busy wait till update is done
>> + * @vq: the struct virtqueue
>> + *
>> + * After one or more virtqueue_add_* calls, invoke this to kick
>> + * the other side. Busy wait till the other side is done with the update.
>> + *
>> + * Caller must ensure we don't call this with other virtqueue
>> + * operations at the same time (except where noted).
>> + *
>> + * Returns false if kick failed, otherwise true.
>> + */
>> +bool virtqueue_kick_sync(struct virtqueue *vq)
>> +{
>> +	u32 len;
>> +
>> +	if (likely(virtqueue_kick(vq))) {
>> +		while (!virtqueue_get_buf(vq, &len) &&
>> +		       !virtqueue_is_broken(vq))
>> +			cpu_relax();
>> +		return true;
>> +	}
>> +	return false;
>> +}
>> +EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
>> +
>>  /**
>>   * virtqueue_kick - update after add_buf
>>   * @vq: the struct virtqueue
>> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
>> index fa1b5da2804e..58943a3a0e8d 100644
>> --- a/include/linux/virtio.h
>> +++ b/include/linux/virtio.h
>> @@ -57,6 +57,10 @@ int virtqueue_add_sgs(struct virtqueue *vq,
>>  		      unsigned int in_sgs,
>>  		      void *data,
>>  		      gfp_t gfp);
>> +/* A desc with this init id is treated as an invalid desc */
>> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in);
>> +
>> +bool virtqueue_kick_sync(struct virtqueue *vq);
>>  
>>  bool virtqueue_kick(struct virtqueue *vq);
>>  
>> -- 
>> 2.17.2
-- 
Regards
Nitesh




* Re: [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host
  2019-02-06 12:56     ` Nitesh Narayan Lal
@ 2019-02-06 13:15       ` Luiz Capitulino
  2019-02-06 13:24         ` Nitesh Narayan Lal
  2019-02-06 18:03       ` Michael S. Tsirkin
  1 sibling, 1 reply; 116+ messages in thread
From: Luiz Capitulino @ 2019-02-06 13:15 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Michael S. Tsirkin, kvm, linux-kernel, pbonzini, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange

On Wed, 6 Feb 2019 07:56:37 -0500
Nitesh Narayan Lal <nitesh@redhat.com> wrote:

> On 2/5/19 3:49 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 03:18:52PM -0500, Nitesh Narayan Lal wrote:  
> >> This patch enables the caller to expose a single buffer to the
> >> other end using a vring descriptor. It also allows the caller to
> >> perform this action in a synchronous manner by using virtqueue_kick_sync.
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>  
> > I am not sure why we need this API. Polling in the guest
> > until the host runs isn't great either, since these
> > might be running on the same host CPU.  
> True.
> 
> However, my understanding is that the existing API such as
> virtqueue_add_outbuf() requires an allocation which will be problematic
> for my implementation.
> Although I am not blocking the allocation path during normal Linux
> kernel usage as even if one of the zone is locked the other zone could
> be used to get free pages.
> But during the initial boot time (device initialization), in certain
> situations the allocation can only come from a single zone, acquiring a
> lock on it may result in a deadlock situation.

I might be wrong, but if I remember correctly, this was true for
your previous implementation where you'd report page hinting down
from arch_free_page() so you couldn't allocate memory. But this
is not the case anymore.

> 
> >
> >
> >  
> >> ---
> >>  drivers/virtio/virtio_ring.c | 72 ++++++++++++++++++++++++++++++++++++
> >>  include/linux/virtio.h       |  4 ++
> >>  2 files changed, 76 insertions(+)
> >>
> >> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >> index cd7e755484e3..93c161ac6a28 100644
> >> --- a/drivers/virtio/virtio_ring.c
> >> +++ b/drivers/virtio/virtio_ring.c
> >> @@ -1695,6 +1695,52 @@ static inline int virtqueue_add(struct virtqueue *_vq,
> >>  					out_sgs, in_sgs, data, ctx, gfp);
> >>  }
> >>  
> >> +/**
> >> + * virtqueue_add_desc - add a buffer to a chain using a vring desc
> >> + * @vq: the struct virtqueue we're talking about.
> >> + * @addr: address of the buffer to add.
> >> + * @len: length of the buffer.
> >> + * @in: set if the buffer is for the device to write.
> >> + *
> >> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> >> + */
> >> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in)
> >> +{
> >> +	struct vring_virtqueue *vq = to_vvq(_vq);
> >> +	struct vring_desc *desc = vq->split.vring.desc;
> >> +	u16 flags = in ? VRING_DESC_F_WRITE : 0;
> >> +	unsigned int i;
> >> +	void *data = (void *)addr;
> >> +	int avail_idx;
> >> +
> >> +	/* Sanity check */
> >> +	if (!_vq)
> >> +		return -EINVAL;
> >> +
> >> +	START_USE(vq);
> >> +	if (unlikely(vq->broken)) {
> >> +		END_USE(vq);
> >> +		return -EIO;
> >> +	}
> >> +
> >> +	i = vq->free_head;
> >> +	flags &= ~VRING_DESC_F_NEXT;
> >> +	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
> >> +	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> >> +	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
> >> +
> >> +	vq->vq.num_free--;
> >> +	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
> >> +	vq->split.desc_state[i].data = data;
> >> +	vq->split.avail_idx_shadow = 1;
> >> +	avail_idx = vq->split.avail_idx_shadow;
> >> +	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, avail_idx);
> >> +	vq->num_added = 1;
> >> +	END_USE(vq);
> >> +	return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(virtqueue_add_desc);
> >> +
> >>  /**
> >>   * virtqueue_add_sgs - expose buffers to other end
> >>   * @vq: the struct virtqueue we're talking about.
> >> @@ -1842,6 +1888,32 @@ bool virtqueue_notify(struct virtqueue *_vq)
> >>  }
> >>  EXPORT_SYMBOL_GPL(virtqueue_notify);
> >>  
> >> +/**
> >> + * virtqueue_kick_sync - update after add_buf and busy wait till update is done
> >> + * @vq: the struct virtqueue
> >> + *
> >> + * After one or more virtqueue_add_* calls, invoke this to kick
> >> + * the other side. Busy wait till the other side is done with the update.
> >> + *
> >> + * Caller must ensure we don't call this with other virtqueue
> >> + * operations at the same time (except where noted).
> >> + *
> >> + * Returns false if kick failed, otherwise true.
> >> + */
> >> +bool virtqueue_kick_sync(struct virtqueue *vq)
> >> +{
> >> +	u32 len;
> >> +
> >> +	if (likely(virtqueue_kick(vq))) {
> >> +		while (!virtqueue_get_buf(vq, &len) &&
> >> +		       !virtqueue_is_broken(vq))
> >> +			cpu_relax();
> >> +		return true;
> >> +	}
> >> +	return false;
> >> +}
> >> +EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
> >> +
> >>  /**
> >>   * virtqueue_kick - update after add_buf
> >>   * @vq: the struct virtqueue
> >> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> >> index fa1b5da2804e..58943a3a0e8d 100644
> >> --- a/include/linux/virtio.h
> >> +++ b/include/linux/virtio.h
> >> @@ -57,6 +57,10 @@ int virtqueue_add_sgs(struct virtqueue *vq,
> >>  		      unsigned int in_sgs,
> >>  		      void *data,
> >>  		      gfp_t gfp);
> >> +/* A desc with this init id is treated as an invalid desc */
> >> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in);
> >> +
> >> +bool virtqueue_kick_sync(struct virtqueue *vq);
> >>  
> >>  bool virtqueue_kick(struct virtqueue *vq);
> >>  
> >> -- 
> >> 2.17.2  



* Re: [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host
  2019-02-06 13:15       ` Luiz Capitulino
@ 2019-02-06 13:24         ` Nitesh Narayan Lal
  2019-02-06 13:29           ` Luiz Capitulino
  0 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-06 13:24 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Michael S. Tsirkin, kvm, linux-kernel, pbonzini, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange




On 2/6/19 8:15 AM, Luiz Capitulino wrote:
> On Wed, 6 Feb 2019 07:56:37 -0500
> Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>> On 2/5/19 3:49 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 04, 2019 at 03:18:52PM -0500, Nitesh Narayan Lal wrote:  
>>>> This patch enables the caller to expose a single buffer to the
>>>> other end using a vring descriptor. It also allows the caller to
>>>> perform this action in a synchronous manner by using virtqueue_kick_sync.
>>>>
>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>  
>>> I am not sure why we need this API. Polling in the guest
>>> until the host runs isn't great either, since these
>>> might be running on the same host CPU.  
>> True.
>>
>> However, my understanding is that existing APIs such as
>> virtqueue_add_outbuf() require an allocation, which is problematic
>> for my implementation.
>> I am not blocking the allocation path during normal Linux kernel
>> usage, since even if one zone is locked another zone can be used to
>> get free pages.
>> But during initial boot (device initialization), in certain
>> situations the allocation can only come from a single zone, and
>> acquiring a lock on it may result in a deadlock.
> I might be wrong, but if I remember correctly, this was true for
> your previous implementation where you'd report page hinting down
> from arch_free_page() so you couldn't allocate memory. But this
> is not the case anymore.

With the earlier implementation, allocation was blocked the entire
time freeing was going on.
With this implementation, allocation is not blocked during normal
Linux kernel usage (after Linux boots up). For example, on a 64-bit
machine, if the Normal zone is locked and there is an allocation
request, it can be served by the DMA32 zone as well. (This is not
the case during device initialization.)
Feel free to correct me if I am wrong.

>
>>>
>>>  
>>>> ---
>>>>  drivers/virtio/virtio_ring.c | 72 ++++++++++++++++++++++++++++++++++++
>>>>  include/linux/virtio.h       |  4 ++
>>>>  2 files changed, 76 insertions(+)
>>>>
>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>> index cd7e755484e3..93c161ac6a28 100644
>>>> --- a/drivers/virtio/virtio_ring.c
>>>> +++ b/drivers/virtio/virtio_ring.c
>>>> @@ -1695,6 +1695,52 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>>>>  					out_sgs, in_sgs, data, ctx, gfp);
>>>>  }
>>>>  
>>>> +/**
>>>> + * virtqueue_add_desc - add a buffer to a chain using a vring desc
>>>> + * @vq: the struct virtqueue we're talking about.
>>>> + * @addr: address of the buffer to add.
>>>> + * @len: length of the buffer.
>>>> + * @in: set if the buffer is for the device to write.
>>>> + *
>>>> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
>>>> + */
>>>> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in)
>>>> +{
>>>> +	struct vring_virtqueue *vq = to_vvq(_vq);
>>>> +	struct vring_desc *desc = vq->split.vring.desc;
>>>> +	u16 flags = in ? VRING_DESC_F_WRITE : 0;
>>>> +	unsigned int i;
>>>> +	void *data = (void *)addr;
>>>> +	int avail_idx;
>>>> +
>>>> +	/* Sanity check */
>>>> +	if (!_vq)
>>>> +		return -EINVAL;
>>>> +
>>>> +	START_USE(vq);
>>>> +	if (unlikely(vq->broken)) {
>>>> +		END_USE(vq);
>>>> +		return -EIO;
>>>> +	}
>>>> +
>>>> +	i = vq->free_head;
>>>> +	flags &= ~VRING_DESC_F_NEXT;
>>>> +	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
>>>> +	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
>>>> +	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
>>>> +
>>>> +	vq->vq.num_free--;
>>>> +	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
>>>> +	vq->split.desc_state[i].data = data;
>>>> +	vq->split.avail_idx_shadow = 1;
>>>> +	avail_idx = vq->split.avail_idx_shadow;
>>>> +	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, avail_idx);
>>>> +	vq->num_added = 1;
>>>> +	END_USE(vq);
>>>> +	return 0;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(virtqueue_add_desc);
>>>> +
>>>>  /**
>>>>   * virtqueue_add_sgs - expose buffers to other end
>>>>   * @vq: the struct virtqueue we're talking about.
>>>> @@ -1842,6 +1888,32 @@ bool virtqueue_notify(struct virtqueue *_vq)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(virtqueue_notify);
>>>>  
>>>> +/**
>>>> + * virtqueue_kick_sync - update after add_buf and busy wait till update is done
>>>> + * @vq: the struct virtqueue
>>>> + *
>>>> + * After one or more virtqueue_add_* calls, invoke this to kick
>>>> + * the other side. Busy wait till the other side is done with the update.
>>>> + *
>>>> + * Caller must ensure we don't call this with other virtqueue
>>>> + * operations at the same time (except where noted).
>>>> + *
>>>> + * Returns false if kick failed, otherwise true.
>>>> + */
>>>> +bool virtqueue_kick_sync(struct virtqueue *vq)
>>>> +{
>>>> +	u32 len;
>>>> +
>>>> +	if (likely(virtqueue_kick(vq))) {
>>>> +		while (!virtqueue_get_buf(vq, &len) &&
>>>> +		       !virtqueue_is_broken(vq))
>>>> +			cpu_relax();
>>>> +		return true;
>>>> +	}
>>>> +	return false;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
>>>> +
>>>>  /**
>>>>   * virtqueue_kick - update after add_buf
>>>>   * @vq: the struct virtqueue
>>>> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
>>>> index fa1b5da2804e..58943a3a0e8d 100644
>>>> --- a/include/linux/virtio.h
>>>> +++ b/include/linux/virtio.h
>>>> @@ -57,6 +57,10 @@ int virtqueue_add_sgs(struct virtqueue *vq,
>>>>  		      unsigned int in_sgs,
>>>>  		      void *data,
>>>>  		      gfp_t gfp);
>>>> +/* A desc with this init id is treated as an invalid desc */
>>>> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in);
>>>> +
>>>> +bool virtqueue_kick_sync(struct virtqueue *vq);
>>>>  
>>>>  bool virtqueue_kick(struct virtqueue *vq);
>>>>  
>>>> -- 
>>>> 2.17.2  
-- 
Regards
Nitesh




* Re: [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host
  2019-02-06 13:24         ` Nitesh Narayan Lal
@ 2019-02-06 13:29           ` Luiz Capitulino
  2019-02-06 14:05             ` Nitesh Narayan Lal
  0 siblings, 1 reply; 116+ messages in thread
From: Luiz Capitulino @ 2019-02-06 13:29 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Michael S. Tsirkin, kvm, linux-kernel, pbonzini, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange

On Wed, 6 Feb 2019 08:24:14 -0500
Nitesh Narayan Lal <nitesh@redhat.com> wrote:

> On 2/6/19 8:15 AM, Luiz Capitulino wrote:
> > On Wed, 6 Feb 2019 07:56:37 -0500
> > Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >  
> >> On 2/5/19 3:49 PM, Michael S. Tsirkin wrote:  
> >>> On Mon, Feb 04, 2019 at 03:18:52PM -0500, Nitesh Narayan Lal wrote:    
> >>>> This patch enables the caller to expose a single buffer to the
> >>>> other end using a vring descriptor. It also allows the caller to
> >>>> perform this action in a synchronous manner by using virtqueue_kick_sync.
> >>>>
> >>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>    
> >>> I am not sure why we need this API. Polling in the guest
> >>> until the host runs isn't great either, since these
> >>> might be running on the same host CPU.    
> >> True.
> >>
> >> However, my understanding is that the existing API, such as
> >> virtqueue_add_outbuf(), requires an allocation, which will be
> >> problematic for my implementation.
> >> I am not blocking the allocation path during normal Linux kernel
> >> usage, though, as even if one of the zones is locked the other zone
> >> can still be used to get free pages.
> >> But during initial boot time (device initialization), in certain
> >> situations the allocation can only come from a single zone; acquiring
> >> a lock on it may then result in a deadlock.
> > I might be wrong, but if I remember correctly, this was true for
> > your previous implementation where you'd report page hinting down
> > from arch_free_page() so you couldn't allocate memory. But this
> > is not the case anymore.  
> 
> With the earlier implementation, the allocation was blocked the entire
> time freeing was going on.
> With this implementation, the allocation is not blocked during normal
> Linux kernel usage (after Linux boots up). For example, on a 64-bit
> machine, if the Normal zone is locked and there is an allocation request,
> it can be served by the DMA32 zone as well. (This is not the case
> during device initialization time.)
> Feel free to correct me if I am wrong.

That's what I meant :) I had the impression that your virtio API
was necessary because of your earlier design. I guess it's not needed
anymore, as Michael says.

> 
> >  
> >>>
> >>>    
> >>>> ---
> >>>>  drivers/virtio/virtio_ring.c | 72 ++++++++++++++++++++++++++++++++++++
> >>>>  include/linux/virtio.h       |  4 ++
> >>>>  2 files changed, 76 insertions(+)
> >>>>
> >>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >>>> index cd7e755484e3..93c161ac6a28 100644
> >>>> --- a/drivers/virtio/virtio_ring.c
> >>>> +++ b/drivers/virtio/virtio_ring.c
> >>>> @@ -1695,6 +1695,52 @@ static inline int virtqueue_add(struct virtqueue *_vq,
> >>>>  					out_sgs, in_sgs, data, ctx, gfp);
> >>>>  }
> >>>>  
> >>>> +/**
> >>>> + * virtqueue_add_desc - add a buffer to a chain using a vring desc
> >>>> + * @vq: the struct virtqueue we're talking about.
> >>>> + * @addr: address of the buffer to add.
> >>>> + * @len: length of the buffer.
> >>>> + * @in: set if the buffer is for the device to write.
> >>>> + *
> >>>> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> >>>> + */
> >>>> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in)
> >>>> +{
> >>>> +	struct vring_virtqueue *vq = to_vvq(_vq);
> >>>> +	struct vring_desc *desc = vq->split.vring.desc;
> >>>> +	u16 flags = in ? VRING_DESC_F_WRITE : 0;
> >>>> +	unsigned int i;
> >>>> +	void *data = (void *)addr;
> >>>> +	int avail_idx;
> >>>> +
> >>>> +	/* Sanity check */
> >>>> +	if (!_vq)
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	START_USE(vq);
> >>>> +	if (unlikely(vq->broken)) {
> >>>> +		END_USE(vq);
> >>>> +		return -EIO;
> >>>> +	}
> >>>> +
> >>>> +	i = vq->free_head;
> >>>> +	flags &= ~VRING_DESC_F_NEXT;
> >>>> +	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
> >>>> +	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> >>>> +	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
> >>>> +
> >>>> +	vq->vq.num_free--;
> >>>> +	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
> >>>> +	vq->split.desc_state[i].data = data;
> >>>> +	vq->split.avail_idx_shadow = 1;
> >>>> +	avail_idx = vq->split.avail_idx_shadow;
> >>>> +	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, avail_idx);
> >>>> +	vq->num_added = 1;
> >>>> +	END_USE(vq);
> >>>> +	return 0;
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(virtqueue_add_desc);
> >>>> +
> >>>>  /**
> >>>>   * virtqueue_add_sgs - expose buffers to other end
> >>>>   * @vq: the struct virtqueue we're talking about.
> >>>> @@ -1842,6 +1888,32 @@ bool virtqueue_notify(struct virtqueue *_vq)
> >>>>  }
> >>>>  EXPORT_SYMBOL_GPL(virtqueue_notify);
> >>>>  
> >>>> +/**
> >>>> + * virtqueue_kick_sync - update after add_buf and busy wait till update is done
> >>>> + * @vq: the struct virtqueue
> >>>> + *
> >>>> + * After one or more virtqueue_add_* calls, invoke this to kick
> >>>> + * the other side. Busy wait till the other side is done with the update.
> >>>> + *
> >>>> + * Caller must ensure we don't call this with other virtqueue
> >>>> + * operations at the same time (except where noted).
> >>>> + *
> >>>> + * Returns false if kick failed, otherwise true.
> >>>> + */
> >>>> +bool virtqueue_kick_sync(struct virtqueue *vq)
> >>>> +{
> >>>> +	u32 len;
> >>>> +
> >>>> +	if (likely(virtqueue_kick(vq))) {
> >>>> +		while (!virtqueue_get_buf(vq, &len) &&
> >>>> +		       !virtqueue_is_broken(vq))
> >>>> +			cpu_relax();
> >>>> +		return true;
> >>>> +	}
> >>>> +	return false;
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
> >>>> +
> >>>>  /**
> >>>>   * virtqueue_kick - update after add_buf
> >>>>   * @vq: the struct virtqueue
> >>>> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> >>>> index fa1b5da2804e..58943a3a0e8d 100644
> >>>> --- a/include/linux/virtio.h
> >>>> +++ b/include/linux/virtio.h
> >>>> @@ -57,6 +57,10 @@ int virtqueue_add_sgs(struct virtqueue *vq,
> >>>>  		      unsigned int in_sgs,
> >>>>  		      void *data,
> >>>>  		      gfp_t gfp);
> >>>> +/* A desc with this init id is treated as an invalid desc */
> >>>> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in);
> >>>> +
> >>>> +bool virtqueue_kick_sync(struct virtqueue *vq);
> >>>>  
> >>>>  bool virtqueue_kick(struct virtqueue *vq);
> >>>>  
> >>>> -- 
> >>>> 2.17.2    



* Re: [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host
  2019-02-06 13:29           ` Luiz Capitulino
@ 2019-02-06 14:05             ` Nitesh Narayan Lal
  0 siblings, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-06 14:05 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Michael S. Tsirkin, kvm, linux-kernel, pbonzini, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange


[-- Attachment #1.1: Type: text/plain, Size: 6661 bytes --]

On 2/6/19 8:29 AM, Luiz Capitulino wrote:
> On Wed, 6 Feb 2019 08:24:14 -0500
> Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>> On 2/6/19 8:15 AM, Luiz Capitulino wrote:
>>> On Wed, 6 Feb 2019 07:56:37 -0500
>>> Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>  
>>>> On 2/5/19 3:49 PM, Michael S. Tsirkin wrote:  
>>>>> On Mon, Feb 04, 2019 at 03:18:52PM -0500, Nitesh Narayan Lal wrote:    
>>>>>> This patch enables the caller to expose a single buffer to the
>>>>>> other end using a vring descriptor. It also allows the caller to
>>>>>> perform this action in a synchronous manner by using virtqueue_kick_sync.
>>>>>>
>>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>    
>>>>> I am not sure why we need this API. Polling in the guest
>>>>> until the host runs isn't great either, since these
>>>>> might be running on the same host CPU.
>>>> True.
>>>>
>>>> However, my understanding is that the existing API, such as
>>>> virtqueue_add_outbuf(), requires an allocation, which will be
>>>> problematic for my implementation.
>>>> I am not blocking the allocation path during normal Linux kernel
>>>> usage, though, as even if one of the zones is locked the other zone
>>>> can still be used to get free pages.
>>>> But during initial boot time (device initialization), in certain
>>>> situations the allocation can only come from a single zone; acquiring
>>>> a lock on it may then result in a deadlock.
>>> I might be wrong, but if I remember correctly, this was true for
>>> your previous implementation where you'd report page hinting down
>>> from arch_free_page() so you couldn't allocate memory. But this
>>> is not the case anymore.  
>> With the earlier implementation, the allocation was blocked the entire
>> time freeing was going on.
>> With this implementation, the allocation is not blocked during normal
>> Linux kernel usage (after Linux boots up). For example, on a 64-bit
>> machine, if the Normal zone is locked and there is an allocation request,
>> it can be served by the DMA32 zone as well. (This is not the case
>> during device initialization time.)
>> Feel free to correct me if I am wrong.
> That's what I meant :) I had the impression that your virtio API
> was necessary because of your earlier design. I guess it's not needed
> anymore, as Michael says.
I will re-visit this change before my next posting.
>>>  
>>>>>    
>>>>>> ---
>>>>>>  drivers/virtio/virtio_ring.c | 72 ++++++++++++++++++++++++++++++++++++
>>>>>>  include/linux/virtio.h       |  4 ++
>>>>>>  2 files changed, 76 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>> index cd7e755484e3..93c161ac6a28 100644
>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>> @@ -1695,6 +1695,52 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>>>>>>  					out_sgs, in_sgs, data, ctx, gfp);
>>>>>>  }
>>>>>>  
>>>>>> +/**
>>>>>> + * virtqueue_add_desc - add a buffer to a chain using a vring desc
>>>>>> + * @vq: the struct virtqueue we're talking about.
>>>>>> + * @addr: address of the buffer to add.
>>>>>> + * @len: length of the buffer.
>>>>>> + * @in: set if the buffer is for the device to write.
>>>>>> + *
>>>>>> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
>>>>>> + */
>>>>>> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in)
>>>>>> +{
>>>>>> +	struct vring_virtqueue *vq = to_vvq(_vq);
>>>>>> +	struct vring_desc *desc = vq->split.vring.desc;
>>>>>> +	u16 flags = in ? VRING_DESC_F_WRITE : 0;
>>>>>> +	unsigned int i;
>>>>>> +	void *data = (void *)addr;
>>>>>> +	int avail_idx;
>>>>>> +
>>>>>> +	/* Sanity check */
>>>>>> +	if (!_vq)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	START_USE(vq);
>>>>>> +	if (unlikely(vq->broken)) {
>>>>>> +		END_USE(vq);
>>>>>> +		return -EIO;
>>>>>> +	}
>>>>>> +
>>>>>> +	i = vq->free_head;
>>>>>> +	flags &= ~VRING_DESC_F_NEXT;
>>>>>> +	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
>>>>>> +	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
>>>>>> +	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
>>>>>> +
>>>>>> +	vq->vq.num_free--;
>>>>>> +	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
>>>>>> +	vq->split.desc_state[i].data = data;
>>>>>> +	vq->split.avail_idx_shadow = 1;
>>>>>> +	avail_idx = vq->split.avail_idx_shadow;
>>>>>> +	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, avail_idx);
>>>>>> +	vq->num_added = 1;
>>>>>> +	END_USE(vq);
>>>>>> +	return 0;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(virtqueue_add_desc);
>>>>>> +
>>>>>>  /**
>>>>>>   * virtqueue_add_sgs - expose buffers to other end
>>>>>>   * @vq: the struct virtqueue we're talking about.
>>>>>> @@ -1842,6 +1888,32 @@ bool virtqueue_notify(struct virtqueue *_vq)
>>>>>>  }
>>>>>>  EXPORT_SYMBOL_GPL(virtqueue_notify);
>>>>>>  
>>>>>> +/**
>>>>>> + * virtqueue_kick_sync - update after add_buf and busy wait till update is done
>>>>>> + * @vq: the struct virtqueue
>>>>>> + *
>>>>>> + * After one or more virtqueue_add_* calls, invoke this to kick
>>>>>> + * the other side. Busy wait till the other side is done with the update.
>>>>>> + *
>>>>>> + * Caller must ensure we don't call this with other virtqueue
>>>>>> + * operations at the same time (except where noted).
>>>>>> + *
>>>>>> + * Returns false if kick failed, otherwise true.
>>>>>> + */
>>>>>> +bool virtqueue_kick_sync(struct virtqueue *vq)
>>>>>> +{
>>>>>> +	u32 len;
>>>>>> +
>>>>>> +	if (likely(virtqueue_kick(vq))) {
>>>>>> +		while (!virtqueue_get_buf(vq, &len) &&
>>>>>> +		       !virtqueue_is_broken(vq))
>>>>>> +			cpu_relax();
>>>>>> +		return true;
>>>>>> +	}
>>>>>> +	return false;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
>>>>>> +
>>>>>>  /**
>>>>>>   * virtqueue_kick - update after add_buf
>>>>>>   * @vq: the struct virtqueue
>>>>>> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
>>>>>> index fa1b5da2804e..58943a3a0e8d 100644
>>>>>> --- a/include/linux/virtio.h
>>>>>> +++ b/include/linux/virtio.h
>>>>>> @@ -57,6 +57,10 @@ int virtqueue_add_sgs(struct virtqueue *vq,
>>>>>>  		      unsigned int in_sgs,
>>>>>>  		      void *data,
>>>>>>  		      gfp_t gfp);
>>>>>> +/* A desc with this init id is treated as an invalid desc */
>>>>>> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in);
>>>>>> +
>>>>>> +bool virtqueue_kick_sync(struct virtqueue *vq);
>>>>>>  
>>>>>>  bool virtqueue_kick(struct virtqueue *vq);
>>>>>>  
>>>>>> -- 
>>>>>> 2.17.2    
-- 
Regards
Nitesh




* Re: [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host
  2019-02-06 12:56     ` Nitesh Narayan Lal
  2019-02-06 13:15       ` Luiz Capitulino
@ 2019-02-06 18:03       ` Michael S. Tsirkin
  2019-02-06 18:19         ` Nitesh Narayan Lal
  1 sibling, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-06 18:03 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange

On Wed, Feb 06, 2019 at 07:56:37AM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/5/19 3:49 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 03:18:52PM -0500, Nitesh Narayan Lal wrote:
> >> This patch enables the caller to expose a single buffer to the
> >> other end using a vring descriptor. It also allows the caller to
> >> perform this action in a synchronous manner by using virtqueue_kick_sync.
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > I am not sure why we need this API. Polling in the guest
> > until the host runs isn't great either, since these
> > might be running on the same host CPU.
> True.
> 
> However, my understanding is that the existing API, such as
> virtqueue_add_outbuf(), requires an allocation, which will be
> problematic for my implementation.

Not with a single s/g entry, no.

> I am not blocking the allocation path during normal Linux kernel
> usage, though, as even if one of the zones is locked the other zone
> can still be used to get free pages.


I am a bit confused about the locking; I was under the impression
that you are not calling virtio under a zone lock.
FYI, doing that was nacked by Linus.

> But during initial boot time (device initialization), in certain
> situations the allocation can only come from a single zone; acquiring
> a lock on it may then result in a deadlock.
> 
> >
> >
> >
> >> ---
> >>  drivers/virtio/virtio_ring.c | 72 ++++++++++++++++++++++++++++++++++++
> >>  include/linux/virtio.h       |  4 ++
> >>  2 files changed, 76 insertions(+)
> >>
> >> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >> index cd7e755484e3..93c161ac6a28 100644
> >> --- a/drivers/virtio/virtio_ring.c
> >> +++ b/drivers/virtio/virtio_ring.c
> >> @@ -1695,6 +1695,52 @@ static inline int virtqueue_add(struct virtqueue *_vq,
> >>  					out_sgs, in_sgs, data, ctx, gfp);
> >>  }
> >>  
> >> +/**
> >> + * virtqueue_add_desc - add a buffer to a chain using a vring desc
> >> + * @vq: the struct virtqueue we're talking about.
> >> + * @addr: address of the buffer to add.
> >> + * @len: length of the buffer.
> >> + * @in: set if the buffer is for the device to write.
> >> + *
> >> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> >> + */
> >> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in)
> >> +{
> >> +	struct vring_virtqueue *vq = to_vvq(_vq);
> >> +	struct vring_desc *desc = vq->split.vring.desc;
> >> +	u16 flags = in ? VRING_DESC_F_WRITE : 0;
> >> +	unsigned int i;
> >> +	void *data = (void *)addr;
> >> +	int avail_idx;
> >> +
> >> +	/* Sanity check */
> >> +	if (!_vq)
> >> +		return -EINVAL;
> >> +
> >> +	START_USE(vq);
> >> +	if (unlikely(vq->broken)) {
> >> +		END_USE(vq);
> >> +		return -EIO;
> >> +	}
> >> +
> >> +	i = vq->free_head;
> >> +	flags &= ~VRING_DESC_F_NEXT;
> >> +	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
> >> +	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> >> +	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
> >> +
> >> +	vq->vq.num_free--;
> >> +	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
> >> +	vq->split.desc_state[i].data = data;
> >> +	vq->split.avail_idx_shadow = 1;
> >> +	avail_idx = vq->split.avail_idx_shadow;
> >> +	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, avail_idx);
> >> +	vq->num_added = 1;
> >> +	END_USE(vq);
> >> +	return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(virtqueue_add_desc);
> >> +
> >>  /**
> >>   * virtqueue_add_sgs - expose buffers to other end
> >>   * @vq: the struct virtqueue we're talking about.
> >> @@ -1842,6 +1888,32 @@ bool virtqueue_notify(struct virtqueue *_vq)
> >>  }
> >>  EXPORT_SYMBOL_GPL(virtqueue_notify);
> >>  
> >> +/**
> >> + * virtqueue_kick_sync - update after add_buf and busy wait till update is done
> >> + * @vq: the struct virtqueue
> >> + *
> >> + * After one or more virtqueue_add_* calls, invoke this to kick
> >> + * the other side. Busy wait till the other side is done with the update.
> >> + *
> >> + * Caller must ensure we don't call this with other virtqueue
> >> + * operations at the same time (except where noted).
> >> + *
> >> + * Returns false if kick failed, otherwise true.
> >> + */
> >> +bool virtqueue_kick_sync(struct virtqueue *vq)
> >> +{
> >> +	u32 len;
> >> +
> >> +	if (likely(virtqueue_kick(vq))) {
> >> +		while (!virtqueue_get_buf(vq, &len) &&
> >> +		       !virtqueue_is_broken(vq))
> >> +			cpu_relax();
> >> +		return true;
> >> +	}
> >> +	return false;
> >> +}
> >> +EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
> >> +
> >>  /**
> >>   * virtqueue_kick - update after add_buf
> >>   * @vq: the struct virtqueue
> >> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> >> index fa1b5da2804e..58943a3a0e8d 100644
> >> --- a/include/linux/virtio.h
> >> +++ b/include/linux/virtio.h
> >> @@ -57,6 +57,10 @@ int virtqueue_add_sgs(struct virtqueue *vq,
> >>  		      unsigned int in_sgs,
> >>  		      void *data,
> >>  		      gfp_t gfp);
> >> +/* A desc with this init id is treated as an invalid desc */
> >> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in);
> >> +
> >> +bool virtqueue_kick_sync(struct virtqueue *vq);
> >>  
> >>  bool virtqueue_kick(struct virtqueue *vq);
> >>  
> >> -- 
> >> 2.17.2
> -- 
> Regards
> Nitesh
> 





* Re: [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host
  2019-02-06 18:03       ` Michael S. Tsirkin
@ 2019-02-06 18:19         ` Nitesh Narayan Lal
  0 siblings, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-06 18:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange



On 2/6/19 1:03 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 06, 2019 at 07:56:37AM -0500, Nitesh Narayan Lal wrote:
>> On 2/5/19 3:49 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 04, 2019 at 03:18:52PM -0500, Nitesh Narayan Lal wrote:
>>>> This patch enables the caller to expose a single buffer to the
>>>> other end using a vring descriptor. It also allows the caller to
>>>> perform this action in a synchronous manner by using virtqueue_kick_sync.
>>>>
>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>> I am not sure why we need this API. Polling in the guest
>>> until the host runs isn't great either, since these
>>> might be running on the same host CPU.
>> True.
>>
>> However, my understanding is that the existing API, such as
>> virtqueue_add_outbuf(), requires an allocation, which will be
>> problematic for my implementation.
> Not with a single s/g entry, no.
Didn't know this. I will re-check.
>
>> I am not blocking the allocation path during normal Linux kernel
>> usage, though, as even if one of the zones is locked the other zone
>> can still be used to get free pages.
>
> I am a bit confused about the locking;
My bad, I think I created the confusion.
> I was under the impression
> that you are not calling virtio under a zone lock.

Yeap. Your understanding is correct.
I will re-visit this and correct it in the next version.
> FYI, doing that was nacked by Linus.
>
>
>> But during initial boot time (device initialization), in certain
>> situations the allocation can only come from a single zone; acquiring
>> a lock on it may then result in a deadlock.
>>
>>>
>>>
>>>> ---
>>>>  drivers/virtio/virtio_ring.c | 72 ++++++++++++++++++++++++++++++++++++
>>>>  include/linux/virtio.h       |  4 ++
>>>>  2 files changed, 76 insertions(+)
>>>>
>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>> index cd7e755484e3..93c161ac6a28 100644
>>>> --- a/drivers/virtio/virtio_ring.c
>>>> +++ b/drivers/virtio/virtio_ring.c
>>>> @@ -1695,6 +1695,52 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>>>>  					out_sgs, in_sgs, data, ctx, gfp);
>>>>  }
>>>>  
>>>> +/**
>>>> + * virtqueue_add_desc - add a buffer to a chain using a vring desc
>>>> + * @vq: the struct virtqueue we're talking about.
>>>> + * @addr: address of the buffer to add.
>>>> + * @len: length of the buffer.
>>>> + * @in: set if the buffer is for the device to write.
>>>> + *
>>>> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
>>>> + */
>>>> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in)
>>>> +{
>>>> +	struct vring_virtqueue *vq = to_vvq(_vq);
>>>> +	struct vring_desc *desc = vq->split.vring.desc;
>>>> +	u16 flags = in ? VRING_DESC_F_WRITE : 0;
>>>> +	unsigned int i;
>>>> +	void *data = (void *)addr;
>>>> +	int avail_idx;
>>>> +
>>>> +	/* Sanity check */
>>>> +	if (!_vq)
>>>> +		return -EINVAL;
>>>> +
>>>> +	START_USE(vq);
>>>> +	if (unlikely(vq->broken)) {
>>>> +		END_USE(vq);
>>>> +		return -EIO;
>>>> +	}
>>>> +
>>>> +	i = vq->free_head;
>>>> +	flags &= ~VRING_DESC_F_NEXT;
>>>> +	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
>>>> +	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
>>>> +	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
>>>> +
>>>> +	vq->vq.num_free--;
>>>> +	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
>>>> +	vq->split.desc_state[i].data = data;
>>>> +	vq->split.avail_idx_shadow = 1;
>>>> +	avail_idx = vq->split.avail_idx_shadow;
>>>> +	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, avail_idx);
>>>> +	vq->num_added = 1;
>>>> +	END_USE(vq);
>>>> +	return 0;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(virtqueue_add_desc);
>>>> +
>>>>  /**
>>>>   * virtqueue_add_sgs - expose buffers to other end
>>>>   * @vq: the struct virtqueue we're talking about.
>>>> @@ -1842,6 +1888,32 @@ bool virtqueue_notify(struct virtqueue *_vq)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(virtqueue_notify);
>>>>  
>>>> +/**
>>>> + * virtqueue_kick_sync - update after add_buf and busy wait till update is done
>>>> + * @vq: the struct virtqueue
>>>> + *
>>>> + * After one or more virtqueue_add_* calls, invoke this to kick
>>>> + * the other side. Busy wait till the other side is done with the update.
>>>> + *
>>>> + * Caller must ensure we don't call this with other virtqueue
>>>> + * operations at the same time (except where noted).
>>>> + *
>>>> + * Returns false if kick failed, otherwise true.
>>>> + */
>>>> +bool virtqueue_kick_sync(struct virtqueue *vq)
>>>> +{
>>>> +	u32 len;
>>>> +
>>>> +	if (likely(virtqueue_kick(vq))) {
>>>> +		while (!virtqueue_get_buf(vq, &len) &&
>>>> +		       !virtqueue_is_broken(vq))
>>>> +			cpu_relax();
>>>> +		return true;
>>>> +	}
>>>> +	return false;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
>>>> +
>>>>  /**
>>>>   * virtqueue_kick - update after add_buf
>>>>   * @vq: the struct virtqueue
>>>> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
>>>> index fa1b5da2804e..58943a3a0e8d 100644
>>>> --- a/include/linux/virtio.h
>>>> +++ b/include/linux/virtio.h
>>>> @@ -57,6 +57,10 @@ int virtqueue_add_sgs(struct virtqueue *vq,
>>>>  		      unsigned int in_sgs,
>>>>  		      void *data,
>>>>  		      gfp_t gfp);
>>>> +/* A desc with this init id is treated as an invalid desc */
>>>> +int virtqueue_add_desc(struct virtqueue *_vq, u64 addr, u32 len, int in);
>>>> +
>>>> +bool virtqueue_kick_sync(struct virtqueue *vq);
>>>>  
>>>>  bool virtqueue_kick(struct virtqueue *vq);
>>>>  
>>>> -- 
>>>> 2.17.2
>> -- 
>> Regards
>> Nitesh
>>
>
>
-- 
Regards
Nitesh




* Re: [RFC][Patch v8 4/7] KVM: Disabling page poisoning to prevent corruption
  2019-02-04 20:18 ` [RFC][Patch v8 4/7] KVM: Disabling page poisoning to prevent corruption Nitesh Narayan Lal
@ 2019-02-07 17:23   ` Alexander Duyck
  2019-02-07 17:56     ` Nitesh Narayan Lal
  2019-02-07 21:08   ` Michael S. Tsirkin
  1 sibling, 1 reply; 116+ messages in thread
From: Alexander Duyck @ 2019-02-07 17:23 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang,
	Yang Zhang, riel, david, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Mon, Feb 4, 2019 at 2:11 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> This patch disables page poisoning if guest page hinting is enabled,
> which is required to avoid possible guest memory corruption errors.
> Page poisoning is a feature in which a page is filled with a specific
> pattern (0x00 or 0xaa) after arch_free_page, and the same pattern is
> verified before arch_alloc_page to prevent the following issues:
>     *information leak from the freed data
>     *use-after-free bugs
>     *memory corruption
> The selection of the pattern depends on CONFIG_PAGE_POISONING_ZERO.
> Once the guest pages which are supposed to be freed are sent to the
> hypervisor, it frees them. After freeing the pages in the global list,
> either of the following may happen:
>     *the hypervisor reallocates the freed memory back to the guest
>     *the hypervisor frees the memory and maps different physical memory
> In order to prevent any information leak, the hypervisor fills the
> memory with zeroes before allocating it to the guest.
> The issue arises when the pattern used for page poisoning is 0xaa
> while the newly allocated page received from the hypervisor by the
> guest is filled with the pattern 0x00, which results in memory
> corruption errors.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

This seems kind of backwards to me. Why disable page poisoning instead
of just not hinting about the free pages? There shouldn't be that many
instances where page poisoning is enabled, and when it is, it would make
more sense to leave it enabled rather than silently disable it.

> ---
>  include/linux/page_hinting.h | 8 ++++++++
>  mm/page_poison.c             | 2 +-
>  virt/kvm/page_hinting.c      | 1 +
>  3 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> index 2d7ff59f3f6a..e800c6b07561 100644
> --- a/include/linux/page_hinting.h
> +++ b/include/linux/page_hinting.h
> @@ -19,7 +19,15 @@ struct hypervisor_pages {
>  extern int guest_page_hinting_flag;
>  extern struct static_key_false guest_page_hinting_key;
>  extern struct smp_hotplug_thread hinting_threads;
> +extern bool want_page_poisoning;
>
>  int guest_page_hinting_sysctl(struct ctl_table *table, int write,
>                               void __user *buffer, size_t *lenp, loff_t *ppos);
>  void guest_free_page(struct page *page, int order);
> +
> +static inline void disable_page_poisoning(void)
> +{
> +#ifdef CONFIG_PAGE_POISONING
> +       want_page_poisoning = 0;
> +#endif
> +}
> diff --git a/mm/page_poison.c b/mm/page_poison.c
> index f0c15e9017c0..9af96021133b 100644
> --- a/mm/page_poison.c
> +++ b/mm/page_poison.c
> @@ -7,7 +7,7 @@
>  #include <linux/poison.h>
>  #include <linux/ratelimit.h>
>
> -static bool want_page_poisoning __read_mostly;
> +bool want_page_poisoning __read_mostly;
>
>  static int __init early_page_poison_param(char *buf)
>  {
> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
> index 636990e7fbb3..be529f6f2bc0 100644
> --- a/virt/kvm/page_hinting.c
> +++ b/virt/kvm/page_hinting.c
> @@ -103,6 +103,7 @@ void guest_free_page(struct page *page, int order)
>
>         local_irq_save(flags);
>         if (page_hinting_obj->kvm_pt_idx != MAX_FGPT_ENTRIES) {
> +               disable_page_poisoning();
>                 page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].pfn =
>                                                         page_to_pfn(page);
>                 page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].zonenum =

At a minimum it seems like you should have some sort of warning
message that you are disabling page poisoning rather than just
silently turning it off.


* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-05 21:55       ` Michael S. Tsirkin
@ 2019-02-07 17:43         ` Alexander Duyck
  2019-02-07 19:01           ` Michael S. Tsirkin
  2019-02-07 20:50           ` Nitesh Narayan Lal
  0 siblings, 2 replies; 116+ messages in thread
From: Alexander Duyck @ 2019-02-07 17:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm list, LKML, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, riel, david, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Tue, Feb 5, 2019 at 3:21 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Feb 05, 2019 at 04:54:03PM -0500, Nitesh Narayan Lal wrote:
> >
> > On 2/5/19 3:45 PM, Michael S. Tsirkin wrote:
> > > On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
> > >> This patch enables the kernel to scan the per-CPU array and
> > >> compress it by removing repetitive/re-allocated pages.
> > >> Once the per-CPU array is completely filled with pages in the
> > >> buddy, it wakes up the per-CPU kernel thread, which re-scans the
> > >> entire per-CPU array by acquiring the zone lock corresponding to
> > >> the page being scanned. If the page is still free and
> > >> present in the buddy, it tries to isolate the page and add it
> > >> to another per-CPU array.
> > >>
> > >> Once this scanning process is complete, and if there are any
> > >> isolated pages added to the new per-CPU array, the kernel thread
> > >> invokes hyperlist_ready().
> > >>
> > >> In hyperlist_ready() a hypercall is made to report these pages to
> > >> the host using the virtio-balloon framework. In order to do so,
> > >> another virtqueue 'hinting_vq' is added to the balloon framework.
> > >> As the host frees all the reported pages, the kernel thread returns
> > >> them back to the buddy.
> > >>
> > >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > >
> > > This looks kind of like what early iterations of Wei's patches did.
> > >
> > > But this has lots of issues; for example, you might end up with
> > > a hypercall per 4K page.
> > > So in the end, he switched over to just reporting only
> > > MAX_ORDER - 1 pages.
> > You mean that I should only capture/attempt to isolate pages with order
> > MAX_ORDER - 1?
> > >
> > > Would that be a good idea for you too?
> > Will it help if we have a threshold value based on the amount of memory
> > captured instead of the number of entries/pages in the array?
>
> This is what Wei's patches do at least.

So in the solution I had posted, I was looking at HUGETLB_PAGE_ORDER
and above as the size of pages to provide the hints on [1]. The
advantage of doing that is that you can also avoid fragmenting huge
pages, which in turn can cause what looks like a memory leak as the
memory subsystem attempts to reassemble huge pages [2]. In my mind a
2MB page makes good sense as the granularity to hint at, since
anything smaller just ends up being a bunch of extra work and causing
a bunch of fragmentation.

The only issue with limiting things on an arbitrary boundary like that
is that you have to hook into the buddy allocator to catch the cases
where a page has been merged up into that range.

[1] https://lkml.org/lkml/2019/2/4/903
[2] https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 4/7] KVM: Disabling page poisoning to prevent corruption
  2019-02-07 17:23   ` Alexander Duyck
@ 2019-02-07 17:56     ` Nitesh Narayan Lal
  2019-02-07 18:24       ` Alexander Duyck
  0 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-07 17:56 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang,
	Yang Zhang, riel, david, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli


On 2/7/19 12:23 PM, Alexander Duyck wrote:
> On Mon, Feb 4, 2019 at 2:11 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch disables page poisoning if guest page hinting is enabled.
>> It is required to avoid possible guest memory corruption errors.
>> Page Poisoning is a feature in which the page is filled with a specific
>> pattern of (0x00 or 0xaa) after arch_free_page and the same is verified
>> before arch_alloc_page to prevent following issues:
>>     *information leak from the freed data
>>     *use after free bugs
>>     *memory corruption
>> Selection of the pattern depends on the CONFIG_PAGE_POISONING_ZERO
>> Once the guest pages which are supposed to be freed are sent to the
>> hypervisor it frees them. After freeing the pages in the global list
>> following things may happen:
>>     *Hypervisor reallocates the freed memory back to the guest
>>     *Hypervisor frees the memory and maps a different physical memory
>> In order to prevent any information leak hypervisor before allocating
>> memory to the guest fills it with zeroes.
>> The issue arises when the pattern used for Page Poisoning is 0xaa while
>> the newly allocated page received from the hypervisor by the guest is
>> filled with the pattern 0x00. This will result in memory corruption errors.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> This seems kind of backwards to me. Why disable page poisoning instead
> of just not hinting about the free pages? There shouldn't be that many
> instances when page poisoning is enabled, and when it is it would make
> more sense to leave it enabled rather than silently disable it.
As I have mentioned in the cover email, I intend to reuse Wei's already
merged work.

This will enable the guest to communicate the poison value in
use to the host.

>
>> ---
>>  include/linux/page_hinting.h | 8 ++++++++
>>  mm/page_poison.c             | 2 +-
>>  virt/kvm/page_hinting.c      | 1 +
>>  3 files changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>> index 2d7ff59f3f6a..e800c6b07561 100644
>> --- a/include/linux/page_hinting.h
>> +++ b/include/linux/page_hinting.h
>> @@ -19,7 +19,15 @@ struct hypervisor_pages {
>>  extern int guest_page_hinting_flag;
>>  extern struct static_key_false guest_page_hinting_key;
>>  extern struct smp_hotplug_thread hinting_threads;
>> +extern bool want_page_poisoning;
>>
>>  int guest_page_hinting_sysctl(struct ctl_table *table, int write,
>>                               void __user *buffer, size_t *lenp, loff_t *ppos);
>>  void guest_free_page(struct page *page, int order);
>> +
>> +static inline void disable_page_poisoning(void)
>> +{
>> +#ifdef CONFIG_PAGE_POISONING
>> +       want_page_poisoning = 0;
>> +#endif
>> +}
>> diff --git a/mm/page_poison.c b/mm/page_poison.c
>> index f0c15e9017c0..9af96021133b 100644
>> --- a/mm/page_poison.c
>> +++ b/mm/page_poison.c
>> @@ -7,7 +7,7 @@
>>  #include <linux/poison.h>
>>  #include <linux/ratelimit.h>
>>
>> -static bool want_page_poisoning __read_mostly;
>> +bool want_page_poisoning __read_mostly;
>>
>>  static int __init early_page_poison_param(char *buf)
>>  {
>> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
>> index 636990e7fbb3..be529f6f2bc0 100644
>> --- a/virt/kvm/page_hinting.c
>> +++ b/virt/kvm/page_hinting.c
>> @@ -103,6 +103,7 @@ void guest_free_page(struct page *page, int order)
>>
>>         local_irq_save(flags);
>>         if (page_hinting_obj->kvm_pt_idx != MAX_FGPT_ENTRIES) {
>> +               disable_page_poisoning();
>>                 page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].pfn =
>>                                                         page_to_pfn(page);
>>                 page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].zonenum =
> At a minimum it seems like you should have some sort of warning
> message that you are disabling page poisoning rather than just
> silently turning it off.
-- 
Regards
Nitesh




* Re: [RFC][Patch v8 4/7] KVM: Disabling page poisoning to prevent corruption
  2019-02-07 17:56     ` Nitesh Narayan Lal
@ 2019-02-07 18:24       ` Alexander Duyck
  2019-02-07 19:14         ` Michael S. Tsirkin
  0 siblings, 1 reply; 116+ messages in thread
From: Alexander Duyck @ 2019-02-07 18:24 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang,
	Yang Zhang, riel, david, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Thu, Feb 7, 2019 at 9:56 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 2/7/19 12:23 PM, Alexander Duyck wrote:
> > On Mon, Feb 4, 2019 at 2:11 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >> This patch disables page poisoning if guest page hinting is enabled.
> >> It is required to avoid possible guest memory corruption errors.
> >> Page Poisoning is a feature in which the page is filled with a specific
> >> pattern of (0x00 or 0xaa) after arch_free_page and the same is verified
> >> before arch_alloc_page to prevent following issues:
> >>     *information leak from the freed data
> >>     *use after free bugs
> >>     *memory corruption
> >> Selection of the pattern depends on the CONFIG_PAGE_POISONING_ZERO
> >> Once the guest pages which are supposed to be freed are sent to the
> >> hypervisor it frees them. After freeing the pages in the global list
> >> following things may happen:
> >>     *Hypervisor reallocates the freed memory back to the guest
> >>     *Hypervisor frees the memory and maps a different physical memory
> >> In order to prevent any information leak hypervisor before allocating
> >> memory to the guest fills it with zeroes.
> >> The issue arises when the pattern used for Page Poisoning is 0xaa while
> >> the newly allocated page received from the hypervisor by the guest is
> >> filled with the pattern 0x00. This will result in memory corruption errors.
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > This seems kind of backwards to me. Why disable page poisoning instead
> > of just not hinting about the free pages? There shouldn't be that many
> > instances when page poisoning is enabled, and when it is it would make
> > more sense to leave it enabled rather than silently disable it.
> As I have mentioned in the cover email, I intend to reuse Wei's already
> merged work.
>
> This will enable the guest to communicate the poison value which is in
> use to the host.

That is far from reliable given that you are having to buffer
the pages for some period of time. I really think it would be better
to just allow page poisoning to function, and once you can support
applying poison to a newly allocated page you could look at
re-enabling hinting.

What I am getting at is that those who care about poisoning are
unlikely to care about performance, and I would lump memory hinting in
with other performance features.


* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-07 17:43         ` Alexander Duyck
@ 2019-02-07 19:01           ` Michael S. Tsirkin
  2019-02-07 20:50           ` Nitesh Narayan Lal
  1 sibling, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-07 19:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Nitesh Narayan Lal, kvm list, LKML, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, riel, david, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Thu, Feb 07, 2019 at 09:43:44AM -0800, Alexander Duyck wrote:
> On Tue, Feb 5, 2019 at 3:21 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Tue, Feb 05, 2019 at 04:54:03PM -0500, Nitesh Narayan Lal wrote:
> > >
> > > On 2/5/19 3:45 PM, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
> > > >> This patch enables the kernel to scan the per cpu array and
> > > >> compress it by removing the repetitive/re-allocated pages.
> > > >> Once the per cpu array is completely filled with pages in the
> > > >> buddy it wakes up the kernel per cpu thread which re-scans the
> > > >> entire per cpu array by acquiring a zone lock corresponding to
> > > >> the page which is being scanned. If the page is still free and
> > > >> present in the buddy it tries to isolate the page and adds it
> > > >> to another per cpu array.
> > > >>
> > > >> Once this scanning process is complete and if there are any
> > > >> isolated pages added to the new per cpu array kernel thread
> > > >> invokes hyperlist_ready().
> > > >>
> > > >> In hyperlist_ready() a hypercall is made to report these pages to
> > > >> the host using the virtio-balloon framework. In order to do so
> > > >> another virtqueue 'hinting_vq' is added to the balloon framework.
> > > >> As the host frees all the reported pages, the kernel thread returns
> > > >> them back to the buddy.
> > > >>
> > > >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > > >
> > > > This looks kind of like what early iterations of Wei's patches did.
> > > >
> > > > But this has lots of issues, for example you might end up with
> > > > a hypercall per a 4K page.
> > > > So in the end, he switched over to just reporting only
> > > > MAX_ORDER - 1 pages.
> > > You mean that I should only capture/attempt to isolate pages with order
> > > MAX_ORDER - 1?
> > > >
> > > > Would that be a good idea for you too?
> > > Will it help if we have a threshold value based on the amount of memory
> > > captured instead of the number of entries/pages in the array?
> >
> > This is what Wei's patches do at least.
> 
> So in the solution I had posted I was looking more at
> HUGETLB_PAGE_ORDER and above as the size of pages to provide the hints
> on [1]. The advantage to doing that is that you can also avoid
> fragmenting huge pages which in turn can cause what looks like a
> memory leak as the memory subsystem attempts to reassemble huge
> pages[2]. In my mind a 2MB page makes good sense in terms of the size
> of things to be performing hints on as anything smaller than that is
> going to just end up being a bunch of extra work and end up causing a
> bunch of fragmentation.

Yes MAX_ORDER-1 is 4M. So not a lot of difference on x86.

The idea behind keying off MAX_ORDER is that CPU huge pages aren't
the only reason to avoid fragmentation; there is other
hardware that benefits from linear physical addresses.
And there are weird platforms where HUGETLB_PAGE_ORDER exceeds
MAX_ORDER - 1. So from that POV keying it off MAX_ORDER
makes more sense.


> The only issue with limiting things on an arbitrary boundary like that
> is that you have to hook into the buddy allocator to catch the cases
> where a page has been merged up into that range.
> 
> [1] https://lkml.org/lkml/2019/2/4/903
> [2] https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/


* Re: [RFC][Patch v8 4/7] KVM: Disabling page poisoning to prevent corruption
  2019-02-07 18:24       ` Alexander Duyck
@ 2019-02-07 19:14         ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-07 19:14 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Nitesh Narayan Lal, kvm list, LKML, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, riel, david, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Thu, Feb 07, 2019 at 10:24:20AM -0800, Alexander Duyck wrote:
> On Thu, Feb 7, 2019 at 9:56 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >
> >
> > On 2/7/19 12:23 PM, Alexander Duyck wrote:
> > > On Mon, Feb 4, 2019 at 2:11 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> > >> This patch disables page poisoning if guest page hinting is enabled.
> > >> It is required to avoid possible guest memory corruption errors.
> > >> Page Poisoning is a feature in which the page is filled with a specific
> > >> pattern of (0x00 or 0xaa) after arch_free_page and the same is verified
> > >> before arch_alloc_page to prevent following issues:
> > >>     *information leak from the freed data
> > >>     *use after free bugs
> > >>     *memory corruption
> > >> Selection of the pattern depends on the CONFIG_PAGE_POISONING_ZERO
> > >> Once the guest pages which are supposed to be freed are sent to the
> > >> hypervisor it frees them. After freeing the pages in the global list
> > >> following things may happen:
> > >>     *Hypervisor reallocates the freed memory back to the guest
> > >>     *Hypervisor frees the memory and maps a different physical memory
> > >> In order to prevent any information leak hypervisor before allocating
> > >> memory to the guest fills it with zeroes.
> > >> The issue arises when the pattern used for Page Poisoning is 0xaa while
> > >> the newly allocated page received from the hypervisor by the guest is
> > >> filled with the pattern 0x00. This will result in memory corruption errors.
> > >>
> > >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > > This seems kind of backwards to me. Why disable page poisoning instead
> > > of just not hinting about the free pages? There shouldn't be that many
> > > instances when page poisoning is enabled, and when it is it would make
> > > more sense to leave it enabled rather than silently disable it.
> > As I have mentioned in the cover email, I intend to reuse Wei's already
> > merged work.
> >
> > This will enable the guest to communicate the poison value which is in
> > use to the host.
> 
> That is far from being reliable given that you are having to buffer
> the pages for some period of time. I really think it would be better
> to just allow page poisoning to function and when you can support
> applying poison to a newly allocated page then you could look at
> re-enabling it.
> 
> What I am getting at is that those that care about poisoning won't
> likely care about performance and I would lump the memory hinting in
> with other performance features.

It's not just a performance issue.

There is also an issue with the host/guest API. Once the host discards
pages, it currently always gives you back zero-filled pages.
So a guest that looks for the poison value, e.g. via unpoison_page,
will crash unless the poison value is 0.

The idea behind the current code is just to let the host know:
it can either be more careful with these pages,
or skip them completely.


-- 
MST


* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-07 17:43         ` Alexander Duyck
  2019-02-07 19:01           ` Michael S. Tsirkin
@ 2019-02-07 20:50           ` Nitesh Narayan Lal
  2019-02-08 17:58             ` Alexander Duyck
  1 sibling, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-07 20:50 UTC (permalink / raw)
  To: Alexander Duyck, Michael S. Tsirkin
  Cc: kvm list, LKML, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang,
	Yang Zhang, riel, david, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli


On 2/7/19 12:43 PM, Alexander Duyck wrote:
> On Tue, Feb 5, 2019 at 3:21 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Tue, Feb 05, 2019 at 04:54:03PM -0500, Nitesh Narayan Lal wrote:
>>> On 2/5/19 3:45 PM, Michael S. Tsirkin wrote:
>>>> On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
>>>>> This patch enables the kernel to scan the per cpu array and
>>>>> compress it by removing the repetitive/re-allocated pages.
>>>>> Once the per cpu array is completely filled with pages in the
>>>>> buddy it wakes up the kernel per cpu thread which re-scans the
>>>>> entire per cpu array by acquiring a zone lock corresponding to
>>>>> the page which is being scanned. If the page is still free and
>>>>> present in the buddy it tries to isolate the page and adds it
>>>>> to another per cpu array.
>>>>>
>>>>> Once this scanning process is complete and if there are any
>>>>> isolated pages added to the new per cpu array kernel thread
>>>>> invokes hyperlist_ready().
>>>>>
>>>>> In hyperlist_ready() a hypercall is made to report these pages to
>>>>> the host using the virtio-balloon framework. In order to do so
>>>>> another virtqueue 'hinting_vq' is added to the balloon framework.
>>>>> As the host frees all the reported pages, the kernel thread returns
>>>>> them back to the buddy.
>>>>>
>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>>> This looks kind of like what early iterations of Wei's patches did.
>>>>
>>>> But this has lots of issues, for example you might end up with
>>>> a hypercall per a 4K page.
>>>> So in the end, he switched over to just reporting only
>>>> MAX_ORDER - 1 pages.
>>> You mean that I should only capture/attempt to isolate pages with order
>>> MAX_ORDER - 1?
>>>> Would that be a good idea for you too?
>>> Will it help if we have a threshold value based on the amount of memory
>>> captured instead of the number of entries/pages in the array?
>> This is what Wei's patches do at least.
> So in the solution I had posted I was looking more at
> HUGETLB_PAGE_ORDER and above as the size of pages to provide the hints
> on [1]. The advantage to doing that is that you can also avoid
> fragmenting huge pages which in turn can cause what looks like a
> memory leak as the memory subsystem attempts to reassemble huge
> pages[2]. In my mind a 2MB page makes good sense in terms of the size
> of things to be performing hints on as anything smaller than that is
> going to just end up being a bunch of extra work and end up causing a
> bunch of fragmentation.
In my opinion, whichever implementation is chosen, the page size worth
capturing before reporting depends on the allocation pattern of the
workload running in the guest.

I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
However, I am still thinking about a workload I can use to test its
effectiveness.

>
> The only issue with limiting things on an arbitrary boundary like that
> is that you have to hook into the buddy allocator to catch the cases
> where a page has been merged up into that range.
I don't think I understood your comment completely. In any case, we
have to rely on the buddy for merging the pages.
>
> [1] https://lkml.org/lkml/2019/2/4/903
> [2] https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/
-- 
Regards
Nitesh




* Re: [RFC][Patch v8 4/7] KVM: Disabling page poisoning to prevent corruption
  2019-02-04 20:18 ` [RFC][Patch v8 4/7] KVM: Disabling page poisoning to prevent corruption Nitesh Narayan Lal
  2019-02-07 17:23   ` Alexander Duyck
@ 2019-02-07 21:08   ` Michael S. Tsirkin
  1 sibling, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-07 21:08 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, david, dodgen, konrad.wilk, dhildenb,
	aarcange

On Mon, Feb 04, 2019 at 03:18:51PM -0500, Nitesh Narayan Lal wrote:
> This patch disables page poisoning if guest page hinting is enabled.
> It is required to avoid possible guest memory corruption errors.
> Page Poisoning is a feature in which the page is filled with a specific
> pattern of (0x00 or 0xaa) after arch_free_page and the same is verified
> before arch_alloc_page to prevent following issues:
>     *information leak from the freed data
>     *use after free bugs
>     *memory corruption
> Selection of the pattern depends on the CONFIG_PAGE_POISONING_ZERO
> Once the guest pages which are supposed to be freed are sent to the
> hypervisor it frees them. After freeing the pages in the global list
> following things may happen:
>     *Hypervisor reallocates the freed memory back to the guest
>     *Hypervisor frees the memory and maps a different physical memory
> In order to prevent any information leak hypervisor before allocating
> memory to the guest fills it with zeroes.
> The issue arises when the pattern used for Page Poisoning is 0xaa while
> the newly allocated page received from the hypervisor by the guest is
> filled with the pattern 0x00. This will result in memory corruption errors.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

IMHO it's better to take the approach of the existing balloon code
and just send the poison value to host. Host can then avoid filling
memory with zeroes.


> ---
>  include/linux/page_hinting.h | 8 ++++++++
>  mm/page_poison.c             | 2 +-
>  virt/kvm/page_hinting.c      | 1 +
>  3 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> index 2d7ff59f3f6a..e800c6b07561 100644
> --- a/include/linux/page_hinting.h
> +++ b/include/linux/page_hinting.h
> @@ -19,7 +19,15 @@ struct hypervisor_pages {
>  extern int guest_page_hinting_flag;
>  extern struct static_key_false guest_page_hinting_key;
>  extern struct smp_hotplug_thread hinting_threads;
> +extern bool want_page_poisoning;
>  
>  int guest_page_hinting_sysctl(struct ctl_table *table, int write,
>  			      void __user *buffer, size_t *lenp, loff_t *ppos);
>  void guest_free_page(struct page *page, int order);
> +
> +static inline void disable_page_poisoning(void)
> +{
> +#ifdef CONFIG_PAGE_POISONING
> +	want_page_poisoning = 0;
> +#endif
> +}
> diff --git a/mm/page_poison.c b/mm/page_poison.c
> index f0c15e9017c0..9af96021133b 100644
> --- a/mm/page_poison.c
> +++ b/mm/page_poison.c
> @@ -7,7 +7,7 @@
>  #include <linux/poison.h>
>  #include <linux/ratelimit.h>
>  
> -static bool want_page_poisoning __read_mostly;
> +bool want_page_poisoning __read_mostly;
>  
>  static int __init early_page_poison_param(char *buf)
>  {
> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
> index 636990e7fbb3..be529f6f2bc0 100644
> --- a/virt/kvm/page_hinting.c
> +++ b/virt/kvm/page_hinting.c
> @@ -103,6 +103,7 @@ void guest_free_page(struct page *page, int order)
>  
>  	local_irq_save(flags);
>  	if (page_hinting_obj->kvm_pt_idx != MAX_FGPT_ENTRIES) {
> +		disable_page_poisoning();
>  		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].pfn =
>  							page_to_pfn(page);
>  		page_hinting_obj->kvm_pt[page_hinting_obj->kvm_pt_idx].zonenum =
> -- 
> 2.17.2


* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-07 20:50           ` Nitesh Narayan Lal
@ 2019-02-08 17:58             ` Alexander Duyck
  2019-02-08 20:41               ` Nitesh Narayan Lal
  2019-02-08 21:35               ` Michael S. Tsirkin
  0 siblings, 2 replies; 116+ messages in thread
From: Alexander Duyck @ 2019-02-08 17:58 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Michael S. Tsirkin, kvm list, LKML, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, david, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Thu, Feb 7, 2019 at 12:50 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 2/7/19 12:43 PM, Alexander Duyck wrote:
> > On Tue, Feb 5, 2019 at 3:21 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >> On Tue, Feb 05, 2019 at 04:54:03PM -0500, Nitesh Narayan Lal wrote:
> >>> On 2/5/19 3:45 PM, Michael S. Tsirkin wrote:
> >>>> On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
> >>>>> This patch enables the kernel to scan the per cpu array and
> >>>>> compress it by removing the repetitive/re-allocated pages.
> >>>>> Once the per cpu array is completely filled with pages in the
> >>>>> buddy it wakes up the kernel per cpu thread which re-scans the
> >>>>> entire per cpu array by acquiring a zone lock corresponding to
> >>>>> the page which is being scanned. If the page is still free and
> >>>>> present in the buddy it tries to isolate the page and adds it
> >>>>> to another per cpu array.
> >>>>>
> >>>>> Once this scanning process is complete and if there are any
> >>>>> isolated pages added to the new per cpu array kernel thread
> >>>>> invokes hyperlist_ready().
> >>>>>
> >>>>> In hyperlist_ready() a hypercall is made to report these pages to
> >>>>> the host using the virtio-balloon framework. In order to do so
> >>>>> another virtqueue 'hinting_vq' is added to the balloon framework.
> >>>>> As the host frees all the reported pages, the kernel thread returns
> >>>>> them back to the buddy.
> >>>>>
> >>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >>>> This looks kind of like what early iterations of Wei's patches did.
> >>>>
> >>>> But this has lots of issues, for example you might end up with
> >>>> a hypercall per a 4K page.
> >>>> So in the end, he switched over to just reporting only
> >>>> MAX_ORDER - 1 pages.
> >>> You mean that I should only capture/attempt to isolate pages with order
> >>> MAX_ORDER - 1?
> >>>> Would that be a good idea for you too?
> >>> Will it help if we have a threshold value based on the amount of memory
> >>> captured instead of the number of entries/pages in the array?
> >> This is what Wei's patches do at least.
> > So in the solution I had posted I was looking more at
> > HUGETLB_PAGE_ORDER and above as the size of pages to provide the hints
> > on [1]. The advantage to doing that is that you can also avoid
> > fragmenting huge pages which in turn can cause what looks like a
> > memory leak as the memory subsystem attempts to reassemble huge
> > pages[2]. In my mind a 2MB page makes good sense in terms of the size
> > of things to be performing hints on as anything smaller than that is
> > going to just end up being a bunch of extra work and end up causing a
> > bunch of fragmentation.
> In my opinion, whichever implementation is chosen, the page size worth
> capturing before reporting depends on the allocation pattern of the
> workload running in the guest.

I suggest you take a look at item 2 that I had called out in the
previous email. There are known issues with providing hints smaller
than THP using MADV_DONTNEED or MADV_FREE. Specifically, what will
happen is that you end up breaking a higher-order transparent huge
page apart, backfilling a few holes with other pages; the memory
allocation subsystem then attempts to reassemble the larger THP page,
resulting in an application that exhibits behavior similar to a memory
leak while not actually allocating memory, since it is sitting on
fragments of THP pages.

Also, while I am thinking of it, I haven't noticed anywhere that you
handle the case of a device assigned to the guest. That seems like a
spot where we are going to have to stop hinting as well, aren't we?
Otherwise we would need to redo the guest's memory mapping in the
IOMMU every time a page is evicted and replaced.

> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
> However I am still thinking about a workload which I can use to test its
> effectiveness.

You might want to look at doing something like min(MAX_ORDER - 1,
HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
THP, which is the page size most likely to be used by the guest.

> >
> > The only issue with limiting things on an arbitrary boundary like that
> > is that you have to hook into the buddy allocator to catch the cases
> > where a page has been merged up into that range.
> I don't think I understood your comment completely. In any case, we
> have to rely on the buddy for merging the pages.
> >
> > [1] https://lkml.org/lkml/2019/2/4/903
> > [2] https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/
> --
> Regards
> Nitesh
>


* Re: [RFC][Patch v8 2/7] KVM: Enabling guest free page hinting via static key
  2019-02-04 20:18 ` [RFC][Patch v8 2/7] KVM: Enabling guest free page hinting via static key Nitesh Narayan Lal
@ 2019-02-08 18:07   ` Alexander Duyck
  2019-02-08 18:22     ` Nitesh Narayan Lal
  0 siblings, 1 reply; 116+ messages in thread
From: Alexander Duyck @ 2019-02-08 18:07 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang,
	Yang Zhang, Rik van Riel, david, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Mon, Feb 4, 2019 at 2:11 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> This patch enables the guest free page hinting support
> to enable or disable based on the STATIC key which
> could be set via sysctl.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  include/linux/gfp.h          |  2 ++
>  include/linux/page_hinting.h |  5 +++++
>  kernel/sysctl.c              |  9 +++++++++
>  virt/kvm/page_hinting.c      | 23 +++++++++++++++++++++++
>  4 files changed, 39 insertions(+)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index e596527284ba..8389219a076a 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -461,6 +461,8 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>  #define HAVE_ARCH_FREE_PAGE
>  static inline void arch_free_page(struct page *page, int order)
>  {
> +       if (!static_branch_unlikely(&guest_page_hinting_key))
> +               return;
>         guest_free_page(page, order);
>  }
>  #endif
> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> index b54f7428f348..9bdcf63e1306 100644
> --- a/include/linux/page_hinting.h
> +++ b/include/linux/page_hinting.h
> @@ -14,4 +14,9 @@ struct hypervisor_pages {
>         unsigned int order;
>  };
>
> +extern int guest_page_hinting_flag;
> +extern struct static_key_false guest_page_hinting_key;
> +
> +int guest_page_hinting_sysctl(struct ctl_table *table, int write,
> +                             void __user *buffer, size_t *lenp, loff_t *ppos);
>  void guest_free_page(struct page *page, int order);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index ba4d9e85feb8..5d53629c9bfb 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1690,6 +1690,15 @@ static struct ctl_table vm_table[] = {
>                 .extra1         = (void *)&mmap_rnd_compat_bits_min,
>                 .extra2         = (void *)&mmap_rnd_compat_bits_max,
>         },
> +#endif
> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING
> +       {
> +               .procname       = "guest-page-hinting",
> +               .data           = &guest_page_hinting_flag,
> +               .maxlen         = sizeof(guest_page_hinting_flag),
> +               .mode           = 0644,
> +               .proc_handler   = guest_page_hinting_sysctl,
> +       },
>  #endif
>         { }
>  };

Since you are adding a new sysctl shouldn't you also be updating
Documentation/sysctl/vm.txt?

> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
> index 818bd6b84e0c..4a34ea8db0c8 100644
> --- a/virt/kvm/page_hinting.c
> +++ b/virt/kvm/page_hinting.c
> @@ -1,6 +1,7 @@
>  #include <linux/gfp.h>
>  #include <linux/mm.h>
>  #include <linux/kernel.h>
> +#include <linux/kvm_host.h>
>
>  /*
>   * struct kvm_free_pages - Tracks the pages which are freed by the guest.
> @@ -31,6 +32,28 @@ struct page_hinting {
>
>  DEFINE_PER_CPU(struct page_hinting, hinting_obj);
>
> +struct static_key_false guest_page_hinting_key  = STATIC_KEY_FALSE_INIT;
> +EXPORT_SYMBOL(guest_page_hinting_key);
> +static DEFINE_MUTEX(hinting_mutex);
> +int guest_page_hinting_flag;
> +EXPORT_SYMBOL(guest_page_hinting_flag);

I'm not entirely sure this flag makes sense to me. What is to prevent
someone from turning this on when there is no means of actually using
the hints. I understand right now that guest_free_page doesn't
actually do anything, but when it does I would assume it has to
interact with a device. If that device is not present would it still
make sense for us to be generating hints?

> +
> +int guest_page_hinting_sysctl(struct ctl_table *table, int write,
> +                             void __user *buffer, size_t *lenp,
> +                             loff_t *ppos)
> +{
> +       int ret;
> +
> +       mutex_lock(&hinting_mutex);
> +       ret = proc_dointvec(table, write, buffer, lenp, ppos);
> +       if (guest_page_hinting_flag)
> +               static_key_enable(&guest_page_hinting_key.key);
> +       else
> +               static_key_disable(&guest_page_hinting_key.key);
> +       mutex_unlock(&hinting_mutex);
> +       return ret;
> +}
> +
>  void guest_free_page(struct page *page, int order)
>  {
>  }
> --
> 2.17.2
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 2/7] KVM: Enabling guest free page hinting via static key
  2019-02-08 18:07   ` Alexander Duyck
@ 2019-02-08 18:22     ` Nitesh Narayan Lal
  0 siblings, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-08 18:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang,
	Yang Zhang, Rik van Riel, david, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli


On 2/8/19 1:07 PM, Alexander Duyck wrote:
> On Mon, Feb 4, 2019 at 2:11 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch allows guest free page hinting to be enabled
>> or disabled at runtime via a static key, which can be
>> toggled through sysctl.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  include/linux/gfp.h          |  2 ++
>>  include/linux/page_hinting.h |  5 +++++
>>  kernel/sysctl.c              |  9 +++++++++
>>  virt/kvm/page_hinting.c      | 23 +++++++++++++++++++++++
>>  4 files changed, 39 insertions(+)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index e596527284ba..8389219a076a 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -461,6 +461,8 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>>  #define HAVE_ARCH_FREE_PAGE
>>  static inline void arch_free_page(struct page *page, int order)
>>  {
>> +       if (!static_branch_unlikely(&guest_page_hinting_key))
>> +               return;
>>         guest_free_page(page, order);
>>  }
>>  #endif
>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>> index b54f7428f348..9bdcf63e1306 100644
>> --- a/include/linux/page_hinting.h
>> +++ b/include/linux/page_hinting.h
>> @@ -14,4 +14,9 @@ struct hypervisor_pages {
>>         unsigned int order;
>>  };
>>
>> +extern int guest_page_hinting_flag;
>> +extern struct static_key_false guest_page_hinting_key;
>> +
>> +int guest_page_hinting_sysctl(struct ctl_table *table, int write,
>> +                             void __user *buffer, size_t *lenp, loff_t *ppos);
>>  void guest_free_page(struct page *page, int order);
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index ba4d9e85feb8..5d53629c9bfb 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -1690,6 +1690,15 @@ static struct ctl_table vm_table[] = {
>>                 .extra1         = (void *)&mmap_rnd_compat_bits_min,
>>                 .extra2         = (void *)&mmap_rnd_compat_bits_max,
>>         },
>> +#endif
>> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING
>> +       {
>> +               .procname       = "guest-page-hinting",
>> +               .data           = &guest_page_hinting_flag,
>> +               .maxlen         = sizeof(guest_page_hinting_flag),
>> +               .mode           = 0644,
>> +               .proc_handler   = guest_page_hinting_sysctl,
>> +       },
>>  #endif
>>         { }
>>  };
> Since you are adding a new sysctl shouldn't you also be updating
> Documentation/sysctl/vm.txt?
Indeed I will be doing that.
However, I would first like to close other major issues with the design.
>
>> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
>> index 818bd6b84e0c..4a34ea8db0c8 100644
>> --- a/virt/kvm/page_hinting.c
>> +++ b/virt/kvm/page_hinting.c
>> @@ -1,6 +1,7 @@
>>  #include <linux/gfp.h>
>>  #include <linux/mm.h>
>>  #include <linux/kernel.h>
>> +#include <linux/kvm_host.h>
>>
>>  /*
>>   * struct kvm_free_pages - Tracks the pages which are freed by the guest.
>> @@ -31,6 +32,28 @@ struct page_hinting {
>>
>>  DEFINE_PER_CPU(struct page_hinting, hinting_obj);
>>
>> +struct static_key_false guest_page_hinting_key  = STATIC_KEY_FALSE_INIT;
>> +EXPORT_SYMBOL(guest_page_hinting_key);
>> +static DEFINE_MUTEX(hinting_mutex);
>> +int guest_page_hinting_flag;
>> +EXPORT_SYMBOL(guest_page_hinting_flag);
> I'm not entirely sure this flag makes sense to me. What is to prevent
> someone from turning this on when there is no means of actually using
> the hints. I understand right now that guest_free_page doesn't
> actually do anything, but when it does I would assume it has to
> interact with a device. If that device is not present would it still
> make sense for us to be generating hints?
Fair point, I will address this issue.
>
>> +
>> +int guest_page_hinting_sysctl(struct ctl_table *table, int write,
>> +                             void __user *buffer, size_t *lenp,
>> +                             loff_t *ppos)
>> +{
>> +       int ret;
>> +
>> +       mutex_lock(&hinting_mutex);
>> +       ret = proc_dointvec(table, write, buffer, lenp, ppos);
>> +       if (guest_page_hinting_flag)
>> +               static_key_enable(&guest_page_hinting_key.key);
>> +       else
>> +               static_key_disable(&guest_page_hinting_key.key);
>> +       mutex_unlock(&hinting_mutex);
>> +       return ret;
>> +}
>> +
>>  void guest_free_page(struct page *page, int order)
>>  {
>>  }
>> --
>> 2.17.2
>>
-- 
Regards
Nitesh



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-08 17:58             ` Alexander Duyck
@ 2019-02-08 20:41               ` Nitesh Narayan Lal
  2019-02-08 21:38                 ` Michael S. Tsirkin
  2019-02-08 21:35               ` Michael S. Tsirkin
  1 sibling, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-08 20:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, kvm list, LKML, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, david, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli



On 2/8/19 12:58 PM, Alexander Duyck wrote:
> On Thu, Feb 7, 2019 at 12:50 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 2/7/19 12:43 PM, Alexander Duyck wrote:
>>> On Tue, Feb 5, 2019 at 3:21 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> On Tue, Feb 05, 2019 at 04:54:03PM -0500, Nitesh Narayan Lal wrote:
>>>>> On 2/5/19 3:45 PM, Michael S. Tsirkin wrote:
>>>>>> On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
>>>>>>> This patch enables the kernel to scan the per cpu array and
>>>>>>> compress it by removing the repetitive/re-allocated pages.
>>>>>>> Once the per cpu array is completely filled with pages in the
>>>>>>> buddy it wakes up the kernel per cpu thread which re-scans the
>>>>>>> entire per cpu array by acquiring a zone lock corresponding to
>>>>>>> the page which is being scanned. If the page is still free and
>>>>>>> present in the buddy it tries to isolate the page and adds it
>>>>>>> to another per cpu array.
>>>>>>>
>>>>>>> Once this scanning process is complete and if there are any
>>>>>>> isolated pages added to the new per cpu array kernel thread
>>>>>>> invokes hyperlist_ready().
>>>>>>>
>>>>>>> In hyperlist_ready() a hypercall is made to report these pages to
>>>>>>> the host using the virtio-balloon framework. In order to do so
>>>>>>> another virtqueue 'hinting_vq' is added to the balloon framework.
>>>>>>> As the host frees all the reported pages, the kernel thread returns
>>>>>>> them back to the buddy.
>>>>>>>
>>>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>>>>> This looks kind of like what early iterations of Wei's patches did.
>>>>>>
>>>>>> But this has lots of issues, for example you might end up with
>>>>>> a hypercall per a 4K page.
>>>>>> So in the end, he switched over to just reporting only
>>>>>> MAX_ORDER - 1 pages.
>>>>> You mean that I should only capture/attempt to isolate pages with order
>>>>> MAX_ORDER - 1?
>>>>>> Would that be a good idea for you too?
>>>>> Will it help if we have a threshold value based on the amount of memory
>>>>> captured instead of the number of entries/pages in the array?
>>>> This is what Wei's patches do at least.
>>> So in the solution I had posted I was looking more at
>>> HUGETLB_PAGE_ORDER and above as the size of pages to provide the hints
>>> on [1]. The advantage to doing that is that you can also avoid
>>> fragmenting huge pages which in turn can cause what looks like a
>>> memory leak as the memory subsystem attempts to reassemble huge
>>> pages[2]. In my mind a 2MB page makes good sense in terms of the size
>>> of things to be performing hints on as anything smaller than that is
>>> going to just end up being a bunch of extra work and end up causing a
>>> bunch of fragmentation.
>> In my opinion, in any implementation the right page size to
>> capture before reporting depends on the allocation pattern of the
>> workload running in the guest.
> I suggest you take a look at item 2 that I had called out in the
> previous email. There are known issues with providing hints smaller
> than THP using MADV_DONTNEED or MADV_FREE. Specifically what will
> happen is that you end up breaking up a higher order transparent huge
> page, backfilling a few holes with other pages, but then the memory
> allocation subsystem attempts to reassemble the larger THP page
> resulting in an application exhibiting behavior similar to a memory
> leak while not actually allocating memory since it is sitting on
> fragments of THP pages.
I will look into this.
>
> Also while I am thinking of it I haven't noticed anywhere that you are
> handling the case of a device assigned to the guest. That seems like a
> spot where we are going to have to stop hinting as well aren't we?
> Otherwise we would need to redo the memory mapping of the guest in the
> IOMMU every time a page is evicted and replaced.
I haven't explored such a use-case yet, but will definitely
look into it.
>
>> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
>> However I am still thinking about a workload which I can use to test its
>> effectiveness.
> You might want to look at doing something like min(MAX_ORDER - 1,
> HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
> THP which is the most likely to be used page size with the guest.
Sure, thanks for the suggestion.
>
>>> The only issue with limiting things on an arbitrary boundary like that
>>> is that you have to hook into the buddy allocator to catch the cases
>>> where a page has been merged up into that range.
>> I don't think I understood your comment completely. In any case, we
>> have to rely on the buddy for merging the pages.
>>> [1] https://lkml.org/lkml/2019/2/4/903
>>> [2] https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/
>> --
>> Regards
>> Nitesh
>>
-- 
Regards
Nitesh



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-08 17:58             ` Alexander Duyck
  2019-02-08 20:41               ` Nitesh Narayan Lal
@ 2019-02-08 21:35               ` Michael S. Tsirkin
  1 sibling, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-08 21:35 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Nitesh Narayan Lal, kvm list, LKML, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, david, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Fri, Feb 08, 2019 at 09:58:47AM -0800, Alexander Duyck wrote:
> On Thu, Feb 7, 2019 at 12:50 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >
> >
> > On 2/7/19 12:43 PM, Alexander Duyck wrote:
> > > On Tue, Feb 5, 2019 at 3:21 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >> On Tue, Feb 05, 2019 at 04:54:03PM -0500, Nitesh Narayan Lal wrote:
> > >>> On 2/5/19 3:45 PM, Michael S. Tsirkin wrote:
> > >>>> On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
> > >>>>> This patch enables the kernel to scan the per cpu array and
> > >>>>> compress it by removing the repetitive/re-allocated pages.
> > >>>>> Once the per cpu array is completely filled with pages in the
> > >>>>> buddy it wakes up the kernel per cpu thread which re-scans the
> > >>>>> entire per cpu array by acquiring a zone lock corresponding to
> > >>>>> the page which is being scanned. If the page is still free and
> > >>>>> present in the buddy it tries to isolate the page and adds it
> > >>>>> to another per cpu array.
> > >>>>>
> > >>>>> Once this scanning process is complete and if there are any
> > >>>>> isolated pages added to the new per cpu array kernel thread
> > >>>>> invokes hyperlist_ready().
> > >>>>>
> > >>>>> In hyperlist_ready() a hypercall is made to report these pages to
> > >>>>> the host using the virtio-balloon framework. In order to do so
> > >>>>> another virtqueue 'hinting_vq' is added to the balloon framework.
> > >>>>> As the host frees all the reported pages, the kernel thread returns
> > >>>>> them back to the buddy.
> > >>>>>
> > >>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > >>>> This looks kind of like what early iterations of Wei's patches did.
> > >>>>
> > >>>> But this has lots of issues, for example you might end up with
> > >>>> a hypercall per a 4K page.
> > >>>> So in the end, he switched over to just reporting only
> > >>>> MAX_ORDER - 1 pages.
> > >>> You mean that I should only capture/attempt to isolate pages with order
> > >>> MAX_ORDER - 1?
> > >>>> Would that be a good idea for you too?
> > >>> Will it help if we have a threshold value based on the amount of memory
> > >>> captured instead of the number of entries/pages in the array?
> > >> This is what Wei's patches do at least.
> > > So in the solution I had posted I was looking more at
> > > HUGETLB_PAGE_ORDER and above as the size of pages to provide the hints
> > > on [1]. The advantage to doing that is that you can also avoid
> > > fragmenting huge pages which in turn can cause what looks like a
> > > memory leak as the memory subsystem attempts to reassemble huge
> > > pages[2]. In my mind a 2MB page makes good sense in terms of the size
> > > of things to be performing hints on as anything smaller than that is
> > > going to just end up being a bunch of extra work and end up causing a
> > > bunch of fragmentation.
> > In my opinion, in any implementation the right page size to
> > capture before reporting depends on the allocation pattern of the
> > workload running in the guest.
> 
> I suggest you take a look at item 2 that I had called out in the
> previous email. There are known issues with providing hints smaller
> than THP using MADV_DONTNEED or MADV_FREE. Specifically what will
> happen is that you end up breaking up a higher order transparent huge
> page, backfilling a few holes with other pages, but then the memory
> allocation subsystem attempts to reassemble the larger THP page
> resulting in an application exhibiting behavior similar to a memory
> leak while not actually allocating memory since it is sitting on
> fragments of THP pages.
> 
> Also while I am thinking of it I haven't noticed anywhere that you are
> handling the case of a device assigned to the guest. That seems like a
> spot where we are going to have to stop hinting as well aren't we?

That would be easy for the host to do, way easier than for the guest.

> Otherwise we would need to redo the memory mapping of the guest in the
> IOMMU every time a page is evicted and replaced.

I think that in fact we could in theory make it work.


The reason is that while Linux IOMMU APIs do not allow
this, in fact you can change a mapping just for a single
page within a huge mapping while others are used, as follows:

- create a new set of PTEs
- copy over all PTE mappings except the one
  we are changing
- change the required mapping in the new entry
- atomically update the PMD to point at new PTEs
- flush IOMMU translation cache

similarly for higher levels if there are no PTEs.

So we could come up with something like
        int (*remap)(struct iommu_domain *domain, unsigned long iova,
                   phys_addr_t paddr, size_t size, int prot);

that just tweaks a mapping for a specified range without
breaking others.



> > I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
> > However I am still thinking about a workload which I can use to test its
> > effectiveness.
> 
> You might want to look at doing something like min(MAX_ORDER - 1,
> HUGETLB_PAGE_ORDER).
> I know for x86 a 2MB page is the upper limit for
> THP which is the most likely to be used page size with the guest.

Did you mean max?

I just feel that a good order has much more to do with how
the buddy allocators works than with hardware.

And maybe TRT (the right thing) is to completely disable hinting when
HUGETLB_PAGE_ORDER > MAX_ORDER, since clearly using the
buddy allocator for hinting when that breaks huge pages
isn't a good idea.


> > >
> > > The only issue with limiting things on an arbitrary boundary like that
> > > is that you have to hook into the buddy allocator to catch the cases
> > > where a page has been merged up into that range.
> > I don't think, I understood your comment completely. In any case, we
> > have to rely on the buddy for merging the pages.
> > >
> > > [1] https://lkml.org/lkml/2019/2/4/903
> > > [2] https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/
> > --
> > Regards
> > Nitesh
> >

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-08 20:41               ` Nitesh Narayan Lal
@ 2019-02-08 21:38                 ` Michael S. Tsirkin
  2019-02-08 22:05                   ` Alexander Duyck
  0 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-08 21:38 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Alexander Duyck, kvm list, LKML, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, david, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Fri, Feb 08, 2019 at 03:41:55PM -0500, Nitesh Narayan Lal wrote:
> >> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
> >> However I am still thinking about a workload which I can use to test its
> >> effectiveness.
> > You might want to look at doing something like min(MAX_ORDER - 1,
> > HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
> > THP which is the most likely to be used page size with the guest.
> Sure, thanks for the suggestion.

Given current hinting in balloon is MAX_ORDER I'd say
share code. If you feel a need to adjust down the road,
adjust both of them with actual testing showing gains.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-08 21:38                 ` Michael S. Tsirkin
@ 2019-02-08 22:05                   ` Alexander Duyck
  2019-02-10  0:38                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 116+ messages in thread
From: Alexander Duyck @ 2019-02-08 22:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm list, LKML, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, david, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Fri, Feb 8, 2019 at 1:38 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Feb 08, 2019 at 03:41:55PM -0500, Nitesh Narayan Lal wrote:
> > >> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
> > >> However I am still thinking about a workload which I can use to test its
> > >> effectiveness.
> > > You might want to look at doing something like min(MAX_ORDER - 1,
> > > HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
> > > THP which is the most likely to be used page size with the guest.
> > Sure, thanks for the suggestion.
>
> Given current hinting in balloon is MAX_ORDER I'd say
> share code. If you feel a need to adjust down the road,
> adjust both of them with actual testing showing gains.

Actually I'm left kind of wondering why we are even going through
virtio-balloon for this? It seems like this would make much more sense
as core functionality of KVM itself for the specific architectures
rather than some side thing. In addition this could end up being
redundant when you start getting into either the s390 or PowerPC
architectures as they already have means of providing unused page
hints.

I have a set of patches I proposed that add similar functionality via
a KVM hypercall for x86 instead of doing it as a part of a Virtio
device[1].  I'm suspecting the overhead of doing things this way is
much less than having to make multiple madvise system calls from QEMU
back into the kernel.

One other concern that has been pointed out with my patchset that
would likely need to be addressed here as well is what do we do about
other hypervisors that decide to implement page hinting. We probably
should look at making this KVM/QEMU specific code run through the
paravirtual infrastructure instead of tying into the x86 arch code
directly.

[1] https://lkml.org/lkml/2019/2/4/903

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-08 22:05                   ` Alexander Duyck
@ 2019-02-10  0:38                     ` Michael S. Tsirkin
  2019-02-11  9:28                       ` David Hildenbrand
  2019-02-12 17:10                       ` Nitesh Narayan Lal
  0 siblings, 2 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-10  0:38 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Nitesh Narayan Lal, kvm list, LKML, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, david, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Fri, Feb 08, 2019 at 02:05:09PM -0800, Alexander Duyck wrote:
> On Fri, Feb 8, 2019 at 1:38 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Feb 08, 2019 at 03:41:55PM -0500, Nitesh Narayan Lal wrote:
> > > >> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
> > > >> However I am still thinking about a workload which I can use to test its
> > > >> effectiveness.
> > > > You might want to look at doing something like min(MAX_ORDER - 1,
> > > > HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
> > > > THP which is the most likely to be used page size with the guest.
> > > Sure, thanks for the suggestion.
> >
> > Given current hinting in balloon is MAX_ORDER I'd say
> > share code. If you feel a need to adjust down the road,
> > adjust both of them with actual testing showing gains.
> 
> Actually I'm left kind of wondering why we are even going through
> virtio-balloon for this?

Just look at what it does.

It improves memory overcommit if guests are cooperative, and it does
this by giving the hypervisor addresses of pages which it can discard.

It's just *exactly* like the balloon with all the same limitations.

> It seems like this would make much more sense
> as core functionality of KVM itself for the specific architectures
> rather than some side thing.

Well same as balloon: whether it's useful to you at all
would very much depend on your workloads.

This kind of cooperative functionality is good for co-located
single-tenant VMs. That's pretty niche.  The core things in KVM
generally don't trust guests.


> In addition this could end up being
> redundant when you start getting into either the s390 or PowerPC
> architectures as they already have means of providing unused page
> hints.

Interesting. Is there host support in kvm?


> I have a set of patches I proposed that add similar functionality via
> a KVM hypercall for x86 instead of doing it as a part of a Virtio
> device[1].  I'm suspecting the overhead of doing things this way is
> much less than having to make multiple madvise system calls from QEMU
> back into the kernel.

Well whether it's a virtio device is orthogonal to whether it's an
madvise call, right? You can build vhost-pagehint and that can
handle requests in a VQ within balloon and do it
within host kernel directly.

virtio rings let you pass multiple pages so it's really hard to
say which will win outright - maybe it's more important
to coalesce exits.

Nitesh, how about trying the same tests and reporting performance?


> One other concern that has been pointed out with my patchset that
> would likely need to be addressed here as well is what do we do about
> other hypervisors that decide to implement page hinting. We probably
> should look at making this KVM/QEMU specific code run through the
> paravirtual infrastructure instead of tying into the x86 arch code
> directly.
> 
> [1] https://lkml.org/lkml/2019/2/4/903


So virtio is a paravirtual interface; that's an argument for
using it, then.

In any case pls copy the Cc'd crowd on future version of your patches.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-10  0:38                     ` Michael S. Tsirkin
@ 2019-02-11  9:28                       ` David Hildenbrand
  2019-02-12  5:16                         ` Michael S. Tsirkin
  2019-02-12 17:10                       ` Nitesh Narayan Lal
  1 sibling, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-11  9:28 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alexander Duyck
  Cc: Nitesh Narayan Lal, kvm list, LKML, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On 10.02.19 01:38, Michael S. Tsirkin wrote:
> On Fri, Feb 08, 2019 at 02:05:09PM -0800, Alexander Duyck wrote:
>> On Fri, Feb 8, 2019 at 1:38 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>
>>> On Fri, Feb 08, 2019 at 03:41:55PM -0500, Nitesh Narayan Lal wrote:
>>>>>> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
>>>>>> However I am still thinking about a workload which I can use to test its
>>>>>> effectiveness.
>>>>> You might want to look at doing something like min(MAX_ORDER - 1,
>>>>> HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
>>>>> THP which is the most likely to be used page size with the guest.
>>>> Sure, thanks for the suggestion.
>>>
>>> Given current hinting in balloon is MAX_ORDER I'd say
>>> share code. If you feel a need to adjust down the road,
>>> adjust both of them with actual testing showing gains.
>>
>> Actually I'm left kind of wondering why we are even going through
>> virtio-balloon for this?
> 
> Just look at what it does.
> 
> It improves memory overcommit if guests are cooperative, and it does
> this by giving the hypervisor addresses of pages which it can discard.
> 
> It's just *exactly* like the balloon with all the same limitations.

I agree, this belongs to virtio-balloon *unless* we run into real
problems implementing it via an asynchronous mechanism.

> 
>> It seems like this would make much more sense
>> as core functionality of KVM itself for the specific architectures
>> rather than some side thing.

Whatever can be handled in user space and does not have significant
performance impacts should be handled in user space. If we run into real
problems with that approach, fair enough. (e.g. vcpu yielding is a good
example where an implementation in KVM makes sense, not going via QEMU)

> 
> Well same as balloon: whether it's useful to you at all
> would very much depend on your workloads.
> 
> This kind of cooperative functionality is good for co-located
> single-tenant VMs. That's pretty niche.  The core things in KVM
> generally don't trust guests.
> 
> 
>> In addition this could end up being
>> redundant when you start getting into either the s390 or PowerPC
>> architectures as they already have means of providing unused page
>> hints.

I'd like to note that on s390x the functionality is not provided when
running nested guests, and there are real problems getting it ever
supported. (See the description below of how it works on s390x; the
issue for nested guests is the bits in the guest -> host page tables,
which we cannot support for nested guests.)

Hinting only works for guests running one level under LPAR (with a
recent machine), but not nested guests.

(LPAR -> KVM1 works, LPAR -> KVM1 -> KVM2 does not work for the latter)

So an implementation for s390 would still make sense for this scenario.

> 
> Interesting. Is there host support in kvm?

On s390x there is. It works on page granularity and synchronization
between guest/host ("don't drop a page in the host while the guest is
reusing it") is done via special bits in the host->guest page table.
Instructions in the guest are able to modify these bits. A guest can
configure a "usage state" of its backing PTEs, e.g. "unused" or "stable".

Whenever a page in the guest is freed/reused, the ESSA instruction is
triggered in the guest. It will modify the page table bits and add the
guest physical pfn to a buffer in the host. Once that buffer is full,
ESSA will trigger an intercept to the hypervisor. Here, all these
"unused" pages can be zapped.

Also, when swapping a page out in the hypervisor, if it was masked by
the guest as unused or logically zero, instead of swapping out the page,
it can simply be dropped and a fresh zero page can be supplied when the
guest tries to access it.

"ESSA" is implemented in KVM in arch/s390/kvm/priv.c:handle_essa().

So on s390x, it works because the synchronization with the hypervisor is
directly built into hw virtualization support (guest->host page tables +
instruction) and ESSA will not intercept on every call (due to the buffer).
> 
> 
>> I have a set of patches I proposed that add similar functionality via
>> a KVM hypercall for x86 instead of doing it as a part of a Virtio
>> device[1].  I'm suspecting the overhead of doing things this way is
much less than having to make multiple madvise system calls from QEMU
>> back into the kernel.
> 
> Well whether it's a virtio device is orthogonal to whether it's an
> madvise call, right? You can build vhost-pagehint and that can
> handle requests in a VQ within balloon and do it
> within host kernel directly.
> 
> virtio rings let you pass multiple pages so it's really hard to
> say which will win outright - maybe it's more important
> to coalesce exits.

We don't know until we measure it.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-11  9:28                       ` David Hildenbrand
@ 2019-02-12  5:16                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-12  5:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Feb 11, 2019 at 10:28:31AM +0100, David Hildenbrand wrote:
> On 10.02.19 01:38, Michael S. Tsirkin wrote:
> > On Fri, Feb 08, 2019 at 02:05:09PM -0800, Alexander Duyck wrote:
> >> On Fri, Feb 8, 2019 at 1:38 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>
> >>> On Fri, Feb 08, 2019 at 03:41:55PM -0500, Nitesh Narayan Lal wrote:
> >>>>>> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
> >>>>>> However I am still thinking about a workload which I can use to test its
> >>>>>> effectiveness.
> >>>>> You might want to look at doing something like min(MAX_ORDER - 1,
> >>>>> HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
> >>>>> THP which is the most likely to be used page size with the guest.
> >>>> Sure, thanks for the suggestion.
> >>>
> >>> Given current hinting in balloon is MAX_ORDER I'd say
> >>> share code. If you feel a need to adjust down the road,
> >>> adjust both of them with actual testing showing gains.
> >>
> >> Actually I'm left kind of wondering why we are even going through
> >> virtio-balloon for this?
> > 
> > Just look at what does it do.
> > 
> > It improves memory overcommit if guests are cooperative, and it does
> > this by giving the hypervisor addresses of pages which it can discard.
> > 
> > It's just *exactly* like the balloon with all the same limitations.
> 
> I agree, this belongs to virtio-balloon *unless* we run into real
> problems implementing it via an asynchronous mechanism.
> 
> > 
> >> It seems like this would make much more sense
> >> as core functionality of KVM itself for the specific architectures
> >> rather than some side thing.
> 
> Whatever can be handled in user space and does not have significant
> performance impacts should be handled in user space. If we run into real
> problems with that approach, fair enough. (e.g. vcpu yielding is a good
> example where an implementation in KVM makes sense, not going via QEMU)

Just to note, if we wanted to we could add a special kind of VQ where
e.g. kick yields the VCPU. You don't necessarily need a hypercall for
this. A virtio-cpu, yay!


> > 
> > Well same as balloon: whether it's useful to you at all
> > would very much depend on your workloads.
> > 
> > This kind of cooperative functionality is good for co-located
> > single-tenant VMs. That's pretty niche.  The core things in KVM
> > generally don't trust guests.
> > 
> > 
> >> In addition this could end up being
> >> redundant when you start getting into either the s390 or PowerPC
> >> architectures as they already have means of providing unused page
> >> hints.
> 
> I'd like to note that on s390x the functionality is not provided when
> running nested guests. And there are real problems getting it ever
> supported. (See the description below of how it works on s390x; the issue
> for nested guests is the bits in the guest -> host page tables, which we
> cannot support for nested guests.)
> 
> Hinting only works for guests running one level under LPAR (with a
> recent machine), but not nested guests.
> 
> (LPAR -> KVM1 works, LPAR -> KVM1 -> KVM2 does not work for the latter)
> 
> So an implementation for s390 would still make sense for this scenario.
> 
> > 
> > Interesting. Is there host support in kvm?
> 
> On s390x there is. It works on page granularity and synchronization
> between guest/host ("don't drop a page in the host while the guest is
> reusing it") is done via special bits in the host->guest page table.
> Instructions in the guest are able to modify these bits. A guest can
> configure a "usage state" of its backing PTEs, e.g. "unused" or "stable".
> 
> Whenever a page in the guest is freed/reused, the ESSA instruction is
> triggered in the guest. It will modify the page table bits and add the
> guest physical pfn to a buffer in the host. Once that buffer is full,
> ESSA will trigger an intercept to the hypervisor. Here, all these
> "unused" pages can be zapped.
> 
> Also, when swapping a page out in the hypervisor, if it was marked by
> the guest as unused or logically zero, instead of swapping out the page,
> it can simply be dropped and a fresh zero page can be supplied when the
> guest tries to access it.
> 
> "ESSA" is implemented in KVM in arch/s390/kvm/priv.c:handle_essa().
> 
> So on s390x, it works because the synchronization with the hypervisor is
> directly built into hw virtualization support (guest->host page tables +
> instruction) and ESSA will not intercept on every call (due to the buffer).
> > 
> > 
> >> I have a set of patches I proposed that add similar functionality via
> >> a KVM hypercall for x86 instead of doing it as a part of a Virtio
> >> device[1].  I'm suspecting the overhead of doing things this way is
> >> much less than having to make multiple madvise system calls from QEMU
> >> back into the kernel.
> > 
> > Well whether it's a virtio device is orthogonal to whether it's an
> > madvise call, right? You can build vhost-pagehint and that can
> > handle requests in a VQ within balloon and do it
> > within host kernel directly.
> > 
> > virtio rings let you pass multiple pages so it's really hard to
> > say which will win outright - maybe it's more important
> > to coalesce exits.
> 
> We don't know until we measure it.

So to measure, I think we can start with traces that show how often
specific workloads allocate/free pages of specific sizes.  We don't
necessarily need hypercall/host support.  We might want "mm: Add merge
page notifier" so we can count merges.

> -- 
> 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
                   ` (7 preceding siblings ...)
  2019-02-04 20:20 ` [RFC][QEMU PATCH] KVM: Support for guest free " Nitesh Narayan Lal
@ 2019-02-12  9:03 ` Wang, Wei W
  2019-02-12  9:24   ` David Hildenbrand
  2019-02-13  9:00 ` Wang, Wei W
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 116+ messages in thread
From: Wang, Wei W @ 2019-02-12  9:03 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange

On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
> The following patch-set proposes an efficient mechanism for handing freed
> memory between the guest and the host. It enables the guests with no page
> cache to rapidly free and reclaim memory to and from the host respectively.
> 
> Benefit:
> With this patch-series, in our test-case, executed on a single system and
> single NUMA node with 15GB memory, we were able to successfully launch
> at least 5 guests when page hinting was enabled and 3 without it. (Detailed
> explanation of the test procedure is provided at the bottom).
> 
> Changelog in V8:
> In this patch-series, the earlier approach [1] which was used to capture and
> scan the pages freed by the guest has been changed. The new approach is
> briefly described below:
> 
> The patch-set still leverages the existing arch_free_page() to add this
> functionality. It maintains a per CPU array which is used to store the pages
> freed by the guest. The maximum number of entries which it can hold is
> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
> is scanned and only the pages which are available in the buddy are stored.
> This process continues until the array is filled with pages which are part of
> the buddy free list, after which it wakes up a per-CPU kernel thread.
> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
> and if the page is not reallocated and present in the buddy, the kernel
> thread attempts to isolate it from the buddy. If it is successfully isolated, the
> page is added to another per-cpu array. Once the entire scanning process is
> complete, all the isolated pages are reported to the host through an existing
> virtio-balloon driver.

 Hi Nitesh,

Have you guys thought about something like below, which would be simpler:

- use bitmaps to record free pages, e.g. xbitmap: https://lkml.org/lkml/2018/1/9/304.
  The bitmap can be indexed by the guest pfn, and it's globally accessed by all the CPUs;
- arch_free_page(): set the bits of the freed pages from the bitmap
 (no per-CPU array with hardcoded fixed length and no per-cpu scanning thread)
- arch_alloc_page(): clear the related bits from the bitmap
- expose 2 APIs for the callers:
  -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr); 
     This API searches for the next free page chunk (@nr of pages), starting from @pfn_start.
     Bits of those free pages will be cleared after this function returns.
  -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
     This API sets the @nr continuous bits starting from pfn_start.

Usage example with balloon:
1) host requests to start ballooning;
2) balloon driver calls get_free_page_hints() and reports the hints to the host via report_vq;
3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free pages and put back pfn_start to ack_vq;
4) balloon driver receives pfn_start and calls put_free_page_hints(pfn_start) to have the related bits in the bitmap set, indicating that those free pages are ready to be allocated.

In above 2), get_free_page_hints clears the bits which indicates that those pages are not ready to be used by the guest yet. Why?
This is because 3) will unmap the underlying physical pages from EPT. Normally, when guest re-visits those pages, EPT violations and QEMU page faults will get a new host page to set up the related EPT entry. If guest uses that page before the page gets unmapped (i.e. right before step 3), no EPT violation happens and the guest will use the same physical page that will be unmapped and given to other host threads. So we need to make sure that the guest free page is usable only after step 3 finishes.

Back to arch_alloc_page(): it needs to check if the allocated pages have "1" set in the bitmap; if that's true, just clear the bits. Otherwise, it means step 2) above has happened and step 4) hasn't been reached. In this case, we can either have arch_alloc_page() busy-waiting a bit till 4) is done for that page,
or, better, have a balloon callback which prioritizes 3) and 4) to make the page usable by the guest.

Using bitmaps to record free page hints doesn't require taking the free pages off the buddy list and returning them later, which needs to go through the long allocation/free code path.

Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-12  9:03 ` [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Wang, Wei W
@ 2019-02-12  9:24   ` David Hildenbrand
  2019-02-12 17:24     ` Nitesh Narayan Lal
  2019-02-13  8:55     ` Wang, Wei W
  0 siblings, 2 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-12  9:24 UTC (permalink / raw)
  To: Wang, Wei W, Nitesh Narayan Lal, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, mst, dodgen,
	konrad.wilk, dhildenb, aarcange

On 12.02.19 10:03, Wang, Wei W wrote:
> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>> The following patch-set proposes an efficient mechanism for handing freed
>> memory between the guest and the host. It enables the guests with no page
>> cache to rapidly free and reclaim memory to and from the host respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system and
>> single NUMA node with 15GB memory, we were able to successfully launch
>> at least 5 guests when page hinting was enabled and 3 without it. (Detailed
>> explanation of the test procedure is provided at the bottom).
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and
>> scan the pages freed by the guest has been changed. The new approach is
>> briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this
>> functionality. It maintains a per CPU array which is used to store the pages
>> freed by the guest. The maximum number of entries which it can hold is
>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
>> is scanned and only the pages which are available in the buddy are stored.
>> This process continues until the array is filled with pages which are part of
>> the buddy free list, after which it wakes up a per-CPU kernel thread.
>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>> and if the page is not reallocated and present in the buddy, the kernel
>> thread attempts to isolate it from the buddy. If it is successfully isolated, the
>> page is added to another per-cpu array. Once the entire scanning process is
>> complete, all the isolated pages are reported to the host through an existing
>> virtio-balloon driver.
> 
>  Hi Nitesh,
> 
> Have you guys thought about something like below, which would be simpler:

Responding because I'm the first to stumble over this mail, hah! :)

> 
> - use bitmaps to record free pages, e.g. xbitmap: https://lkml.org/lkml/2018/1/9/304.
>   The bitmap can be indexed by the guest pfn, and it's globally accessed by all the CPUs;

Global means all VCPUs will be competing potentially for a single lock
when freeing/allocating a page, no? What if you have 64 VCPUs
allocating/freeing memory like crazy?

(I assume some kind of locking is required even if the bitmap would be
atomic. Also, doesn't xbitmap mean that we eventually have to allocate
memory at places where we don't want to - e.g. from arch_free_page ?)

That's the big benefit of taking the pages off the buddy free list. Other
VCPUs won't stumble over them, waiting for them to get freed in the
hypervisor.

> - arch_free_page(): set the bits of the freed pages from the bitmap
>  (no per-CPU array with hardcoded fixed length and no per-cpu scanning thread)
> - arch_alloc_page(): clear the related bits from the bitmap
> - expose 2 APIs for the callers:
>   -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr); 
>      This API searches for the next free page chunk (@nr of pages), starting from @pfn_start.
>      Bits of those free pages will be cleared after this function returns.
>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>      This API sets the @nr continuous bits starting from pfn_start.
> 
> Usage example with balloon:
> 1) host requests to start ballooning;
> 2) balloon driver calls get_free_page_hints() and reports the hints to the host via report_vq;
> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free pages and put back pfn_start to ack_vq;
> 4) balloon driver receives pfn_start and calls put_free_page_hints(pfn_start) to have the related bits in the bitmap set, indicating that those free pages are ready to be allocated.

This sounds more like "the host requests to get free pages once in a
while" compared to "the host is always informed about free pages". At
the time where the host actually has to ask the guest (e.g. because the
host is low on memory), it might be too late to wait for guest action.
Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
as candidates for removal and if the host is low on memory, only
scanning the guest page tables is sufficient to free up memory.

But both points might just be an implementation detail in the example
you describe.

> 
> In above 2), get_free_page_hints clears the bits which indicates that those pages are not ready to be used by the guest yet. Why?
> This is because 3) will unmap the underlying physical pages from EPT. Normally, when guest re-visits those pages, EPT violations and QEMU page faults will get a new host page to set up the related EPT entry. If guest uses that page before the page gets unmapped (i.e. right before step 3), no EPT violation happens and the guest will use the same physical page that will be unmapped and given to other host threads. So we need to make sure that the guest free page is usable only after step 3 finishes.
> 
> Back to arch_alloc_page(): it needs to check if the allocated pages have "1" set in the bitmap; if that's true, just clear the bits. Otherwise, it means step 2) above has happened and step 4) hasn't been reached. In this case, we can either have arch_alloc_page() busy-waiting a bit till 4) is done for that page,
> or, better, have a balloon callback which prioritizes 3) and 4) to make the page usable by the guest.

Regarding the latter, the VCPU allocating a page cannot do anything if
the page (along with other pages) is just being freed by the hypervisor.
It has to busy-wait, no chance to prioritize.

> 
> Using bitmaps to record free page hints doesn't require taking the free pages off the buddy list and returning them later, which needs to go through the long allocation/free code path.
> 

Yes, but it means that any process is able to get stuck on such a page
for as long as it takes to report the free pages to the hypervisor and
for it to call madvise(pfn_start, DONTNEED) on any such page.

Nice idea, but I think we definitely need something that can potentially
be implemented per-cpu without any global locks involved.

Thanks!

> Best,
> Wei
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages
  2019-02-10  0:38                     ` Michael S. Tsirkin
  2019-02-11  9:28                       ` David Hildenbrand
@ 2019-02-12 17:10                       ` Nitesh Narayan Lal
  1 sibling, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-12 17:10 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alexander Duyck
  Cc: kvm list, LKML, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang,
	Yang Zhang, Rik van Riel, david, dodgen, Konrad Rzeszutek Wilk,
	dhildenb, Andrea Arcangeli




On 2/9/19 7:38 PM, Michael S. Tsirkin wrote:
> On Fri, Feb 08, 2019 at 02:05:09PM -0800, Alexander Duyck wrote:
>> On Fri, Feb 8, 2019 at 1:38 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Fri, Feb 08, 2019 at 03:41:55PM -0500, Nitesh Narayan Lal wrote:
>>>>>> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
>>>>>> However I am still thinking about a workload which I can use to test its
>>>>>> effectiveness.
>>>>> You might want to look at doing something like min(MAX_ORDER - 1,
>>>>> HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
>>>>> THP which is the most likely to be used page size with the guest.
>>>> Sure, thanks for the suggestion.
>>> Given current hinting in balloon is MAX_ORDER I'd say
>>> share code. If you feel a need to adjust down the road,
>>> adjust both of them with actual testing showing gains.
>> Actually I'm left kind of wondering why we are even going through
>> virtio-balloon for this?
> Just look at what does it do.
>
> It improves memory overcommit if guests are cooperative, and it does
> this by giving the hypervisor addresses of pages which it can discard.
>
> It's just *exactly* like the balloon with all the same limitations.
>
>> It seems like this would make much more sense
>> as core functionality of KVM itself for the specific architectures
>> rather than some side thing.
> Well same as balloon: whether it's useful to you at all
> would very much depend on your workloads.
>
> This kind of cooperative functionality is good for co-located
> single-tenant VMs. That's pretty niche.  The core things in KVM
> generally don't trust guests.
>
>
>> In addition this could end up being
>> redundant when you start getting into either the s390 or PowerPC
>> architectures as they already have means of providing unused page
>> hints.
> Interesting. Is there host support in kvm?
>
>
>> I have a set of patches I proposed that add similar functionality via
>> a KVM hypercall for x86 instead of doing it as a part of a Virtio
>> device[1].  I'm suspecting the overhead of doing things this way is
> much less than having to make multiple madvise system calls from QEMU
>> back into the kernel.
> Well whether it's a virtio device is orthogonal to whether it's an
> madvise call, right? You can build vhost-pagehint and that can
> handle requests in a VQ within balloon and do it
> within host kernel directly.
>
> virtio rings let you pass multiple pages so it's really hard to
> say which will win outright - maybe it's more important
> to coalesce exits.
>
> Nitesh, how about trying the same tests and reporting performance?
Noted, I can give it a try before my next posting.
>
>
>> One other concern that has been pointed out with my patchset that
>> would likely need to be addressed here as well is what do we do about
>> other hypervisors that decide to implement page hinting. We probably
>> should look at making this KVM/QEMU specific code run through the
> paravirtual infrastructure instead of tying into the x86 arch code
>> directly.
>>
>> [1] https://lkml.org/lkml/2019/2/4/903
>
> So virtio is a paravirtual interface, that's an argument for
> using it then.
>
> In any case pls copy the Cc'd crowd on future version of your patches.
>
-- 
Regards
Nitesh



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-12  9:24   ` David Hildenbrand
@ 2019-02-12 17:24     ` Nitesh Narayan Lal
  2019-02-12 19:34       ` David Hildenbrand
  2019-02-13  8:55     ` Wang, Wei W
  1 sibling, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-12 17:24 UTC (permalink / raw)
  To: David Hildenbrand, Wang, Wei W, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, mst, dodgen,
	konrad.wilk, dhildenb, aarcange




On 2/12/19 4:24 AM, David Hildenbrand wrote:
> On 12.02.19 10:03, Wang, Wei W wrote:
>> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>>> The following patch-set proposes an efficient mechanism for handing freed
>>> memory between the guest and the host. It enables the guests with no page
>>> cache to rapidly free and reclaim memory to and from the host respectively.
>>>
>>> Benefit:
>>> With this patch-series, in our test-case, executed on a single system and
>>> single NUMA node with 15GB memory, we were able to successfully launch
>>> at least 5 guests when page hinting was enabled and 3 without it. (Detailed
>>> explanation of the test procedure is provided at the bottom).
>>>
>>> Changelog in V8:
>>> In this patch-series, the earlier approach [1] which was used to capture and
>>> scan the pages freed by the guest has been changed. The new approach is
>>> briefly described below:
>>>
>>> The patch-set still leverages the existing arch_free_page() to add this
>>> functionality. It maintains a per CPU array which is used to store the pages
>>> freed by the guest. The maximum number of entries which it can hold is
>>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
>>> is scanned and only the pages which are available in the buddy are stored.
>>> This process continues until the array is filled with pages which are part of
>>> the buddy free list, after which it wakes up a per-CPU kernel thread.
>>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>>> and if the page is not reallocated and present in the buddy, the kernel
>>> thread attempts to isolate it from the buddy. If it is successfully isolated, the
>>> page is added to another per-cpu array. Once the entire scanning process is
>>> complete, all the isolated pages are reported to the host through an existing
>>> virtio-balloon driver.
>>  Hi Nitesh,
>>
>> Have you guys thought about something like below, which would be simpler:
> Responding because I'm the first to stumble over this mail, hah! :)
>
>> - use bitmaps to record free pages, e.g. xbitmap: https://lkml.org/lkml/2018/1/9/304.
>>   The bitmap can be indexed by the guest pfn, and it's globally accessed by all the CPUs;
> Global means all VCPUs will be competing potentially for a single lock
> when freeing/allocating a page, no? What if you have 64 VCPUs
> allocating/freeing memory like crazy?
>
> (I assume some kind of locking is required even if the bitmap would be
> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> memory at places where we don't want to - e.g. from arch_free_page ?)
>
>> That's the big benefit of taking the pages off the buddy free list. Other
> VCPUs won't stumble over them, waiting for them to get freed in the
> hypervisor.
>
>> - arch_free_page(): set the bits of the freed pages from the bitmap
>>  (no per-CPU array with hardcoded fixed length and no per-cpu scanning thread)
>> - arch_alloc_page(): clear the related bits from the bitmap
>> - expose 2 APIs for the callers:
>>   -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr); 
>>      This API searches for the next free page chunk (@nr of pages), starting from @pfn_start.
>>      Bits of those free pages will be cleared after this function returns.
>>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>>      This API sets the @nr continuous bits starting from pfn_start.
>>
>> Usage example with balloon:
>> 1) host requests to start ballooning;
>> 2) balloon driver calls get_free_page_hints() and reports the hints to the host via report_vq;
>> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free pages and put back pfn_start to ack_vq;
>> 4) balloon driver receives pfn_start and calls put_free_page_hints(pfn_start) to have the related bits in the bitmap set, indicating that those free pages are ready to be allocated.
> This sounds more like "the host requests to get free pages once in a
> while" compared to "the host is always informed about free pages". At
> the time where the host actually has to ask the guest (e.g. because the
> host is low on memory), it might be too late to wait for guest action.
> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
> as candidates for removal and if the host is low on memory, only
> scanning the guest page tables is sufficient to free up memory.
>
> But both points might just be an implementation detail in the example
> you describe.
>
>> In above 2), get_free_page_hints clears the bits which indicates that those pages are not ready to be used by the guest yet. Why?
>> This is because 3) will unmap the underlying physical pages from EPT. Normally, when guest re-visits those pages, EPT violations and QEMU page faults will get a new host page to set up the related EPT entry. If guest uses that page before the page gets unmapped (i.e. right before step 3), no EPT violation happens and the guest will use the same physical page that will be unmapped and given to other host threads. So we need to make sure that the guest free page is usable only after step 3 finishes.
>>
>> Back to arch_alloc_page(): it needs to check if the allocated pages have "1" set in the bitmap; if that's true, just clear the bits. Otherwise, it means step 2) above has happened and step 4) hasn't been reached. In this case, we can either have arch_alloc_page() busy-waiting a bit till 4) is done for that page,
>> or, better, have a balloon callback which prioritizes 3) and 4) to make the page usable by the guest.
> Regarding the latter, the VCPU allocating a page cannot do anything if
> the page (along with other pages) is just being freed by the hypervisor.
> It has to busy-wait, no chance to prioritize.
>
>> Using bitmaps to record free page hints doesn't require taking the free pages off the buddy list and returning them later, which needs to go through the long allocation/free code path.
>>
> Yes, but it means that any process is able to get stuck on such a page
> for as long as it takes to report the free pages to the hypervisor and
> for it to call madvise(pfn_start, DONTNEED) on any such page.
>
> Nice idea, but I think we definitely need something that can potentially
> be implemented per-cpu without any global locks involved.
>
> Thanks!
>
>> Best,
>> Wei
>>
Hi Wei,

For your comment, I agree with David. If we have one global bitmap shared
by all the CPUs, we will have to acquire a lock.
Also, as David mentioned, the idea is to derive the hints from the guest,
rather than the host asking for free pages.

However, I am wondering if having per-cpu bitmaps is possible.
Using this I could possibly get rid of the fixed array size issue.

-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-12 17:24     ` Nitesh Narayan Lal
@ 2019-02-12 19:34       ` David Hildenbrand
  0 siblings, 0 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-12 19:34 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Wang, Wei W, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, mst, dodgen,
	konrad.wilk, dhildenb, aarcange

On 12.02.19 18:24, Nitesh Narayan Lal wrote:
> 
> On 2/12/19 4:24 AM, David Hildenbrand wrote:
>> On 12.02.19 10:03, Wang, Wei W wrote:
>>> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>>>> The following patch-set proposes an efficient mechanism for handing freed
>>>> memory between the guest and the host. It enables the guests with no page
>>>> cache to rapidly free and reclaim memory to and from the host respectively.
>>>>
>>>> Benefit:
>>>> With this patch-series, in our test-case, executed on a single system and
>>>> single NUMA node with 15GB memory, we were able to successfully launch
>>>> at least 5 guests when page hinting was enabled and 3 without it. (Detailed
>>>> explanation of the test procedure is provided at the bottom).
>>>>
>>>> Changelog in V8:
>>>> In this patch-series, the earlier approach [1] which was used to capture and
>>>> scan the pages freed by the guest has been changed. The new approach is
>>>> briefly described below:
>>>>
>>>> The patch-set still leverages the existing arch_free_page() to add this
>>>> functionality. It maintains a per CPU array which is used to store the pages
>>>> freed by the guest. The maximum number of entries which it can hold is
>>>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
>>>> is scanned and only the pages which are available in the buddy are stored.
>>>> This process continues until the array is filled with pages which are part of
>>>> the buddy free list. After which it wakes up a kernel per-cpu-thread.
>>>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>>>> and if the page is not reallocated and present in the buddy, the kernel
>>>> thread attempts to isolate it from the buddy. If it is successfully isolated, the
>>>> page is added to another per-cpu array. Once the entire scanning process is
>>>> complete, all the isolated pages are reported to the host through an existing
>>>> virtio-balloon driver.
>>>  Hi Nitesh,
>>>
>>> Have you guys thought about something like below, which would be simpler:
>> Responding because I'm the first to stumble over this mail, hah! :)
>>
>>> - use bitmaps to record free pages, e.g. xbitmap: https://lkml.org/lkml/2018/1/9/304.
>>>   The bitmap can be indexed by the guest pfn, and it's globally accessed by all the CPUs;
>> Global means all VCPUs will be competing potentially for a single lock
>> when freeing/allocating a page, no? What if you have 64VCPUs
>> allocating/freeing memory like crazy?
>>
>> (I assume some kind of locking is required even if the bitmap would be
>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>> memory at places where we don't want to - e.g. from arch_free_page ?)
>>
>> That's the big benefit of taking the pages of the buddy free list. Other
>> VCPUs won't stumble over them, waiting for them to get freed in the
>> hypervisor.
>>
>>> - arch_free_page(): set the bits of the freed pages from the bitmap
>>>  (no per-CPU array with hardcoded fixed length and no per-cpu scanning thread)
>>> - arch_alloc_page(): clear the related bits from the bitmap
>>> - expose 2 APIs for the callers:
>>>   -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr); 
>>>      This API searches for the next free page chunk (@nr of pages), starting from @pfn_start.
>>>      Bits of those free pages will be cleared after this function returns.
>>>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>>>      This API sets the @nr continuous bits starting from pfn_start.
>>>
>>> Usage example with balloon:
>>> 1) host requests to start ballooning;
>>> 2) balloon driver get_free_page_hints and report the hints to host via report_vq;
>>> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free pages and put back pfn_start to ack_vq;
>>> 4) balloon driver receives pfn_start and calls put_free_page_hints(pfn_start) to have the related bits from the bitmap to be set, indicating that those free pages are ready to be allocated.
>> This sounds more like "the host requests to get free pages once in a
>> while" compared to "the host is always informed about free pages". At
>> the time where the host actually has to ask the guest (e.g. because the
>> host is low on memory), it might be too late to wait for guest action.
>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
>> as candidates for removal and if the host is low on memory, only
>> scanning the guest page tables is sufficient to free up memory.
>>
>> But both points might just be an implementation detail in the example
>> you describe.
>>
>>> In above 2), get_free_page_hints clears the bits which indicates that those pages are not ready to be used by the guest yet. Why?
>>> This is because 3) will unmap the underlying physical pages from EPT. Normally, when guest re-visits those pages, EPT violations and QEMU page faults will get a new host page to set up the related EPT entry. If guest uses that page before the page gets unmapped (i.e. right before step 3), no EPT violation happens and the guest will use the same physical page that will be unmapped and given to other host threads. So we need to make sure that the guest free page is usable only after step 3 finishes.
>>>
>>> Back to arch_alloc_page(), it needs to check if the allocated pages have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it means step 2) above has happened and step 4) hasn't been reached. In this case, we can either have arch_alloc_page() busywaiting a bit till 4) is done for that page
>>> Or better to have a balloon callback which prioritize 3) and 4) to make this page usable by the guest.
>> Regarding the latter, the VCPU allocating a page cannot do anything if
>> the page (along with other pages) is just being freed by the hypervisor.
>> It has to busy-wait, no chance to prioritize.
>>
>>> Using bitmaps to record free page hints doesn't require taking the free pages off the buddy list and returning them later, which has to go through the long allocation/free code path.
>>>
>> Yes, but it means that any process is able to get stuck on such a page
>> for as long as it takes to report the free pages to the hypervisor and
>> for it to call madvise(pfn_start, DONTNEED) on any such page.
>>
>> Nice idea, but I think we definitely need something that can potentially
>> be implemented per-cpu without any global locks involved.
>>
>> Thanks!
>>
>>> Best,
>>> Wei
>>>
> Hi Wei,
> 
> For your comment, I agree with David. If we have one global per-cpu, we
> will have to acquire a lock.
> Also as David mentioned the idea is to derive the hints from the guest,
> rather than host asking for free pages.
> 
> However, I am wondering if having per-cpu bitmaps is possible?
> Using this I can possibly get rid of the fixed array size issue.
> 

I assume we will have problems with dynamically sized bitmaps - memory
has to be allocated. Similar to a dynamically sized list.

But it is definitely worth investigating.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-12  9:24   ` David Hildenbrand
  2019-02-12 17:24     ` Nitesh Narayan Lal
@ 2019-02-13  8:55     ` Wang, Wei W
  2019-02-13  9:19       ` David Hildenbrand
  1 sibling, 1 reply; 116+ messages in thread
From: Wang, Wei W @ 2019-02-13  8:55 UTC (permalink / raw)
  To: 'David Hildenbrand',
	Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, yang.zhang.wz, riel, mst, dodgen, konrad.wilk, dhildenb,
	aarcange

On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> Global means all VCPUs will be competing potentially for a single lock when
> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
> memory like crazy?

I think the key point is that the 64 vcpus won't allocate/free the same page simultaneously, so there is no need for a big global lock, is there?
I think atomic operations on the bitmap would be enough.

> (I assume some kind of locking is required even if the bitmap would be
> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> memory at places where we don't want to - e.g. from arch_free_page ?)

arch_free_page() is called from free_pages_prepare(); why can't we allocate memory there?

It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.

> 
> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
> won't stumble over them, waiting for them to get freed in the hypervisor.

As also mentioned above, I think other vcpus will not allocate/free a page that is in the process of being allocated/freed.

> This sounds more like "the host requests to get free pages once in a while"
> compared to "the host is always informed about free pages". At the time
> where the host actually has to ask the guest (e.g. because the host is low on
> memory), it might be too late to wait for guest action.

Option 1: Host asks for free pages:
It is not necessary to ask only when the host is under memory pressure.
This could be the orchestration layer's job to monitor the host memory usage.
For example, people could set the condition "when 50% of the host memory
has been used, start to ask a guest for some amount of free pages" 

Option 2: Guest actively offers free pages:
Add a balloon callback to arch_free_page so that whenever a page gets freed, its gfn
will be filled into the balloon's report_vq and the host will take away the backing
host page.

Both options can be implemented. But I think option 1 would be more
efficient as the guest free pages are offered on demand.  

> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> candidates for removal and if the host is low on memory, only scanning the
> guest page tables is sufficient to free up memory.
> 
> But both points might just be an implementation detail in the example you
> describe.

Yes, it is an implementation detail. I think DONTNEED would be easier
for the first step.

> 
> >
> > In above 2), get_free_page_hints clears the bits which indicates that those
> pages are not ready to be used by the guest yet. Why?
> > This is because 3) will unmap the underlying physical pages from EPT.
> Normally, when guest re-visits those pages, EPT violations and QEMU page
> faults will get a new host page to set up the related EPT entry. If guest uses
> that page before the page gets unmapped (i.e. right before step 3), no EPT
> violation happens and the guest will use the same physical page that will be
> unmapped and given to other host threads. So we need to make sure that
> the guest free page is usable only after step 3 finishes.
> >
> > Back to arch_alloc_page(), it needs to check if the allocated pages
> > have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
> means step 2) above has happened and step 4) hasn't been reached. In this
> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
> for that page Or better to have a balloon callback which prioritize 3) and 4)
> to make this page usable by the guest.
> 
> Regarding the latter, the VCPU allocating a page cannot do anything if the
> page (along with other pages) is just being freed by the hypervisor.
> It has to busy-wait, no chance to prioritize.

I meant this:
With this approach, essentially the free pages have 2 states:
ready free page: the page is on the free list and it has "1" in the bitmap
non-ready free page: the page is on the free list and it has "0" in the bitmap
Ready free pages are those which can be allocated for use.
Non-ready free pages are those which are in the process of being reported to
the host and whose related EPT mapping is about to be zapped.

The non-ready pages are inserted into the report_vq and wait for the
host to zap the mappings one by one. After the mapping gets zapped
(which means the backing host page has been taken away), host acks to
the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).

So a non-ready free page may happen to be used while it is waiting in
the report_vq for the host to zap the mapping; the balloon could
have a fast path to notify the host:
"page 0x1000 is about to be used, don’t zap the mapping when you get
0x1000 from the report_vq"  /*option [1] */

Or

"page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
so that the free page will be marked as ready free page and the guest can use it".
This option will generate an extra EPT violation and QEMU page fault to get a new host
page to back the guest ready free page.

> 
> >
> > Using bitmaps to record free page hints don't need to take the free pages
> off the buddy list and return them later, which needs to go through the long
> allocation/free code path.
> >
> 
> Yes, but it means that any process is able to get stuck on such a page for as
> long as it takes to report the free pages to the hypervisor and for it to call
> madvise(pfn_start, DONTNEED) on any such page.

This only happens when a guest thread happens to allocate a page which is
being reported to the host. Using option [1] above will avoid this.

Best,
Wei


* RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
                   ` (8 preceding siblings ...)
  2019-02-12  9:03 ` [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Wang, Wei W
@ 2019-02-13  9:00 ` Wang, Wei W
  2019-02-13 12:06   ` Nitesh Narayan Lal
  2019-02-16  9:40 ` David Hildenbrand
  2019-02-23  0:02 ` Alexander Duyck
  11 siblings, 1 reply; 116+ messages in thread
From: Wang, Wei W @ 2019-02-13  9:00 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange

On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
> The following patch-set proposes an efficient mechanism for handing freed
> memory between the guest and the host. It enables the guests with no page
> cache to rapidly free and reclaim memory to and from the host respectively.
> 
> Benefit:
> With this patch-series, in our test-case, executed on a single system and
> single NUMA node with 15GB memory, we were able to successfully launch
> at least 5 guests when page hinting was enabled and 3 without it. (Detailed
> explanation of the test procedure is provided at the bottom).
> 
> Changelog in V8:
> In this patch-series, the earlier approach [1] which was used to capture and
> scan the pages freed by the guest has been changed. The new approach is
> briefly described below:
> 
> The patch-set still leverages the existing arch_free_page() to add this
> functionality. It maintains a per CPU array which is used to store the pages
> freed by the guest. The maximum number of entries which it can hold is
> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
> is scanned and only the pages which are available in the buddy are stored.
> This process continues until the array is filled with pages which are part of
> the buddy free list. After which it wakes up a kernel per-cpu-thread.
> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
> and if the page is not reallocated and present in the buddy, the kernel
> thread attempts to isolate it from the buddy. If it is successfully isolated, the
> page is added to another per-cpu array. Once the entire scanning process is
> complete, all the isolated pages are reported to the host through an existing
> virtio-balloon driver.

The free pages are removed from the buddy list here. When will they get returned to the buddy list so that the guest threads can use them normally?

Best,
Wei


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-13  8:55     ` Wang, Wei W
@ 2019-02-13  9:19       ` David Hildenbrand
  2019-02-13 12:17         ` Nitesh Narayan Lal
                           ` (2 more replies)
  0 siblings, 3 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-13  9:19 UTC (permalink / raw)
  To: Wang, Wei W, Nitesh Narayan Lal, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, mst, dodgen,
	konrad.wilk, dhildenb, aarcange

On 13.02.19 09:55, Wang, Wei W wrote:
> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
>> Global means all VCPUs will be competing potentially for a single lock when
>> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
>> memory like crazy?
> 
> I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn’t it?
> I think atomic operations on the bitmap would be enough.

If you have to resize/alloc/coordinate who will report, you will need
locking. Especially, I doubt that there is an atomic xbitmap  (prove me
wrong :) ).

> 
>> (I assume some kind of locking is required even if the bitmap would be
>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>> memory at places where we don't want to - e.g. from arch_free_page ?)
> 
> arch_free_pages is in free_pages_prepare, why can't we have memory allocation there?

I remember we were stumbling over some issues that were non-trivial. I
am not 100% sure yet anymore, but allocating memory while deep down in
the freeing part of MM core smells like "be careful".

> 
> It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.

That makes things very ugly. Especially, preallocation will most likely
require locking.

> 
>>
>> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
>> won't stumble over them, waiting for them to get freed in the hypervisor.
> 
> As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.

If a page is in the buddy but stuck in some other bitmap, there is
nothing stopping another VCPU from trying to allocate it. Nitesh has
been fighting with this problem already :)

> 
>> This sounds more like "the host requests to get free pages once in a while"
>> compared to "the host is always informed about free pages". At the time
>> where the host actually has to ask the guest (e.g. because the host is low on
>> memory), it might be too late to wait for guest action.
> 
> Option 1: Host asks for free pages:
> Not necessary to ask only when the host has been in memory pressure.
> This could be the orchestration layer's job to monitor the host memory usage.
> For example, people could set the condition "when 50% of the host memory
> has been used, start to ask a guest for some amount of free pages" 
> 
> Option 2: Guest actively offers free pages:
> Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
> will be filled into the balloon's report_vq and the host will take away the backing
> host page.
> 
> Both options can be implemented. But I think option 1 would be more
> efficient as the guest free pages are offered on demand.  

Yes, but as I mentioned this has other drawbacks. Relying on a guest
to free up memory when you really need it is not going to work. It might
work for some scenarios but should not dictate the design. It is a good
start though if it makes things easier.

Enabling/disabling free page hinting by the hypervisor via some
mechanism is on the other hand a good idea. "I have plenty of free
space, don't worry".

> 
>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
>> candidates for removal and if the host is low on memory, only scanning the
>> guest page tables is sufficient to free up memory.
>>
>> But both points might just be an implementation detail in the example you
>> describe.
> 
> Yes, it is an implementation detail. I think DONTNEED would be easier
> for the first step.
> 
>>
>>>
>>> In above 2), get_free_page_hints clears the bits which indicates that those
>> pages are not ready to be used by the guest yet. Why?
>>> This is because 3) will unmap the underlying physical pages from EPT.
>> Normally, when guest re-visits those pages, EPT violations and QEMU page
>> faults will get a new host page to set up the related EPT entry. If guest uses
>> that page before the page gets unmapped (i.e. right before step 3), no EPT
>> violation happens and the guest will use the same physical page that will be
>> unmapped and given to other host threads. So we need to make sure that
>> the guest free page is usable only after step 3 finishes.
>>>
>>> Back to arch_alloc_page(), it needs to check if the allocated pages
>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
>> means step 2) above has happened and step 4) hasn't been reached. In this
>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
>> for that page Or better to have a balloon callback which prioritize 3) and 4)
>> to make this page usable by the guest.
>>
>> Regarding the latter, the VCPU allocating a page cannot do anything if the
>> page (along with other pages) is just being freed by the hypervisor.
>> It has to busy-wait, no chance to prioritize.
> 
> I meant this:
> With this approach, essentially the free pages have 2 states:
> ready free page: the page is on the free list and it has "1" in the bitmap
> non-ready free page: the page is on the free list and it has "0" in the bitmap
> Ready free pages are those who can be allocated to use.
> Non-ready free pages are those who are in progress of being reported to
> host and the related EPT mapping is about to be zapped. 
> 
> The non-ready pages are inserted into the report_vq and waiting for the
> host to zap the mappings one by one. After the mapping gets zapped
> (which means the backing host page has been taken away), host acks to
> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).

Yes, that's how I understood your approach. The interesting part is
where somebody finds a buddy page and wants to allocate it.

> 
> So the non-ready free page may happen to be used when they are waiting in
> the report_vq to be handled by the host to zap the mapping, balloon could
> have a fast path to notify the host:
> "page 0x1000 is about to be used, don’t zap the mapping when you get
> 0x1000 from the report_vq"  /*option [1] */

This requires coordination and in any case there will be a scenario
where you have to wait for the hypervisor to eventually finish a madv
call. You can just try to make that scenario less likely.

What you propose is synchronous in the worst case. Getting pages off the
buddy makes it possible to have it done completely asynchronously. Nobody
allocating a page has to wait.

> 
> Or
> 
> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
> so that the free page will be marked as ready free page and the guest can use it".
> This option will generate an extra EPT violation and QEMU page fault to get a new host
> page to back the guest ready free page.

Again, coordination with the hypervisor while allocating a page. That is
to be avoided in any case.

> 
>>
>>>
>>> Using bitmaps to record free page hints don't need to take the free pages
>> off the buddy list and return them later, which needs to go through the long
>> allocation/free code path.
>>>
>>
>> Yes, but it means that any process is able to get stuck on such a page for as
>> long as it takes to report the free pages to the hypervisor and for it to call
>> madvise(pfn_start, DONTNEED) on any such page.
> 
> This only happens when the guest thread happens to get allocated on a page which is
> being reported to the host. Using option [1] above will avoid this.

I think getting pages out of the buddy system temporarily is the only
way we can avoid somebody else stumbling over a page currently getting
reported by the hypervisor. Otherwise, as I said, there are scenarios
where an allocating VCPU has to wait for the hypervisor to finish the
"freeing" task. While you can try to "speed up" that scenario -
"hypervisor please prioritize" you cannot avoid it. There will be busy
waiting.

I don't believe what you describe is going to work (especially the not
locking part when working with global resources).

What would be interesting is to see if something like a xbitmap could be
used instead of the per-vcpu list. Nitesh, do you remember what the
problem was with allocating memory from these hooks? Was it a locking issue?

Thanks!

> 
> Best,
> Wei
> 


-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-13  9:00 ` Wang, Wei W
@ 2019-02-13 12:06   ` Nitesh Narayan Lal
  2019-02-14  8:48     ` Wang, Wei W
  0 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-13 12:06 UTC (permalink / raw)
  To: Wang, Wei W, kvm, linux-kernel, pbonzini, lcapitulino, pagupta,
	yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk, dhildenb,
	aarcange




On 2/13/19 4:00 AM, Wang, Wei W wrote:
> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>> The following patch-set proposes an efficient mechanism for handing freed
>> memory between the guest and the host. It enables the guests with no page
>> cache to rapidly free and reclaim memory to and from the host respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system and
>> single NUMA node with 15GB memory, we were able to successfully launch
>> at least 5 guests when page hinting was enabled and 3 without it. (Detailed
>> explanation of the test procedure is provided at the bottom).
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and
>> scan the pages freed by the guest has been changed. The new approach is
>> briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this
>> functionality. It maintains a per CPU array which is used to store the pages
>> freed by the guest. The maximum number of entries which it can hold is
>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
>> is scanned and only the pages which are available in the buddy are stored.
>> This process continues until the array is filled with pages which are part of
>> the buddy free list. After which it wakes up a kernel per-cpu-thread.
>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>> and if the page is not reallocated and present in the buddy, the kernel
>> thread attempts to isolate it from the buddy. If it is successfully isolated, the
>> page is added to another per-cpu array. Once the entire scanning process is
>> complete, all the isolated pages are reported to the host through an existing
>> virtio-balloon driver.
> The free page is removed from the buddy list here. When will they get returned to the buddy list so that the guest threads can use them normally?
Once the host frees the pages, all the isolated pages are returned
to the buddy. (This is implemented in hyperlist_ready().)
>
> Best,
> Wei
-- 
Regards
Nitesh




* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-13  9:19       ` David Hildenbrand
@ 2019-02-13 12:17         ` Nitesh Narayan Lal
  2019-02-13 17:09           ` Michael S. Tsirkin
  2019-02-13 17:16         ` Michael S. Tsirkin
  2019-02-14  9:08         ` Wang, Wei W
  2 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-13 12:17 UTC (permalink / raw)
  To: David Hildenbrand, Wang, Wei W, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, mst, dodgen,
	konrad.wilk, dhildenb, aarcange




On 2/13/19 4:19 AM, David Hildenbrand wrote:
> On 13.02.19 09:55, Wang, Wei W wrote:
>> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
>>> Global means all VCPUs will be competing potentially for a single lock when
>>> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
>>> memory like crazy?
>> I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn’t it?
>> I think atomic operations on the bitmap would be enough.
> If you have to resize/alloc/coordinate who will report, you will need
> locking. Especially, I doubt that there is an atomic xbitmap  (prove me
> wrong :) ).
>
>>> (I assume some kind of locking is required even if the bitmap would be
>>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>>> memory at places where we don't want to - e.g. from arch_free_page ?)
>> arch_free_pages is in free_pages_prepare, why can't we have memory allocation there?
> I remember we were stumbling over some issues that were non-trivial. I
> am not 100% sure yet anymore, but allocating memory while deep down in
> the freeing part of MM core smells like "be careful".
>
>> It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.
> That makes things very ugly. Especially, preallocation will most likely
> require locking.
>
>>> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
>>> won't stumble over them, waiting for them to get freed in the hypervisor.
>> As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.
> If a page is in the buddy but stuck in some other bitmap, there is
> nothing stopping another VCPU from trying to allocate it. Nitesh has
> been fighting with this problem already :)
>
>>> This sounds more like "the host requests to get free pages once in a while"
>>> compared to "the host is always informed about free pages". At the time
>>> where the host actually has to ask the guest (e.g. because the host is low on
>>> memory), it might be too late to wait for guest action.
>> Option 1: Host asks for free pages:
>> Not necessary to ask only when the host has been in memory pressure.
>> This could be the orchestration layer's job to monitor the host memory usage.
>> For example, people could set the condition "when 50% of the host memory
>> has been used, start to ask a guest for some amount of free pages" 
>>
>> Option 2: Guest actively offers free pages:
>> Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
>> will be filled into the balloon's report_vq and the host will take away the backing
>> host page.
>>
>> Both options can be implemented. But I think option 1 would be more
>> efficient as the guest free pages are offered on demand.  
> Yes, but as I mentioned this has other drawbacks. Relying on a guest
> to free up memory when you really need it is not going to work. It might
> work for some scenarios but should not dictate the design. It is a good
> start though if it makes things easier.
>
> Enabling/disabling free page hinting by the hypervisor via some
> mechanism is on the other hand a good idea. "I have plenty of free
> space, don't worry".
>
>>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
>>> candidates for removal and if the host is low on memory, only scanning the
>>> guest page tables is sufficient to free up memory.
>>>
>>> But both points might just be an implementation detail in the example you
>>> describe.
>> Yes, it is an implementation detail. I think DONTNEED would be easier
>> for the first step.
>>
>>>> In above 2), get_free_page_hints clears the bits which indicates that those
>>> pages are not ready to be used by the guest yet. Why?
>>>> This is because 3) will unmap the underlying physical pages from EPT.
>>> Normally, when guest re-visits those pages, EPT violations and QEMU page
>>> faults will get a new host page to set up the related EPT entry. If guest uses
>>> that page before the page gets unmapped (i.e. right before step 3), no EPT
>>> violation happens and the guest will use the same physical page that will be
>>> unmapped and given to other host threads. So we need to make sure that
>>> the guest free page is usable only after step 3 finishes.
>>>> Back to arch_alloc_page(), it needs to check if the allocated pages
>>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
>>> means step 2) above has happened and step 4) hasn't been reached. In this
>>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
>>> for that page Or better to have a balloon callback which prioritize 3) and 4)
>>> to make this page usable by the guest.
>>>
>>> Regarding the latter, the VCPU allocating a page cannot do anything if the
>>> page (along with other pages) is just being freed by the hypervisor.
>>> It has to busy-wait, no chance to prioritize.
>> I meant this:
>> With this approach, essentially the free pages have 2 states:
>> ready free page: the page is on the free list and it has "1" in the bitmap
>> non-ready free page: the page is on the free list and it has "0" in the bitmap
>> Ready free pages are those that can be allocated for use.
>> Non-ready free pages are those that are in the process of being reported to
>> the host and whose related EPT mapping is about to be zapped.
>>
>> The non-ready pages are inserted into the report_vq and waiting for the
>> host to zap the mappings one by one. After the mapping gets zapped
>> (which means the backing host page has been taken away), host acks to
>> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
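For illustration, a hypothetical userspace sketch (not kernel code — names like `FreePageBitmap`, `hint()` and `host_ack()` are invented here) of the two-state scheme just described: a hinted page becomes usable again only after the host has zapped its mapping and acked.

```python
from collections import deque

READY, NON_READY = 1, 0

class FreePageBitmap:
    """Hypothetical model of the ready/non-ready free-page states."""
    def __init__(self, npages):
        self.bitmap = [READY] * npages       # 1 = ready free page
        self.free_list = set(range(npages))  # pages on the buddy free list
        self.report_vq = deque()             # pages queued for the host

    def hint(self, pfn):
        # step 2: clear the bit and queue the page for reporting
        self.bitmap[pfn] = NON_READY
        self.report_vq.append(pfn)

    def host_ack(self):
        # steps 3+4: host zapped the mapping; the ack flips the bit back
        pfn = self.report_vq.popleft()
        self.bitmap[pfn] = READY
        return pfn

    def try_alloc(self, pfn):
        # arch_alloc_page() check: only a ready free page may be handed out
        if pfn in self.free_list and self.bitmap[pfn] == READY:
            self.free_list.remove(pfn)
            return True
        return False   # non-ready: the allocator can only wait

b = FreePageBitmap(4)
b.hint(2)
assert not b.try_alloc(2)   # in flight to the host, not usable yet
assert b.host_ack() == 2
assert b.try_alloc(2)       # acked, usable again
```

The failing `try_alloc()` in the middle is the contested window: an allocating VCPU has nothing useful to do there but wait for the ack.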
> Yes, that's how I understood your approach. The interesting part is
> where somebody finds a buddy page and wants to allocate it.
>
>> So non-ready free pages may happen to be used while they are waiting in
>> the report_vq to be handled by the host to zap the mapping, balloon could
>> have a fast path to notify the host:
>> "page 0x1000 is about to be used, don’t zap the mapping when you get
>> 0x1000 from the report_vq"  /*option [1] */
> This requires coordination and in any case there will be a scenario
> where you have to wait for the hypervisor to eventually finish a madv
> call. You can just try to make that scenario less likely.
>
> What you propose is synchronous in the worst case. Getting pages off the
> buddy makes it possible to have it done completely asynchronous. Nobody
> allocating a page has to wait.
>
>> Or
>>
>> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
>> so that the free page will be marked as ready free page and the guest can use it".
>> This option will generate an extra EPT violation and QEMU page fault to get a new host
>> page to back the guest ready free page.
> Again, coordination with the hypervisor while allocating a page. That is
> to be avoided in any case.
>
>>>> Using bitmaps to record free page hints doesn't require taking the free pages
>>> off the buddy list and returning them later, which goes through the long
>>> allocation/free code path.
>>> Yes, but it means that any process is able to get stuck on such a page for as
>>> long as it takes to report the free pages to the hypervisor and for it to call
>>> madvise(pfn_start, DONTNEED) on any such page.
>> This only happens when the guest thread happens to get allocated on a page which is
>> being reported to the host. Using option [1] above will avoid this.
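The effect of the hypervisor-side `madvise(MADV_DONTNEED)` can be observed from userspace on any Linux host. A sketch, with a private anonymous mapping standing in for the guest page: after DONTNEED the old backing page is gone and the next touch faults in a zero-filled page, which is why a page still in flight must not be reused before the ack.

```python
import mmap, sys

if sys.platform == "linux":   # MADV_DONTNEED semantics below are Linux-specific
    m = mmap.mmap(-1, mmap.PAGESIZE,
                  flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    m[:4] = b"data"
    assert m[:4] == b"data"
    m.madvise(mmap.MADV_DONTNEED)  # what the host does after taking the hint
    assert m[:4] == b"\x00" * 4    # old contents gone; zero-fill on next touch
    m.close()
```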
> I think getting pages out of the buddy system temporarily is the only
> way we can avoid somebody else stumbling over a page currently getting
> reported by the hypervisor. Otherwise, as I said, there are scenarios
> where an allocating VCPU has to wait for the hypervisor to finish the
> "freeing" task. While you can try to "speedup" that scenario -
> "hypervisor please prioritize" you cannot avoid it. There will be busy
> waiting.
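A hypothetical userspace sketch of this alternative (the `Buddy` class and its method names are invented for illustration): pages are pulled out of the free list before being reported, so an allocator can never stumble over an in-flight page and reporting can complete fully asynchronously.

```python
class Buddy:
    """Hypothetical model: isolate pages before reporting them."""
    def __init__(self, npages):
        self.free_list = set(range(npages))
        self.isolated = set()              # in flight to the host, invisible

    def isolate_for_report(self, pfns):
        got = self.free_list & set(pfns)   # only still-free pages qualify
        self.free_list -= got
        self.isolated |= got
        return got                         # batch handed to the report_vq

    def alloc_any(self):
        # the allocator never sees isolated pages, so it never waits on the host
        return self.free_list.pop() if self.free_list else None

    def report_done(self, pfns):
        # host finished madvise; return the pages to the free list
        self.isolated -= set(pfns)
        self.free_list |= set(pfns)

b = Buddy(4)
batch = b.isolate_for_report({0, 1})
assert batch == {0, 1}
assert all(b.alloc_any() not in batch for _ in range(2))  # only 2 and 3 left
b.report_done(batch)
assert b.free_list == {0, 1}   # usable again; no allocator ever had to wait
```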
>
> I don't believe what you describe is going to work (especially the not
> locking part when working with global resources).
>
> What would be interesting is to see if something like a xbitmap could be
> used instead of the per-vcpu list. 
Yeap, exactly.
> Nitesh, do you remember what the
> problem was with allocating memory from these hooks? Was it a locking issue?
In the previous implementation, the issue was due to locking. In the
current implementation, having an allocation under these hooks results
in lots of isolation failures under memory pressure.
If by the above statement you are referring to having a dynamic array
to hold the freed pages, then that is an idea Andrea also suggested to
get around this fixed array-size issue.
>
> Thanks!
>
>> Best,
>> Wei
>>
>
-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-13 12:17         ` Nitesh Narayan Lal
@ 2019-02-13 17:09           ` Michael S. Tsirkin
  2019-02-13 17:22             ` Nitesh Narayan Lal
  0 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-13 17:09 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: David Hildenbrand, Wang, Wei W, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange

On Wed, Feb 13, 2019 at 07:17:13AM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/13/19 4:19 AM, David Hildenbrand wrote:
> > On 13.02.19 09:55, Wang, Wei W wrote:
> >> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> >>> Global means all VCPUs will be competing potentially for a single lock when
> >>> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
> >>> memory like crazy?
> >> I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn’t it?
> >> I think atomic operations on the bitmap would be enough.
> > If you have to resize/alloc/coordinate who will report, you will need
> > locking. Especially, I doubt that there is an atomic xbitmap  (prove me
> > wrong :) ).
> >
> >>> (I assume some kind of locking is required even if the bitmap would be
> >>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> >>> memory at places where we don't want to - e.g. from arch_free_page ?)
> >> arch_free_pages is in free_pages_prepare, why can't we have memory allocation there?
> > I remember we were stumbling over some issues that were non-trivial. I
> > am not 100% sure yet anymore, but allocating memory while deep down in
> > the freeing part of MM core smells like "be careful".
> >
> >> It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.
> > That makes things very ugly. Especially, preallocation will most likely
> > require locking.
> >
> >>> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
> >>> won't stumble over them, waiting for them to get freed in the hypervisor.
> >> As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.
> > If a page is in the buddy but stuck in some other bitmap, there is
> > nothing stopping another VCPU from trying to allocate it. Nitesh has
> > been fighting with this problem already :)
> >
> >>> This sounds more like "the host requests to get free pages once in a while"
> >>> compared to "the host is always informed about free pages". At the time
> >>> where the host actually has to ask the guest (e.g. because the host is low on
> >>> memory), it might be too late to wait for guest action.
> >> Option 1: Host asks for free pages:
> >> Not necessary to ask only when the host has been in memory pressure.
> >> This could be the orchestration layer's job to monitor the host memory usage.
> >> For example, people could set the condition "when 50% of the host memory
> >> has been used, start to ask a guest for some amount of free pages" 
> >>
> >> Option 2: Guest actively offers free pages:
> >> Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
> >> will be filled into the balloon's report_vq and the host will take away the backing
> >> host page.
> >>
> >> Both options can be implemented. But I think option 1 would be more
> >> efficient as the guest free pages are offered on demand.  
> > Yes, but as I mentioned this has other drawbacks. Relying on a guest
> > to free up memory when you really need it is not going to work. It might
> > work for some scenarios but should not dictate the design. It is a good
> > start though if it makes things easier.
> >
> > Enabling/disabling free page hinting by the hypervisor via some
> > mechanism is on the other hand a good idea. "I have plenty of free
> > space, don't worry".
> >
> >>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> >>> candidates for removal and if the host is low on memory, only scanning the
> >>> guest page tables is sufficient to free up memory.
> >>>
> >>> But both points might just be an implementation detail in the example you
> >>> describe.
> >> Yes, it is an implementation detail. I think DONTNEED would be easier
> >> for the first step.
> >>
> >>>> In above 2), get_free_page_hints clears the bits which indicates that those
> >>> pages are not ready to be used by the guest yet. Why?
> >>>> This is because 3) will unmap the underlying physical pages from EPT.
> >>> Normally, when guest re-visits those pages, EPT violations and QEMU page
> >>> faults will get a new host page to set up the related EPT entry. If guest uses
> >>> that page before the page gets unmapped (i.e. right before step 3), no EPT
> >>> violation happens and the guest will use the same physical page that will be
> >>> unmapped and given to other host threads. So we need to make sure that
> >>> the guest free page is usable only after step 3 finishes.
> >>>> Back to arch_alloc_page(), it needs to check if the allocated pages
> >>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
> >>> means step 2) above has happened and step 4) hasn't been reached. In this
> >>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
> >>> for that page Or better to have a balloon callback which prioritize 3) and 4)
> >>> to make this page usable by the guest.
> >>>
> >>> Regarding the latter, the VCPU allocating a page cannot do anything if the
> >>> page (along with other pages) is just being freed by the hypervisor.
> >>> It has to busy-wait, no chance to prioritize.
> >> I meant this:
> >> With this approach, essentially the free pages have 2 states:
> >> ready free page: the page is on the free list and it has "1" in the bitmap
> >> non-ready free page: the page is on the free list and it has "0" in the bitmap
> >> Ready free pages are those that can be allocated for use.
> >> Non-ready free pages are those that are in the process of being reported to
> >> the host and whose related EPT mapping is about to be zapped.
> >>
> >> The non-ready pages are inserted into the report_vq and waiting for the
> >> host to zap the mappings one by one. After the mapping gets zapped
> >> (which means the backing host page has been taken away), host acks to
> >> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
> > Yes, that's how I understood your approach. The interesting part is
> > where somebody finds a buddy page and wants to allocate it.
> >
> >> So non-ready free pages may happen to be used while they are waiting in
> >> the report_vq to be handled by the host to zap the mapping, balloon could
> >> have a fast path to notify the host:
> >> "page 0x1000 is about to be used, don’t zap the mapping when you get
> >> 0x1000 from the report_vq"  /*option [1] */
> > This requires coordination and in any case there will be a scenario
> > where you have to wait for the hypervisor to eventually finish a madv
> > call. You can just try to make that scenario less likely.
> >
> > What you propose is synchronous in the worst case. Getting pages off the
> > buddy makes it possible to have it done completely asynchronous. Nobody
> > allocating a page has to wait.
> >
> >> Or
> >>
> >> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
> >> so that the free page will be marked as ready free page and the guest can use it".
> >> This option will generate an extra EPT violation and QEMU page fault to get a new host
> >> page to back the guest ready free page.
> > Again, coordination with the hypervisor while allocating a page. That is
> > to be avoided in any case.
> >
> >>>> Using bitmaps to record free page hints doesn't require taking the free pages
> >>> off the buddy list and returning them later, which goes through the long
> >>> allocation/free code path.
> >>> Yes, but it means that any process is able to get stuck on such a page for as
> >>> long as it takes to report the free pages to the hypervisor and for it to call
> >>> madvise(pfn_start, DONTNEED) on any such page.
> >> This only happens when the guest thread happens to get allocated on a page which is
> >> being reported to the host. Using option [1] above will avoid this.
> > I think getting pages out of the buddy system temporarily is the only
> > way we can avoid somebody else stumbling over a page currently getting
> > reported by the hypervisor. Otherwise, as I said, there are scenarios
> > where an allocating VCPU has to wait for the hypervisor to finish the
> > "freeing" task. While you can try to "speedup" that scenario -
> > "hypervisor please prioritize" you cannot avoid it. There will be busy
> > waiting.
> >
> > I don't believe what you describe is going to work (especially the not
> > locking part when working with global resources).
> >
> > What would be interesting is to see if something like a xbitmap could be
> > used instead of the per-vcpu list. 
> Yeap, exactly.
> > Nitesh, do you remember what the
> > problem was with allocating memory from these hooks? Was it a locking issue?
> In the previous implementation, the issue was due to locking. In the
> current implementation, having an allocation under these hooks results
> in lots of isolation failures under memory pressure.

But then we shouldn't be giving host memory when under pressure
at all, should we?

> If by the above statement you are referring to having a dynamic array
> to hold the freed pages, then that is an idea Andrea also suggested to
> get around this fixed array-size issue.
> >
> > Thanks!
> >
> >> Best,
> >> Wei
> >>
> >
> -- 
> Regards
> Nitesh
> 




^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-13  9:19       ` David Hildenbrand
  2019-02-13 12:17         ` Nitesh Narayan Lal
@ 2019-02-13 17:16         ` Michael S. Tsirkin
  2019-02-13 17:59           ` David Hildenbrand
  2019-02-14  9:08         ` Wang, Wei W
  2 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-13 17:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Wang, Wei W, Nitesh Narayan Lal, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange

On Wed, Feb 13, 2019 at 10:19:05AM +0100, David Hildenbrand wrote:
> On 13.02.19 09:55, Wang, Wei W wrote:
> > On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> >> Global means all VCPUs will be competing potentially for a single lock when
> >> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
> >> memory like crazy?
> > 
> > I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn’t it?
> > I think atomic operations on the bitmap would be enough.
> 
> If you have to resize/alloc/coordinate who will report, you will need
> locking. Especially, I doubt that there is an atomic xbitmap  (prove me
> wrong :) ).
> 
> > 
> >> (I assume some kind of locking is required even if the bitmap would be
> >> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> >> memory at places where we don't want to - e.g. from arch_free_page ?)
> > 
> > arch_free_pages is in free_pages_prepare, why can't we have memory allocation there?
> 
> I remember we were stumbling over some issues that were non-trivial. I
> am not 100% sure yet anymore, but allocating memory while deep down in
> the freeing part of MM core smells like "be careful".
> 
> > 
> > It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.
> 
> That makes things very ugly. Especially, preallocation will most likely
> require locking.
> 
> > 
> >>
> >> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
> >> won't stumble over them, waiting for them to get freed in the hypervisor.
> > 
> > As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.
> 
> If a page is in the buddy but stuck in some other bitmap, there is
> nothing stopping another VCPU from trying to allocate it. Nitesh has
> been fighting with this problem already :)
> 
> > 
> >> This sounds more like "the host requests to get free pages once in a while"
> >> compared to "the host is always informed about free pages". At the time
> >> where the host actually has to ask the guest (e.g. because the host is low on
> >> memory), it might be too late to wait for guest action.
> > 
> > Option 1: Host asks for free pages:
> > Not necessary to ask only when the host has been in memory pressure.
> > This could be the orchestration layer's job to monitor the host memory usage.
> > For example, people could set the condition "when 50% of the host memory
> > has been used, start to ask a guest for some amount of free pages" 
> > 
> > Option 2: Guest actively offers free pages:
> > Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
> > will be filled into the balloon's report_vq and the host will take away the backing
> > host page.
> > 
> > Both options can be implemented. But I think option 1 would be more
> > efficient as the guest free pages are offered on demand.  
> 
> Yes, but as I mentioned this has other drawbacks. Relying on a guest
> to free up memory when you really need it is not going to work. It might
> work for some scenarios but should not dictate the design. It is a good
> start though if it makes things easier.

Besides, it has already been implemented in Linux.

> Enabling/disabling free page hinting by the hypervisor via some
> mechanism is on the other hand a good idea. "I have plenty of free
> space, don't worry".

Existing mechanism includes ability to cancel reporting.

> > 
> >> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> >> candidates for removal and if the host is low on memory, only scanning the
> >> guest page tables is sufficient to free up memory.
> >>
> >> But both points might just be an implementation detail in the example you
> >> describe.
> > 
> > Yes, it is an implementation detail. I think DONTNEED would be easier
> > for the first step.
> > 
> >>
> >>>
> >>> In above 2), get_free_page_hints clears the bits which indicates that those
> >> pages are not ready to be used by the guest yet. Why?
> >>> This is because 3) will unmap the underlying physical pages from EPT.
> >> Normally, when guest re-visits those pages, EPT violations and QEMU page
> >> faults will get a new host page to set up the related EPT entry. If guest uses
> >> that page before the page gets unmapped (i.e. right before step 3), no EPT
> >> violation happens and the guest will use the same physical page that will be
> >> unmapped and given to other host threads. So we need to make sure that
> >> the guest free page is usable only after step 3 finishes.
> >>>
> >>> Back to arch_alloc_page(), it needs to check if the allocated pages
> >>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
> >> means step 2) above has happened and step 4) hasn't been reached. In this
> >> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
> >> for that page Or better to have a balloon callback which prioritize 3) and 4)
> >> to make this page usable by the guest.
> >>
> >> Regarding the latter, the VCPU allocating a page cannot do anything if the
> >> page (along with other pages) is just being freed by the hypervisor.
> >> It has to busy-wait, no chance to prioritize.
> > 
> > I meant this:
> > With this approach, essentially the free pages have 2 states:
> > ready free page: the page is on the free list and it has "1" in the bitmap
> > non-ready free page: the page is on the free list and it has "0" in the bitmap
> > Ready free pages are those that can be allocated for use.
> > Non-ready free pages are those that are in the process of being reported to
> > the host and whose related EPT mapping is about to be zapped.
> > 
> > The non-ready pages are inserted into the report_vq and waiting for the
> > host to zap the mappings one by one. After the mapping gets zapped
> > (which means the backing host page has been taken away), host acks to
> > the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
> 
> Yes, that's how I understood your approach. The interesting part is
> where somebody finds a buddy page and wants to allocate it.
> 
> > 
> > So non-ready free pages may happen to be used while they are waiting in
> > the report_vq to be handled by the host to zap the mapping, balloon could
> > have a fast path to notify the host:
> > "page 0x1000 is about to be used, don’t zap the mapping when you get
> > 0x1000 from the report_vq"  /*option [1] */
> 
> This requires coordination and in any case there will be a scenario
> where you have to wait for the hypervisor to eventually finish a madv
> call. You can just try to make that scenario less likely.
> 
> What you propose is synchronous in the worst case. Getting pages off the
> buddy makes it possible to have it done completely asynchronous. Nobody
> allocating a page has to wait.
> 
> > 
> > Or
> > 
> > "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
> > so that the free page will be marked as ready free page and the guest can use it".
> > This option will generate an extra EPT violation and QEMU page fault to get a new host
> > page to back the guest ready free page.
> 
> Again, coordination with the hypervisor while allocating a page. That is
> to be avoided in any case.
> 
> > 
> >>
> >>>
> >>> Using bitmaps to record free page hints doesn't require taking the free pages
> >> off the buddy list and returning them later, which goes through the long
> >> allocation/free code path.
> >>>
> >>
> >> Yes, but it means that any process is able to get stuck on such a page for as
> >> long as it takes to report the free pages to the hypervisor and for it to call
> >> madvise(pfn_start, DONTNEED) on any such page.
> > 
> > This only happens when the guest thread happens to get allocated on a page which is
> > being reported to the host. Using option [1] above will avoid this.
> 
> I think getting pages out of the buddy system temporarily is the only
> way we can avoid somebody else stumbling over a page currently getting
> reported by the hypervisor. Otherwise, as I said, there are scenarios
> where an allocating VCPU has to wait for the hypervisor to finish the
> "freeing" task. While you can try to "speedup" that scenario -
> "hypervisor please prioritize" you cannot avoid it. There will be busy
> waiting.

Right - there has to be waiting. But it does not have to be busy -
if you can defer page use until interrupt, that's one option.
Further if you are ready to exit to hypervisor it does not have to be
busy waiting.  In particular right now virtio does not have a capability
to stop queue processing by device.  We could add that if necessary.  In
that case, you would stop queue and detach buffers.  It is already
possible by resetting the balloon.  Naturally there is no magic - you
exit to hypervisor and block there. It's not all that great
in that VCPU does not run at all. But it is not busy waiting.


> I don't believe what you describe is going to work (especially the not
> locking part when working with global resources).
> 
> What would be interesting is to see if something like a xbitmap could be
> used instead of the per-vcpu list. Nitesh, do you remember what the
> problem was with allocating memory from these hooks? Was it a locking issue?
> 
> Thanks!
> 
> > 
> > Best,
> > Wei
> > 
> 
> 
> -- 
> 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-13 17:09           ` Michael S. Tsirkin
@ 2019-02-13 17:22             ` Nitesh Narayan Lal
       [not found]               ` <286AC319A985734F985F78AFA26841F73DF6F1C3@shsmsx102.ccr.corp.intel.com>
  0 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-13 17:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand, Wang, Wei W, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange


[-- Attachment #1.1: Type: text/plain, Size: 10057 bytes --]


On 2/13/19 12:09 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 13, 2019 at 07:17:13AM -0500, Nitesh Narayan Lal wrote:
>> On 2/13/19 4:19 AM, David Hildenbrand wrote:
>>> On 13.02.19 09:55, Wang, Wei W wrote:
>>>> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
>>>>> Global means all VCPUs will be competing potentially for a single lock when
>>>>> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
>>>>> memory like crazy?
>>>> I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn’t it?
>>>> I think atomic operations on the bitmap would be enough.
>>> If you have to resize/alloc/coordinate who will report, you will need
>>> locking. Especially, I doubt that there is an atomic xbitmap  (prove me
>>> wrong :) ).
>>>
>>>>> (I assume some kind of locking is required even if the bitmap would be
>>>>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>>>>> memory at places where we don't want to - e.g. from arch_free_page ?)
>>>> arch_free_pages is in free_pages_prepare, why can't we have memory allocation there?
>>> I remember we were stumbling over some issues that were non-trivial. I
>>> am not 100% sure yet anymore, but allocating memory while deep down in
>>> the freeing part of MM core smells like "be careful".
>>>
>>>> It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.
>>> That makes things very ugly. Especially, preallocation will most likely
>>> require locking.
>>>
>>>>> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
>>>>> won't stumble over them, waiting for them to get freed in the hypervisor.
>>>> As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.
>>> If a page is in the buddy but stuck in some other bitmap, there is
>>> nothing stopping another VCPU from trying to allocate it. Nitesh has
>>> been fighting with this problem already :)
>>>
>>>>> This sounds more like "the host requests to get free pages once in a while"
>>>>> compared to "the host is always informed about free pages". At the time
>>>>> where the host actually has to ask the guest (e.g. because the host is low on
>>>>> memory), it might be too late to wait for guest action.
>>>> Option 1: Host asks for free pages:
>>>> Not necessary to ask only when the host has been in memory pressure.
>>>> This could be the orchestration layer's job to monitor the host memory usage.
>>>> For example, people could set the condition "when 50% of the host memory
>>>> has been used, start to ask a guest for some amount of free pages" 
>>>>
>>>> Option 2: Guest actively offers free pages:
>>>> Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
>>>> will be filled into the balloon's report_vq and the host will take away the backing
>>>> host page.
>>>>
>>>> Both options can be implemented. But I think option 1 would be more
>>>> efficient as the guest free pages are offered on demand.  
>>> Yes, but as I mentioned this has other drawbacks. Relying on a guest
>>> to free up memory when you really need it is not going to work. It might
>>> work for some scenarios but should not dictate the design. It is a good
>>> start though if it makes things easier.
>>>
>>> Enabling/disabling free page hinting by the hypervisor via some
>>> mechanism is on the other hand a good idea. "I have plenty of free
>>> space, don't worry".
>>>
>>>>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
>>>>> candidates for removal and if the host is low on memory, only scanning the
>>>>> guest page tables is sufficient to free up memory.
>>>>>
>>>>> But both points might just be an implementation detail in the example you
>>>>> describe.
>>>> Yes, it is an implementation detail. I think DONTNEED would be easier
>>>> for the first step.
>>>>
>>>>>> In above 2), get_free_page_hints clears the bits which indicates that those
>>>>> pages are not ready to be used by the guest yet. Why?
>>>>>> This is because 3) will unmap the underlying physical pages from EPT.
>>>>> Normally, when guest re-visits those pages, EPT violations and QEMU page
>>>>> faults will get a new host page to set up the related EPT entry. If guest uses
>>>>> that page before the page gets unmapped (i.e. right before step 3), no EPT
>>>>> violation happens and the guest will use the same physical page that will be
>>>>> unmapped and given to other host threads. So we need to make sure that
>>>>> the guest free page is usable only after step 3 finishes.
>>>>>> Back to arch_alloc_page(), it needs to check if the allocated pages
>>>>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
>>>>> means step 2) above has happened and step 4) hasn't been reached. In this
>>>>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
>>>>> for that page Or better to have a balloon callback which prioritize 3) and 4)
>>>>> to make this page usable by the guest.
>>>>>
>>>>> Regarding the latter, the VCPU allocating a page cannot do anything if the
>>>>> page (along with other pages) is just being freed by the hypervisor.
>>>>> It has to busy-wait, no chance to prioritize.
>>>> I meant this:
>>>> With this approach, essentially the free pages have 2 states:
>>>> ready free page: the page is on the free list and it has "1" in the bitmap
>>>> non-ready free page: the page is on the free list and it has "0" in the bitmap
>>>> Ready free pages are those that can be allocated for use.
>>>> Non-ready free pages are those that are in the process of being reported to
>>>> the host and whose related EPT mapping is about to be zapped.
>>>>
>>>> The non-ready pages are inserted into the report_vq and waiting for the
>>>> host to zap the mappings one by one. After the mapping gets zapped
>>>> (which means the backing host page has been taken away), host acks to
>>>> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
>>> Yes, that's how I understood your approach. The interesting part is
>>> where somebody finds a buddy page and wants to allocate it.
>>>
>>>> So non-ready free pages may happen to be used while they are waiting in
>>>> the report_vq to be handled by the host to zap the mapping, balloon could
>>>> have a fast path to notify the host:
>>>> "page 0x1000 is about to be used, don’t zap the mapping when you get
>>>> 0x1000 from the report_vq"  /*option [1] */
>>> This requires coordination and in any case there will be a scenario
>>> where you have to wait for the hypervisor to eventually finish a madv
>>> call. You can just try to make that scenario less likely.
>>>
>>> What you propose is synchronous in the worst case. Getting pages off the
>>> buddy makes it possible to have it done completely asynchronous. Nobody
>>> allocating a page has to wait.
>>>
>>>> Or
>>>>
>>>> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
>>>> so that the free page will be marked as ready free page and the guest can use it".
>>>> This option will generate an extra EPT violation and QEMU page fault to get a new host
>>>> page to back the guest ready free page.
>>> Again, coordination with the hypervisor while allocating a page. That is
>>> to be avoided in any case.
>>>
>>>>> Using bitmaps to record free page hints doesn't need to take the free pages
>>>>> off the buddy list and return them later, which needs to go through the long
>>>>> allocation/free code path.
>>>> Yes, but it means that any process is able to get stuck on such a page for as
>>>> long as it takes to report the free pages to the hypervisor and for it to call
>>>> madvise(pfn_start, DONTNEED) on any such page.
>>>> This only happens when the guest thread happens to get allocated on a page which is
>>>> being reported to the host. Using option [1] above will avoid this.
>>> I think getting pages out of the buddy system temporarily is the only
>>> way we can avoid somebody else stumbling over a page currently getting
>>> reported by the hypervisor. Otherwise, as I said, there are scenarios
>>> where an allocating VCPU has to wait for the hypervisor to finish the
>>> "freeing" task. While you can try to "speedup" that scenario -
>>> "hypervisor please prioritize" you cannot avoid it. There will be busy
>>> waiting.
>>>
>>> I don't believe what you describe is going to work (especially the not
>>> locking part when working with global resources).
>>>
>>> What would be interesting is to see if something like an xbitmap could be
>>> used instead of the per-vcpu list.
>> Yeap, exactly.
>>> Nitesh, do you remember what the
>>> problem was with allocating memory from these hooks? Was it a locking issue?
>> In the previous implementation, the issue was due to locking. In the
>> current implementation, having an allocation under these hooks will
>> result in lots of isolation failures under memory pressure.
> But then we shouldn't be giving host memory when under pressure
> at all, should we?
Under normal conditions, yes, we would not like to report any memory when the
guest is already under memory pressure.

I am not sure about the scenario where both the guest and the host are under
memory pressure: who will be given priority? Is it something pre-decided
or does it depend on the use case?

In any case, the current implementation will not give away memory back
to the host when the guest is under continuous memory pressure.

>
>> By the above statement, if you are referring to having a dynamic array
>> to hold the freed pages.
>> Then, that is an idea Andrea also suggested to get around this fixed
>> array size issue.
>>> Thanks!
>>>
>>>> Best,
>>>> Wei
>>>>
>> -- 
>> Regards
>> Nitesh
>>
>
>
-- 
Regards
Nitesh



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-13 17:16         ` Michael S. Tsirkin
@ 2019-02-13 17:59           ` David Hildenbrand
  2019-02-13 19:08             ` Michael S. Tsirkin
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-13 17:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wang, Wei W, Nitesh Narayan Lal, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange

>>>
>>>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
>>>> candidates for removal and if the host is low on memory, only scanning the
>>>> guest page tables is sufficient to free up memory.
>>>>
>>>> But both points might just be an implementation detail in the example you
>>>> describe.
>>>
>>> Yes, it is an implementation detail. I think DONTNEED would be easier
>>> for the first step.
>>>
>>>>
>>>>>
>>>>> In above 2), get_free_page_hints clears the bits which indicates that those
>>>> pages are not ready to be used by the guest yet. Why?
>>>>> This is because 3) will unmap the underlying physical pages from EPT.
>>>> Normally, when guest re-visits those pages, EPT violations and QEMU page
>>>> faults will get a new host page to set up the related EPT entry. If guest uses
>>>> that page before the page gets unmapped (i.e. right before step 3), no EPT
>>>> violation happens and the guest will use the same physical page that will be
>>>> unmapped and given to other host threads. So we need to make sure that
>>>> the guest free page is usable only after step 3 finishes.
>>>>>
>>>>> Back to arch_alloc_page(), it needs to check if the allocated pages
>>>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
>>>> means step 2) above has happened and step 4) hasn't been reached. In this
>>>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
>>>> for that page Or better to have a balloon callback which prioritize 3) and 4)
>>>> to make this page usable by the guest.
>>>>
>>>> Regarding the latter, the VCPU allocating a page cannot do anything if the
>>>> page (along with other pages) is just being freed by the hypervisor.
>>>> It has to busy-wait, no chance to prioritize.
>>>
>>> I meant this:
>>> With this approach, essentially the free pages have 2 states:
>>> ready free page: the page is on the free list and it has "1" in the bitmap
>>> non-ready free page: the page is on the free list and it has "0" in the bitmap
>>> Ready free pages are those who can be allocated to use.
>>> Non-ready free pages are those who are in progress of being reported to
>>> host and the related EPT mapping is about to be zapped. 
>>>
>>> The non-ready pages are inserted into the report_vq and waiting for the
>>> host to zap the mappings one by one. After the mapping gets zapped
>>> (which means the backing host page has been taken away), host acks to
>>> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
>>
>> Yes, that's how I understood your approach. The interesting part is
>> where somebody finds a buddy page and wants to allocate it.
>>
>>>
>>> So the non-ready free page may happen to be used when they are waiting in
>>> the report_vq to be handled by the host to zap the mapping, balloon could
>>> have a fast path to notify the host:
>>> "page 0x1000 is about to be used, don’t zap the mapping when you get
>>> 0x1000 from the report_vq"  /*option [1] */
>>
>> This requires coordination and in any case there will be a scenario
>> where you have to wait for the hypervisor to eventually finish a madv
>> call. You can just try to make that scenario less likely.
>>
>> What you propose is synchronous in the worst case. Getting pages of the
>> buddy makes it possible to have it done completely asynchronous. Nobody
>> allocating a page has to wait.
>>
>>>
>>> Or
>>>
>>> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
>>> so that the free page will be marked as ready free page and the guest can use it".
>>> This option will generate an extra EPT violation and QEMU page fault to get a new host
>>> page to back the guest ready free page.
>>
>> Again, coordination with the hypervisor while allocating a page. That is
>> to be avoided in any case.
>>
>>>
>>>>
>>>>>
>>>>> Using bitmaps to record free page hints don't need to take the free pages
>>>> off the buddy list and return them later, which needs to go through the long
>>>> allocation/free code path.
>>>>>
>>>>
>>>> Yes, but it means that any process is able to get stuck on such a page for as
>>>> long as it takes to report the free pages to the hypervisor and for it to call
>>>> madvise(pfn_start, DONTNEED) on any such page.
>>>
>>> This only happens when the guest thread happens to get allocated on a page which is
>>> being reported to the host. Using option [1] above will avoid this.
>>
>> I think getting pages out of the buddy system temporarily is the only
>> way we can avoid somebody else stumbling over a page currently getting
>> reported by the hypervisor. Otherwise, as I said, there are scenarios
>> where a allocating VCPU has to wait for the hypervisor to finish the
>> "freeing" task. While you can try to "speedup" that scenario -
>> "hypervisor please prioritize" you cannot avoid it. There will be busy
>> waiting.
> 
> Right - there has to be waiting. But it does not have to be busy -
> if you can defer page use until interrupt, that's one option.
> Further if you are ready to exit to hypervisor it does not have to be
> busy waiting.  In particular right now virtio does not have a capability
> to stop queue processing by device.  We could add that if necessary.  In
> that case, you would stop queue and detach buffers.  It is already
> possible by resetting the balloon. Naturally there is no magic - you
> exit to hypervisor and block there. It's not all that great
> in that VCPU does not run at all. But it is not busy waiting.

Of course, you can always yield to the hypervisor and not call it busy
waiting. From the guest's point of view, it is busy waiting. The VCPU is
not making progress. If I am not wrong, one can easily construct examples
where all VCPUs in the guest are waiting for the hypervisor to
madv(dontneed) pages. I don't like that approach.

Especially if temporarily getting pages out of the buddy resolves these
issues and seems to work.


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-13 17:59           ` David Hildenbrand
@ 2019-02-13 19:08             ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-13 19:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Wang, Wei W, Nitesh Narayan Lal, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange

On Wed, Feb 13, 2019 at 06:59:24PM +0100, David Hildenbrand wrote:
> >>>
> >>>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> >>>> candidates for removal and if the host is low on memory, only scanning the
> >>>> guest page tables is sufficient to free up memory.
> >>>>
> >>>> But both points might just be an implementation detail in the example you
> >>>> describe.
> >>>
> >>> Yes, it is an implementation detail. I think DONTNEED would be easier
> >>> for the first step.
> >>>
> >>>>
> >>>>>
> >>>>> In above 2), get_free_page_hints clears the bits which indicates that those
> >>>> pages are not ready to be used by the guest yet. Why?
> >>>>> This is because 3) will unmap the underlying physical pages from EPT.
> >>>> Normally, when guest re-visits those pages, EPT violations and QEMU page
> >>>> faults will get a new host page to set up the related EPT entry. If guest uses
> >>>> that page before the page gets unmapped (i.e. right before step 3), no EPT
> >>>> violation happens and the guest will use the same physical page that will be
> >>>> unmapped and given to other host threads. So we need to make sure that
> >>>> the guest free page is usable only after step 3 finishes.
> >>>>>
> >>>>> Back to arch_alloc_page(), it needs to check if the allocated pages
> >>>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
> >>>> means step 2) above has happened and step 4) hasn't been reached. In this
> >>>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
> >>>> for that page Or better to have a balloon callback which prioritize 3) and 4)
> >>>> to make this page usable by the guest.
> >>>>
> >>>> Regarding the latter, the VCPU allocating a page cannot do anything if the
> >>>> page (along with other pages) is just being freed by the hypervisor.
> >>>> It has to busy-wait, no chance to prioritize.
> >>>
> >>> I meant this:
> >>> With this approach, essentially the free pages have 2 states:
> >>> ready free page: the page is on the free list and it has "1" in the bitmap
> >>> non-ready free page: the page is on the free list and it has "0" in the bitmap
> >>> Ready free pages are those who can be allocated to use.
> >>> Non-ready free pages are those who are in progress of being reported to
> >>> host and the related EPT mapping is about to be zapped. 
> >>>
> >>> The non-ready pages are inserted into the report_vq and waiting for the
> >>> host to zap the mappings one by one. After the mapping gets zapped
> >>> (which means the backing host page has been taken away), host acks to
> >>> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
> >>
> >> Yes, that's how I understood your approach. The interesting part is
> >> where somebody finds a buddy page and wants to allocate it.
> >>
> >>>
> >>> So the non-ready free page may happen to be used when they are waiting in
> >>> the report_vq to be handled by the host to zap the mapping, balloon could
> >>> have a fast path to notify the host:
> >>> "page 0x1000 is about to be used, don’t zap the mapping when you get
> >>> 0x1000 from the report_vq"  /*option [1] */
> >>
> >> This requires coordination and in any case there will be a scenario
> >> where you have to wait for the hypervisor to eventually finish a madv
> >> call. You can just try to make that scenario less likely.
> >>
> >> What you propose is synchronous in the worst case. Getting pages of the
> >> buddy makes it possible to have it done completely asynchronous. Nobody
> >> allocating a page has to wait.
> >>
> >>>
> >>> Or
> >>>
> >>> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
> >>> so that the free page will be marked as ready free page and the guest can use it".
> >>> This option will generate an extra EPT violation and QEMU page fault to get a new host
> >>> page to back the guest ready free page.
> >>
> >> Again, coordination with the hypervisor while allocating a page. That is
> >> to be avoided in any case.
> >>
> >>>
> >>>>
> >>>>>
> >>>>> Using bitmaps to record free page hints don't need to take the free pages
> >>>> off the buddy list and return them later, which needs to go through the long
> >>>> allocation/free code path.
> >>>>>
> >>>>
> >>>> Yes, but it means that any process is able to get stuck on such a page for as
> >>>> long as it takes to report the free pages to the hypervisor and for it to call
> >>>> madvise(pfn_start, DONTNEED) on any such page.
> >>>
> >>> This only happens when the guest thread happens to get allocated on a page which is
> >>> being reported to the host. Using option [1] above will avoid this.
> >>
> >> I think getting pages out of the buddy system temporarily is the only
> >> way we can avoid somebody else stumbling over a page currently getting
> >> reported by the hypervisor. Otherwise, as I said, there are scenarios
> >> where a allocating VCPU has to wait for the hypervisor to finish the
> >> "freeing" task. While you can try to "speedup" that scenario -
> >> "hypervisor please prioritize" you cannot avoid it. There will be busy
> >> waiting.
> > 
> > Right - there has to be waiting. But it does not have to be busy -
> > if you can defer page use until interrupt, that's one option.
> > Further if you are ready to exit to hypervisor it does not have to be
> > busy waiting.  In particular right now virtio does not have a capability
> > to stop queue processing by device.  We could add that if necessary.  In
> > that case, you would stop queue and detach buffers.  It is already
> > possible by resetting the balloon.  Naturally there is no magic - you
> > exit to hypervisor and block there. It's not all that great
> > in that VCPU does not run at all. But it is not busy waiting.
> 
> Of course, you can always yield to the hypervisor and not call it busy
> waiting. From the guest's point of view, it is busy waiting. The VCPU is
> not making progress. If I am not wrong, one can easily construct examples
> where all VCPUs in the guest are waiting for the hypervisor to
> madv(dontneed) pages. I don't like that approach.
> 
> Especially if temporarily getting pages out of the buddy resolves these
> issues and seems to work.

Well, the hypervisor can send a signal and interrupt the dontneed work.
But yes, I prefer not blocking the VCPU too.

I also prefer MADV_FREE generally.

> 
> -- 
> 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-13 12:06   ` Nitesh Narayan Lal
@ 2019-02-14  8:48     ` Wang, Wei W
  2019-02-14  9:42       ` David Hildenbrand
  2019-02-14 13:00       ` Nitesh Narayan Lal
  0 siblings, 2 replies; 116+ messages in thread
From: Wang, Wei W @ 2019-02-14  8:48 UTC (permalink / raw)
  To: 'Nitesh Narayan Lal',
	kvm, linux-kernel, pbonzini, lcapitulino, pagupta, yang.zhang.wz,
	riel, david, mst, dodgen, konrad.wilk, dhildenb, aarcange

On Wednesday, February 13, 2019 8:07 PM, Nitesh Narayan Lal wrote:
> Once the host frees the pages, all the isolated pages are returned back
> to the buddy. (This is implemented in hyperlist_ready())

This actually has the same issue: the isolated pages have to wait to return to the buddy
until the host is done with madvise(DONTNEED). Otherwise, a page could be used by
a guest thread and the next moment the host takes it away and gives it to other host threads.

Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-13  9:19       ` David Hildenbrand
  2019-02-13 12:17         ` Nitesh Narayan Lal
  2019-02-13 17:16         ` Michael S. Tsirkin
@ 2019-02-14  9:08         ` Wang, Wei W
  2019-02-14 10:00           ` David Hildenbrand
  2 siblings, 1 reply; 116+ messages in thread
From: Wang, Wei W @ 2019-02-14  9:08 UTC (permalink / raw)
  To: 'David Hildenbrand',
	Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, yang.zhang.wz, riel, mst, dodgen, konrad.wilk, dhildenb,
	aarcange

On Wednesday, February 13, 2019 5:19 PM, David Hildenbrand wrote:
> If you have to resize/alloc/coordinate who will report, you will need locking.
> Especially, I doubt that there is an atomic xbitmap  (prove me wrong :) ).

Yes, we need to change xbitmap to support it.

Just thought of another option, which would be better:
- xb_preload in prepare_alloc_pages to pre-allocate the bitmap memory;
- xb_set/clear the bit under the zone->lock, i.e. in rmqueue and free_one_page

So we use the existing zone->lock to guarantee that the xb ops
will not be concurrently called to race on the same bitmap.
And we don't add any new locks to generate new doubts.
Also, we can probably remove the arch_alloc/free_page part.

For the first step, we could optimize VIRTIO_BALLOON_F_FREE_PAGE_HINT for the live migration optimization:
- just replace alloc_pages(VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG,
                           VIRTIO_BALLOON_FREE_PAGE_ORDER)
with get_free_page_hints()

get_free_page_hints() was designed to clear the bit, and needs put_free_page_hints() to set it later after the host finishes madvise. For the live migration usage, as the host doesn't free the backing host pages, we can give get_free_page_hints() a parameter option to not clear the bit for this usage. It will be simpler and faster.

I think using get_free_page_hints() to read hints via bitmaps should be much faster than that allocation function, which takes around 15us to get a 4MB block. Another big bonus is that we don't need free_pages() to return all the pages back to the buddy (also a quite expensive operation) when migration is done.

For the second step, we can improve ballooning, e.g. a new feature VIRTIO_BALLOON_F_ADVANCED_BALLOON to use the same get_free_page_hints() and another put_free_page_hints(), along with the virtio-balloon's report_vq and ack_vq to wait for the host's ack before making the free page ready.
(I think waiting for the host ack is the overhead that the guest has to suffer for enabling memory overcommitment, and even with this v8 patch series it also needs to do that. The optimization method was described yesterday)

>> Yes, but as I mentioned this has other drawbacks. Relying on a guest to free
>> up memory when you really need it is not going to work.

Why wouldn't it work? The host can ask at any time (including when it does not urgently need memory), depending on the admin's configuration.

>It might work for
> some scenarios but should not dictate the design. It is a good start though if
> it makes things easier.
 > Enabling/disabling free page hinting by the hypervisor via some
> mechanism is on the other hand a good idea. "I have plenty of free space,
> don't worry".

Also, guests are not treated identically: the host can decide whom to offer the free pages first (offering free pages will cause the guest some performance drop).

Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
       [not found]               ` <286AC319A985734F985F78AFA26841F73DF6F1C3@shsmsx102.ccr.corp.intel.com>
@ 2019-02-14  9:34                 ` David Hildenbrand
  0 siblings, 0 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-14  9:34 UTC (permalink / raw)
  To: Wang, Wei W, 'Nitesh Narayan Lal', Michael S. Tsirkin
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, yang.zhang.wz,
	riel, dodgen, konrad.wilk, dhildenb, aarcange

On 14.02.19 10:12, Wang, Wei W wrote:
> On Thursday, February 14, 2019 1:22 AM, Nitesh Narayan Lal wrote:
>> In normal condition yes we would not like to report any memory when the
>> guest is already under memory pressure.
>>
>> I am not sure about the scenario where both guest and the host are under
>> memory pressure, who will be given priority? Is it something pre-decided or
>> it depends on the use case?
>>
> 
> That's one of the reasons that I would vote for "host to ask for free pages".

As I already said, there are scenarios where this does not work reliably.

When we

1. Let the guest report free pages
2. Allow the hypervisor to enable/disable it

We have a mechanism that is superior to what you describe.

"host to ask for free pages" can be emulated using that.

> 
> The host can have a global view of all the guests' memory states, so it's better to have the
> memory overcommitment policy defined on the host.
> 
> For example, the host can know guest1 is under memory pressure (thus not asking it)
> and guest2 has a huge amount of free memory. When the host lacks memory to run a 3rd
> guest, it can ask guest2 to offer some free memory.
> 
> Best,
> Wei
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-14  8:48     ` Wang, Wei W
@ 2019-02-14  9:42       ` David Hildenbrand
  2019-02-15  9:05         ` Wang, Wei W
  2019-02-14 13:00       ` Nitesh Narayan Lal
  1 sibling, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-14  9:42 UTC (permalink / raw)
  To: Wang, Wei W, 'Nitesh Narayan Lal',
	kvm, linux-kernel, pbonzini, lcapitulino, pagupta, yang.zhang.wz,
	riel, mst, dodgen, konrad.wilk, dhildenb, aarcange

On 14.02.19 09:48, Wang, Wei W wrote:
> On Wednesday, February 13, 2019 8:07 PM, Nitesh Narayan Lal wrote:
>> Once the host free the pages. All the isolated pages are returned back 
>> to the buddy. (This is implemented in hyperlist_ready())
> 
> This actually has the same issue: the isolated pages have to wait to return to the buddy
> after the host has done with madvise(DONTNEED). Otherwise, a page could be used by
> a guest thread and the next moment the host takes it to other host threads.

Yes indeed, that is the important bit. They must not be put back to the
buddy before they have been processed by the hypervisor. But as the
pages are not in the buddy, no one allocating a page will stumble over
such a page and try to allocate it. Threads trying to allocate memory
will simply pick another buddy page instead of "busy waiting" for that
page to be finished reporting.

> 
> Best,
> Wei
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-14  9:08         ` Wang, Wei W
@ 2019-02-14 10:00           ` David Hildenbrand
  2019-02-14 10:44             ` David Hildenbrand
  2019-02-15  9:15             ` Wang, Wei W
  0 siblings, 2 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-14 10:00 UTC (permalink / raw)
  To: Wang, Wei W, Nitesh Narayan Lal, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, mst, dodgen,
	konrad.wilk, dhildenb, aarcange

On 14.02.19 10:08, Wang, Wei W wrote:
> On Wednesday, February 13, 2019 5:19 PM, David Hildenbrand wrote:
>> If you have to resize/alloc/coordinate who will report, you will need locking.
>> Especially, I doubt that there is an atomic xbitmap  (prove me wrong :) ).
> 
> Yes, we need change xbitmap to support it.
> 
> Just thought of another option, which would be better:
> - xb_preload in prepare_alloc_pages to pre-allocate the bitmap memory;
> - xb_set/clear the bit under the zone->lock, i.e. in rmqueue and free_one_page

And how to preload without locking?

> 
> will not be concurrently called to race on the same bitmap.
> And we don't add any new locks to generate new doubts.
> Also, we can probably remove the arch_alloc/free_page part.
> 
> For the first step, we could optimize VIRTIO_BALLOON_F_FREE_PAGE_HINT for the live migration optimization:
> - just replace alloc_pages(VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG,
>                            VIRTIO_BALLOON_FREE_PAGE_ORDER)
> with get_free_page_hints()
> 
> get_free_page_hints() was designed to clear the bit, and need put_free_page_hints() to set it later after host finishes madvise. For the live migration usage, as host doesn't free the backing host pages, so we can give get_free_page_hints a parameter option to not clear the bit for this usage. It will be simpler and faster.
> 
> I think get_free_page_hints() to read hints via bitmaps should be much faster than that allocation function, which takes around 15us to get a 4MB block. Another big bonus is that we don't need free_pages() to return all the pages back to buddy (it's a quite expensive operation too) when migration is done.
> 
> For the second step, we can improve ballooning, e.g. a new feature VIRTIO_BALLOON_F_ADVANCED_BALLOON to use the same get_free_page_hints() and another put_free_page_hints(), along with the virtio-balloon's report_vq and ack_vq to wait for the host's ack before making the free page ready.
> (I think waiting for the host ack is the overhead that the guest has to suffer for enabling memory overcommitment, and even with this v8 patch series it also needs to do that. The optimization method was described yesterday)
> 

As I already said, I don't like that approach, because it has the
fundamental issue of page allocs getting blocked. That does not mean
that it is bad, but that I think what Nitesh has is superior in that
sense. Of course, things like "how to enable/disable", and much more
needs to be clarified.

If you believe in your approach, feel free to come up with a prototype.
 Especially the "no global locking" could be tricky in my opinion :)

>> Yes, but as I mentioned this has other drawbacks. Relying on a a guest to free
>> up memory when you really need it is not going to work. 
> 
> why not working? Host can ask at any time (including when not urgently need it) depending on the admin's configuration.

Because any heuristic like "I am running out of memory, quickly ask
someone who will need time to respond" is prone to fail in some
scenarios. It might work for many, but it is not a "I am running out of
memory, oh look, this page has been flagged via madv(FREE), let's just
take that."

> 
>> It might work for
>> some scenarios but should not dictate the design. It is a good start though if
>> it makes things easier.
>  > Enabling/disabling free page hintning by the hypervisor via some
>> mechanism is on the other hand a good idea. "I have plenty of free space,
>> don't worry".
> 
> Also guests are not treated identically, host can decide whom to offer the free pages first (offering free pages will cause the guest some performance drop).

Yes, it should definitely be configurable somehow. You don't want free
page hinting always and in any setup.


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-14 10:00           ` David Hildenbrand
@ 2019-02-14 10:44             ` David Hildenbrand
  2019-02-15  9:15             ` Wang, Wei W
  1 sibling, 0 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-14 10:44 UTC (permalink / raw)
  To: Wang, Wei W, Nitesh Narayan Lal, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, mst, dodgen,
	konrad.wilk, dhildenb, aarcange

On 14.02.19 11:00, David Hildenbrand wrote:
> On 14.02.19 10:08, Wang, Wei W wrote:
>> On Wednesday, February 13, 2019 5:19 PM, David Hildenbrand wrote:
>>> If you have to resize/alloc/coordinate who will report, you will need locking.
>>> Especially, I doubt that there is an atomic xbitmap  (prove me wrong :) ).
>>
>> Yes, we need change xbitmap to support it.
>>
>> Just thought of another option, which would be better:
>> - xb_preload in prepare_alloc_pages to pre-allocate the bitmap memory;
>> - xb_set/clear the bit under the zone->lock, i.e. in rmqueue and free_one_page
> 
> And how to preload without locking?
> 
>>
>> will not be concurrently called to race on the same bitmap.
>> And we don't add any new locks to generate new doubts.
>> Also, we can probably remove the arch_alloc/free_page part.
>>
>> For the first step, we could optimize VIRTIO_BALLOON_F_FREE_PAGE_HINT for the live migration optimization:
>> - just replace alloc_pages(VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG,
>>                            VIRTIO_BALLOON_FREE_PAGE_ORDER)
>> with get_free_page_hints()
>>
>> get_free_page_hints() was designed to clear the bit, and need put_free_page_hints() to set it later after host finishes madvise. For the live migration usage, as host doesn't free the backing host pages, so we can give get_free_page_hints a parameter option to not clear the bit for this usage. It will be simpler and faster.
>>
>> I think get_free_page_hints() to read hints via bitmaps should be much faster than that allocation function, which takes around 15us to get a 4MB block. Another big bonus is that we don't need free_pages() to return all the pages back to buddy (it's a quite expensive operation too) when migration is done.
>>
>> For the second step, we can improve ballooning, e.g. a new feature VIRTIO_BALLOON_F_ADVANCED_BALLOON to use the same get_free_page_hints() and another put_free_page_hints(), along with the virtio-balloon's report_vq and ack_vq to wait for the host's ack before making the free page ready.
>> (I think waiting for the host ack is the overhead that the guest has to suffer for enabling memory overcommitment, and even with this v8 patch series it also needs to do that. The optimization method was described yesterday)
>>
> 
> As I already said, I don't like that approach, because it has the
> fundamental issue of page allocs getting blocked. That does not mean
> that it is bad, but that I think what Nitesh has is superior in that
> sense. Of course, things like "how to enable/disable", and much more
> needs to be clarified.
> 
> If you believe in your approach, feel free to come up with a prototype.
>  Especially the "no global locking" could be tricky in my opinion :)


I want to add that your approach makes sense if we expect that the
hypervisor will ask for free memory very rarely. Then, blocking during
page alloc is most probably acceptable. Depending on the setup, this
might or might not be the case. If you have some guests that are
allocating/freeing memory continuously, you might want to get back free
pages fairly often to move them to other guests.

In case the hypervisor asks for free pages, as we are not reporting
continuously, you would have to somehow report all pages currently free
to the hypervisor, making sure via the bitmap that they cannot be allocated.

You certainly don't want to track free pages in a bitmap if the
hypervisor is not asking for free pages, otherwise you will eventually
waste a large amount of memory tracking page states nobody cares about
in a bitmap. So you would have to use another way to initially fill the
bitmap with free pages (when the hypervisor requests it), while making
sure to avoid races with pages getting allocated just while you are
creating the bitmap.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-14  8:48     ` Wang, Wei W
  2019-02-14  9:42       ` David Hildenbrand
@ 2019-02-14 13:00       ` Nitesh Narayan Lal
  1 sibling, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-14 13:00 UTC (permalink / raw)
  To: Wang, Wei W, kvm, linux-kernel, pbonzini, lcapitulino, pagupta,
	yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk, dhildenb,
	aarcange


[-- Attachment #1.1: Type: text/plain, Size: 852 bytes --]


On 2/14/19 3:48 AM, Wang, Wei W wrote:
> On Wednesday, February 13, 2019 8:07 PM, Nitesh Narayan Lal wrote:
>> Once the host frees the pages, all the isolated pages are returned back 
>> to the buddy. (This is implemented in hyperlist_ready())
> This actually has the same issue: the isolated pages have to wait to return to the buddy
> after the host is done with madvise(MADV_DONTNEED). Otherwise, a page could be in use by
> a guest thread while, the next moment, the host gives it to other host threads.
I don't think that this will be a blocking case. Let's say there are
pages from the normal zone which are isolated and are currently being
freed by the host: at this point, even if the normal zone runs out of
free pages, any allocation request could be served from another zone
which still has free pages.
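The zone fallback being relied on here can be sketched as a toy model (plain C, hypothetical names, not the real allocator): an allocation tries the preferred zone first and simply falls through to the remaining zones instead of blocking on the isolated pages:

```c
#include <assert.h>

enum { ZONE_DMA, ZONE_NORMAL, ZONE_MOVABLE, NR_ZONES };

/* Free (non-isolated) pages left per zone -- toy numbers. */
static int zone_free[NR_ZONES];

/*
 * Minimal model of zonelist fallback: try the preferred zone first,
 * then fall back to the lower zones, the way the real allocator walks
 * its zonelist.  Isolating pages in one zone therefore does not block
 * allocations as long as another zone still has free pages.
 * Returns the zone that served the request, or -1 if all are empty.
 */
static int alloc_page_from(int preferred)
{
	for (int z = preferred; z >= 0; z--) {
		if (zone_free[z] > 0) {
			zone_free[z]--;
			return z;
		}
	}
	return -1;
}
```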
>
> Best,
> Wei
-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-14  9:42       ` David Hildenbrand
@ 2019-02-15  9:05         ` Wang, Wei W
  2019-02-15  9:41           ` David Hildenbrand
  2019-02-15 12:40           ` Nitesh Narayan Lal
  0 siblings, 2 replies; 116+ messages in thread
From: Wang, Wei W @ 2019-02-15  9:05 UTC (permalink / raw)
  To: 'David Hildenbrand', 'Nitesh Narayan Lal',
	kvm, linux-kernel, pbonzini, lcapitulino, pagupta, yang.zhang.wz,
	riel, mst, dodgen, konrad.wilk, dhildenb, aarcange

On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
> Yes indeed, that is the important bit. They must not be put back to the
> buddy before they have been processed by the hypervisor. But as the pages
> are not in the buddy, no one allocating a page will stumble over such a page
> and try to allocate it. Threads trying to allocate memory will simply pick
> another buddy page instead of "busy waiting" for that page to be finished
> reporting.

What if a guest thread tries to allocate some pages but the buddy cannot satisfy
the request because all the pages are isolated? Would it be the same case, where the guest thread
gets blocked waiting for all the isolated pages to get madvised by the host and
returned to the guest buddy, or, even worse, some guest threads get killed due to OOM?

Best,
Wei


* RE: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-14 10:00           ` David Hildenbrand
  2019-02-14 10:44             ` David Hildenbrand
@ 2019-02-15  9:15             ` Wang, Wei W
  2019-02-15  9:33               ` David Hildenbrand
  1 sibling, 1 reply; 116+ messages in thread
From: Wang, Wei W @ 2019-02-15  9:15 UTC (permalink / raw)
  To: 'David Hildenbrand',
	Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, yang.zhang.wz, riel, mst, dodgen, konrad.wilk, dhildenb,
	aarcange

On Thursday, February 14, 2019 6:01 PM, David Hildenbrand wrote:
> And how to preload without locking?

The memory is preloaded per-CPU. The preload is usually done outside the lock.

Best,
Wei


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-15  9:15             ` Wang, Wei W
@ 2019-02-15  9:33               ` David Hildenbrand
  0 siblings, 0 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-15  9:33 UTC (permalink / raw)
  To: Wang, Wei W, Nitesh Narayan Lal, kvm, linux-kernel, pbonzini,
	lcapitulino, pagupta, yang.zhang.wz, riel, mst, dodgen,
	konrad.wilk, dhildenb, aarcange

On 15.02.19 10:15, Wang, Wei W wrote:
> On Thursday, February 14, 2019 6:01 PM, David Hildenbrand wrote:
>> And how to preload without locking?
> 
> The memory is preloaded per-CPU. The preload is usually done outside the lock.

Right, that works as long as only a fixed number of pages is needed. I
remember that working for some radix tree operations.
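The preload idiom referred to here (as used by e.g. radix_tree_preload()) can be sketched in a few lines of userspace C; all names are hypothetical, and a single global pool stands in for the kernel's per-CPU pools:

```c
#include <assert.h>
#include <stdlib.h>

#define PRELOAD_SLOTS 4

/* The kernel keeps one such pool per CPU; this toy keeps one total. */
static void *preload_pool[PRELOAD_SLOTS];
static int preload_nr;

/* Called outside the lock, where allocation may sleep: top up the pool. */
static int node_preload(void)
{
	while (preload_nr < PRELOAD_SLOTS) {
		void *node = malloc(64);
		if (!node)
			return -1;
		preload_pool[preload_nr++] = node;
	}
	return 0;
}

/* Called under the lock: must not allocate, only consume the pool. */
static void *node_alloc_locked(void)
{
	if (preload_nr == 0)
		return NULL;	/* pool exhausted: caller preloaded too little */
	return preload_pool[--preload_nr];
}
```

As noted above, this only works when the critical section needs a bounded, known-in-advance number of objects, since the pool is fixed in size.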

> 
> Best,
> Wei
> 


-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-15  9:05         ` Wang, Wei W
@ 2019-02-15  9:41           ` David Hildenbrand
  2019-02-18  2:36             ` Wei Wang
  2019-02-15 12:40           ` Nitesh Narayan Lal
  1 sibling, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-15  9:41 UTC (permalink / raw)
  To: Wang, Wei W, 'Nitesh Narayan Lal',
	kvm, linux-kernel, pbonzini, lcapitulino, pagupta, yang.zhang.wz,
	riel, mst, dodgen, konrad.wilk, dhildenb, aarcange

On 15.02.19 10:05, Wang, Wei W wrote:
> On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
>> Yes indeed, that is the important bit. They must not be put back to the
>> buddy before they have been processed by the hypervisor. But as the pages
>> are not in the buddy, no one allocating a page will stumble over such a page
>> and try to allocate it. Threads trying to allocate memory will simply pick
>> another buddy page instead of "busy waiting" for that page to be finished
>> reporting.
> 
> What if a guest thread tries to allocate some pages but the buddy cannot satisfy
> the request because all the pages are isolated? Would it be the same case, where the guest thread
> gets blocked waiting for all the isolated pages to get madvised by the host and
> returned to the guest buddy, or, even worse, some guest threads get killed due to OOM?

Your question targets low memory situations in the guest. I think Nitesh
already answered parts of that question somewhere and I'll let him
answer it in detail, only a short comment from my side :)

I can imagine techniques where the OOM killer can be avoided, but the
OOM handler will eventually kick in and handle it.

In general your question is valid and we will have to think about a way
to avoid that from happening. However, in contrast to your approach,
which blocks on potentially every page that is being hinted, in Nitesh's
approach this would only happen when the guest is really low on memory.
And the more general question is whether a guest should hint at all when
it is low on memory ("safety buffer").

> 
> Best,
> Wei
> 


-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-15  9:05         ` Wang, Wei W
  2019-02-15  9:41           ` David Hildenbrand
@ 2019-02-15 12:40           ` Nitesh Narayan Lal
  1 sibling, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-15 12:40 UTC (permalink / raw)
  To: Wang, Wei W, 'David Hildenbrand',
	kvm, linux-kernel, pbonzini, lcapitulino, pagupta, yang.zhang.wz,
	riel, mst, dodgen, konrad.wilk, dhildenb, aarcange


[-- Attachment #1.1: Type: text/plain, Size: 1156 bytes --]


On 2/15/19 4:05 AM, Wang, Wei W wrote:
> On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
>> Yes indeed, that is the important bit. They must not be put back to the
>> buddy before they have been processed by the hypervisor. But as the pages
>> are not in the buddy, no one allocating a page will stumble over such a page
>> and try to allocate it. Threads trying to allocate memory will simply pick
>> another buddy page instead of "busy waiting" for that page to be finished
>> reporting.
> What if a guest thread tries to allocate some pages but the buddy cannot satisfy
> the request because all the pages are isolated? Would it be the same case, where the guest thread
> gets blocked waiting for all the isolated pages to get madvised by the host and
> returned to the guest buddy, or, even worse, some guest threads get killed due to OOM?

If you are referring to a situation where the guest is under memory
pressure, then the isolation request will fail and the memory will not be
reported/isolated. However, there can always be a corner case, and we can
definitely take that into consideration later.

>
> Best,
> Wei
-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
                   ` (9 preceding siblings ...)
  2019-02-13  9:00 ` Wang, Wei W
@ 2019-02-16  9:40 ` David Hildenbrand
  2019-02-18 15:50   ` Nitesh Narayan Lal
  2019-02-18 16:49   ` Michael S. Tsirkin
  2019-02-23  0:02 ` Alexander Duyck
  11 siblings, 2 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-16  9:40 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, Michael S. Tsirkin,
	Alexander Duyck

On 04.02.19 21:18, Nitesh Narayan Lal wrote:

Hi Nitesh,

I thought again about how s390x handles free page hinting. As that seems
to work just fine, I guess sticking to a similar model makes sense.


I already explained in this thread how it works on s390x, a short summary:

1. Each VCPU has a buffer of pfns to be reported to the hypervisor. If I
am not wrong, it contains 512 entries, so it is exactly one page big. This
buffer is stored in the hypervisor and works at page granularity.

2. This page buffer is managed via the ESSA instruction. In addition, to
synchronize with the guest ("page reused when freeing in the
hypervisor"), special bits in the host->guest page table can be
set/locked via the ESSA instruction by the guest and similarly accessed
by the hypervisor.

3. Once the buffer is full, the guest does a synchronous hypercall,
going over all 512 entries and zapping them (== similar to MADV_DONTNEED)


To mimic that, we

1. Have a static buffer per VCPU in the guest with 512 entries. You
basically have that already.

2. On every free, add the page _or_ the page after merging by the buddy
(e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
than s390x). You basically have that already.

3. If the buffer is full, try to isolate all pages and do a synchronous
report to the hypervisor. You have the first part already. The second
part would require a change (don't use a separate/global thread to do
the hinting, just do it synchronously).

4. Once hinting is done, put back all isolated pages to the buddy. You
basically have that already.


For 3. we can try what you have right now, using virtio. If we detect
that's a problem, we can do it similar to what Alexander proposes and
just do a bare hypercall. It's just a different way of carrying out the
same task.
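A minimal model of steps 1-3 in plain userspace C (all names hypothetical, the synchronous report reduced to a counter) just to show the control flow of a fixed 512-entry buffer flushed synchronously from the free path:

```c
#include <assert.h>
#include <string.h>

#define HINT_ENTRIES 512	/* one page of 64-bit pfns, as on s390x */

struct hint_buf {
	unsigned long pfn[HINT_ENTRIES];
	int nr;
	int reports;		/* synchronous reports done so far */
};

/*
 * Stand-in for the synchronous report: on s390x this is the hypercall
 * zapping all 512 entries; with virtio it would be a blocking trip
 * over the report virtqueue.  Afterwards the isolated pages would be
 * put back to the buddy.
 */
static void report_to_host(struct hint_buf *b)
{
	b->reports++;
	b->nr = 0;
}

/* Hook on the freeing path (cf. arch_free_page()). */
static void free_page_hint(struct hint_buf *b, unsigned long pfn)
{
	b->pfn[b->nr++] = pfn;
	if (b->nr == HINT_ENTRIES)
		report_to_host(b);	/* the freeing VCPU blocks here */
}
```

Note the design point being argued: nothing in the allocation path ever waits; only every 512th free pays for the report.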


This approach
1. Mimics what s390x does, besides supporting different granularities.
To synchronize guest->host we simply take the pages off the buddy.

2. Is basically what Alexander does, however his design limitation is
that doing any hinting on smaller granularities will not work because
there will be too many synchronous hints. Bad on fragmented guests.

3. Does not require any dynamic data structures in the guest.

4. Does not block allocation paths.

5. Blocks on e.g. every 512th free. It seems to work on s390x, so why
shouldn't it work for us? We have to measure.

6. We are free to decide which granularity we report.

7. Potentially works even if the guest memory is fragmented (few
MAX_ORDER - 1 pages).

It would be worth a try. My feeling is that a synchronous report after
e.g. 512 frees should be acceptable, as it seems to be acceptable on
s390x. (basically always enabled, nobody complains).

We would have to play with how to enable/disable reporting and when to
not report because it's not worth it in the guest (e.g. low on memory).


Do you think something like this would be easy to change/implement and
measure?

Thanks!

> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables the guests with no page cache to rapidly free and reclaims memory to and from the host respectively.
> 
> Benefit:
> With this patch-series, in our test-case, executed on a single system and single NUMA node with 15GB memory, we were able to successfully launch atleast 5 guests 
> when page hinting was enabled and 3 without it. (Detailed explanation of the test procedure is provided at the bottom).
> 
> Changelog in V8:
> In this patch-series, the earlier approach [1] which was used to capture and scan the pages freed by the guest has been changed. The new approach is briefly described below:
> 
> The patch-set still leverages the existing arch_free_page() to add this functionality. It maintains a per CPU array which is used to store the pages freed by the guest. The maximum number of entries which it can hold is defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it is scanned and only the pages which are available in the buddy are stored. This process continues until the array is filled with pages which are part of the buddy free list. After which it wakes up a kernel per-cpu-thread.
> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation and if the page is not reallocated and present in the buddy, the kernel thread attempts to isolate it from the buddy. If it is successfully isolated, the page is added to another per-cpu array. Once the entire scanning process is complete, all the isolated pages are reported to the host through an existing virtio-balloon driver.
> 
> Known Issues:
> 	* Fixed array size: The problem with having a fixed/hardcoded array size arises when the size of the guest varies. For example when the guest size increases and it starts making large allocations fixed size limits this solution's ability to capture all the freed pages. This will result in less guest free memory getting reported to the host.
> 
> Known code re-work:
> 	* Plan to re-use Wei's work, which communicates the poison value to the host.
> 	* The nomenclatures used in virtio-balloon needs to be changed so that the code can easily be distinguished from Wei's Free Page Hint code.
> 	* Sorting based on zonenum, to avoid repetitive zone locks for the same zone.
> 
> Other required work:
> 	* Run other benchmarks to evaluate the performance/impact of this approach.
> 
> Test case:
> Setup:
> Memory-15837 MB
> Guest Memory Size-5 GB
> Swap-Disabled
> Test Program-Simple program which allocates 4GB memory via malloc, touches it via memset and exits.
> Use case-Number of guests that can be launched completely including the successful execution of the test program.
> Procedure: 
> The first guest is launched and once its console is up, the test allocation program is executed with 4 GB memory request (Due to this the guest occupies almost 4-5 GB of memory in the host in a system without page hinting). Once this program exits at that time another guest is launched in the host and the same process is followed. We continue launching the guests until a guest gets killed due to low memory condition in the host.
> 
> Result:
> Without Hinting-3 Guests
> With Hinting-5 to 7 Guests(Based on the amount of memory freed/captured).
> 
> [1] https://www.spinics.net/lists/kvm/msg170113.html 
> 
> 


-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-15  9:41           ` David Hildenbrand
@ 2019-02-18  2:36             ` Wei Wang
  2019-02-18  2:39               ` Wei Wang
  0 siblings, 1 reply; 116+ messages in thread
From: Wei Wang @ 2019-02-18  2:36 UTC (permalink / raw)
  To: David Hildenbrand, 'Nitesh Narayan Lal',
	kvm, linux-kernel, pbonzini, lcapitulino, pagupta, yang.zhang.wz,
	riel, mst, dodgen, konrad.wilk, dhildenb, aarcange

On 02/15/2019 05:41 PM, David Hildenbrand wrote:
> On 15.02.19 10:05, Wang, Wei W wrote:
>> On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
>>> Yes indeed, that is the important bit. They must not be put back to the
>>> buddy before they have been processed by the hypervisor. But as the pages
>>> are not in the buddy, no one allocating a page will stumble over such a page
>>> and try to allocate it. Threads trying to allocate memory will simply pick
>>> another buddy page instead of "busy waiting" for that page to be finished
>>> reporting.
>> What if a guest thread tries to allocate some pages but the buddy cannot satisfy
>> the request because all the pages are isolated? Would it be the same case, where the guest thread
>> gets blocked waiting for all the isolated pages to get madvised by the host and
>> returned to the guest buddy, or, even worse, some guest threads get killed due to OOM?
> Your question targets low memory situations in the guest. I think Nitesh
> already answered parts of that question somewhere and I'll let him
> answer it in detail, only a short comment from my side :)
>
> I can imagine techniques where the OOM killer can be avoided, but the
> OOM handler will eventually kick in and handle it.
>
> In general your question is valid and we will have to think about a way
> to avoid that from happening. However, in contrast to your approach,
> which blocks on potentially every page that is being hinted, in Nitesh's
> approach this would only happen when the guest is really low on memory.
> And the more general question is whether a guest should hint at all when
> it is low on memory ("safety buffer").

I think we should forget that the guest is low on memory because
this approach takes all the pages off the list, not because the guest really
uses up the free memory.

A guest allocating one page could also potentially be blocked until all
the pages (as opposed to one page) are madvised and returned to the
guest buddy.

Best,
Wei


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18  2:36             ` Wei Wang
@ 2019-02-18  2:39               ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2019-02-18  2:39 UTC (permalink / raw)
  To: David Hildenbrand, 'Nitesh Narayan Lal',
	kvm, linux-kernel, pbonzini, lcapitulino, pagupta, yang.zhang.wz,
	riel, mst, dodgen, konrad.wilk, dhildenb, aarcange

On 02/18/2019 10:36 AM, Wei Wang wrote:
> On 02/15/2019 05:41 PM, David Hildenbrand wrote:
>> On 15.02.19 10:05, Wang, Wei W wrote:
>>> On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
>>>> Yes indeed, that is the important bit. They must not be put back to 
>>>> the
>>>> buddy before they have been processed by the hypervisor. But as the 
>>>> pages
>>>> are not in the buddy, no one allocating a page will stumble over 
>>>> such a page
>>>> and try to allocate it. Threads trying to allocate memory will 
>>>> simply pick
>>>> another buddy page instead of "busy waiting" for that page to be 
>>>> finished
>>>> reporting.
>>> What if a guest thread tries to allocate some pages but the buddy 
>>> cannot satisfy
>>> because all the pages are isolated? Would it be the same case that 
>>> the guest thread
>>> gets blocked by waiting all the isolated pages to get madvised by 
>>> the host and
>>> returned to the guest buddy, or even worse, some guest threads get 
>>> killed due to oom?
>> Your question targets low memory situations in the guest. I think Nitesh
>> already answered parts of that question somewhere and I'll let him
>> answer it in detail, only a short comment from my side :)
>>
>> I can imagine techniques where the OOM killer can be avoided, but the
>> OOM handler will eventually kick in and handle it.
>>
>> In general your question is valid and we will have to think about a way
>> to avoid that from happening. However, in contrast to your approach,
>> which blocks on potentially every page that is being hinted, in Nitesh's
>> approach this would only happen when the guest is really low on memory.
>> And the more general question is whether a guest should hint at all when
>> it is low on memory ("safety buffer").
>
> I think we should forget that the guest is low on memory because
%s/should/shouldn't
> this approach takes all the pages off the list, not because the guest 
> really
> uses up the free memory.
>
> A guest allocating one page could also potentially be blocked until all
> the pages (as opposed to one page) are madvised and returned to the
> guest buddy.
>
> Best,
> Wei



* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-16  9:40 ` David Hildenbrand
@ 2019-02-18 15:50   ` Nitesh Narayan Lal
  2019-02-18 16:02     ` David Hildenbrand
  2019-02-18 16:49   ` Michael S. Tsirkin
  1 sibling, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-18 15:50 UTC (permalink / raw)
  To: David Hildenbrand, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, Alexander Duyck


[-- Attachment #1.1: Type: text/plain, Size: 6965 bytes --]


On 2/16/19 4:40 AM, David Hildenbrand wrote:
> On 04.02.19 21:18, Nitesh Narayan Lal wrote:
>
> Hi Nitesh,
>
> I thought again about how s390x handles free page hinting. As that seems
> to work just fine, I guess sticking to a similar model makes sense.
>
>
> I already explained in this thread how it works on s390x, a short summary:
>
> 1. Each VCPU has a buffer of pfns to be reported to the hypervisor. If I
> am not wrong, it contains 512 entries, so is exactly 1 page big. This
> buffer is stored in the hypervisor and is on page granularity.
>
> 2. This page buffer is managed via the ESSA instruction. In addition, to
> synchronize with the guest ("page reused when freeing in the
> hypervisor"), special bits in the host->guest page table can be
> set/locked via the ESSA instruction by the guest and similarly accessed
> by the hypervisor.
>
> 3. Once the buffer is full, the guest does a synchronous hypercall,
> going over all 512 entries and zapping them (== similar to MADV_DONTNEED)
>
>
> To mimic that, we
>
> 1. Have a static buffer per VCPU in the guest with 512 entries. You
> basically have that already.
>
> 2. On every free, add the page _or_ the page after merging by the buddy
> (e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
> than s390x). You basically have that already.
>
> 3. If the buffer is full, try to isolate all pages and do a synchronous
> report to the hypervisor. You have the first part already. The second
> part would require a change (don't use a separate/global thread to do
> the hinting, just do it synchronously).
>
> 4. Once hinting is done, put back all isolated pages to the buddy. You
> basically have that already.
>
>
> For 3. we can try what you have right now, using virtio. If we detect
> that's a problem, we can do it similar to what Alexander proposes and
> just do a bare hypercall. It's just a different way of carrying out the
> same task.
>
>
> This approach
> 1. Mimics what s390x does, besides supporting different granularities.
> To synchronize guest->host we simply take the pages off the buddy.
>
> 2. Is basically what Alexander does, however his design limitation is
> that doing any hinting on smaller granularities will not work because
> there will be too many synchronous hints. Bad on fragmented guests.
>
> 3. Does not require any dynamic data structures in the guest.
>
> 4. Does not block allocation paths.
>
> 5. Blocks on e.g. every 512th free. It seems to work on s390x, so why
> shouldn't it work for us? We have to measure.
>
> 6. We are free to decide which granularity we report.
>
> 7. Potentially works even if the guest memory is fragmented (few
> MAX_ORDER - 1 pages).
>
> It would be worth a try. My feeling is that a synchronous report after
> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> s390x. (basically always enabled, nobody complains).

The reason I like the current approach of reporting via a separate kernel
thread is that it doesn't block any regular allocation/freeing code path
in any way.
>
> We would have to play with how to enable/disable reporting and when to
> not report because it's not worth it in the guest (e.g. low on memory).
>
>
> Do you think something like this would be easy to change/implement and
> measure?

I can do that once I figure out a real-world guest workload with which
the two approaches can be compared.

> Thanks!
>
>> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables the guests with no page cache to rapidly free and reclaims memory to and from the host respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system and single NUMA node with 15GB memory, we were able to successfully launch atleast 5 guests 
>> when page hinting was enabled and 3 without it. (Detailed explanation of the test procedure is provided at the bottom).
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and scan the pages freed by the guest has been changed. The new approach is briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this functionality. It maintains a per CPU array which is used to store the pages freed by the guest. The maximum number of entries which it can hold is defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it is scanned and only the pages which are available in the buddy are stored. This process continues until the array is filled with pages which are part of the buddy free list. After which it wakes up a kernel per-cpu-thread.
>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation and if the page is not reallocated and present in the buddy, the kernel thread attempts to isolate it from the buddy. If it is successfully isolated, the page is added to another per-cpu array. Once the entire scanning process is complete, all the isolated pages are reported to the host through an existing virtio-balloon driver.
>>
>> Known Issues:
>> 	* Fixed array size: The problem with having a fixed/hardcoded array size arises when the size of the guest varies. For example when the guest size increases and it starts making large allocations fixed size limits this solution's ability to capture all the freed pages. This will result in less guest free memory getting reported to the host.
>>
>> Known code re-work:
>> 	* Plan to re-use Wei's work, which communicates the poison value to the host.
>> 	* The nomenclatures used in virtio-balloon needs to be changed so that the code can easily be distinguished from Wei's Free Page Hint code.
>> 	* Sorting based on zonenum, to avoid repetitive zone locks for the same zone.
>>
>> Other required work:
>> 	* Run other benchmarks to evaluate the performance/impact of this approach.
>>
>> Test case:
>> Setup:
>> Memory-15837 MB
>> Guest Memory Size-5 GB
>> Swap-Disabled
>> Test Program-Simple program which allocates 4GB memory via malloc, touches it via memset and exits.
>> Use case-Number of guests that can be launched completely including the successful execution of the test program.
>> Procedure: 
>> The first guest is launched and once its console is up, the test allocation program is executed with 4 GB memory request (Due to this the guest occupies almost 4-5 GB of memory in the host in a system without page hinting). Once this program exits at that time another guest is launched in the host and the same process is followed. We continue launching the guests until a guest gets killed due to low memory condition in the host.
>>
>> Result:
>> Without Hinting-3 Guests
>> With Hinting-5 to 7 Guests(Based on the amount of memory freed/captured).
>>
>> [1] https://www.spinics.net/lists/kvm/msg170113.html 
>>
>>
>
-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 15:50   ` Nitesh Narayan Lal
@ 2019-02-18 16:02     ` David Hildenbrand
  0 siblings, 0 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-18 16:02 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, Alexander Duyck

On 18.02.19 16:50, Nitesh Narayan Lal wrote:
> 
> On 2/16/19 4:40 AM, David Hildenbrand wrote:
>> On 04.02.19 21:18, Nitesh Narayan Lal wrote:
>>
>> Hi Nitesh,
>>
>> I thought again about how s390x handles free page hinting. As that seems
>> to work just fine, I guess sticking to a similar model makes sense.
>>
>>
>> I already explained in this thread how it works on s390x, a short summary:
>>
>> 1. Each VCPU has a buffer of pfns to be reported to the hypervisor. If I
>> am not wrong, it contains 512 entries, so is exactly 1 page big. This
>> buffer is stored in the hypervisor and is on page granularity.
>>
>> 2. This page buffer is managed via the ESSA instruction. In addition, to
>> synchronize with the guest ("page reused when freeing in the
>> hypervisor"), special bits in the host->guest page table can be
>> set/locked via the ESSA instruction by the guest and similarly accessed
>> by the hypervisor.
>>
>> 3. Once the buffer is full, the guest does a synchronous hypercall,
>> going over all 512 entries and zapping them (== similar to MADV_DONTNEED)
>>
>>
>> To mimic that, we
>>
>> 1. Have a static buffer per VCPU in the guest with 512 entries. You
>> basically have that already.
>>
>> 2. On every free, add the page _or_ the page after merging by the buddy
>> (e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
>> than s390x). You basically have that already.
>>
>> 3. If the buffer is full, try to isolate all pages and do a synchronous
>> report to the hypervisor. You have the first part already. The second
>> part would require a change (don't use a separate/global thread to do
>> the hinting, just do it synchronously).
>>
>> 4. Once hinting is done, put back all isolated pages to the buddy. You
>> basically have that already.
>>
>>
>> For 3. we can try what you have right now, using virtio. If we detect
>> that's a problem, we can do it similar to what Alexander proposes and
>> just do a bare hypercall. It's just a different way of carrying out the
>> same task.
>>
>>
>> This approach
>> 1. Mimics what s390x does, besides supporting different granularities.
>> To synchronize guest->host we simply take the pages off the buddy.
>>
>> 2. Is basically what Alexander does, however his design limitation is
>> that doing any hinting on smaller granularities will not work because
>> there will be too many synchronous hints. Bad on fragmented guests.
>>
>> 3. Does not require any dynamic data structures in the guest.
>>
>> 4. Does not block allocation paths.
>>
>> 5. Blocks on e.g. every 512th free. It seems to work on s390x, so why
>> shouldn't it work for us? We have to measure.
>>
>> 6. We are free to decide which granularity we report.
>>
>> 7. Potentially works even if the guest memory is fragmented (few
>> MAX_ORDER - 1 pages).
>>
>> It would be worth a try. My feeling is that a synchronous report after
>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>> s390x. (basically always enabled, nobody complains).
> 
> The reason I like the current approach of reporting via a separate kernel
> thread is that it doesn't block any regular allocation/freeing code path
> in any way.

Well, that is only partially true. The work has to be done "somewhere", so
once you kick a separate kernel thread, it can easily be scheduled on
the very same VCPU in the very near future. So depending on the user,
the "hiccup" is similarly visible.

Having separate kernel threads seems to raise other questions that are not
easy to answer (do we need dynamic data structures, how do we size these
data structures, how many threads do we want with e.g. a big number of
VCPUs?) and that seem to be avoidable by keeping it simple and not having
separate threads. Initially I also thought that separate threads were the
natural thing to do, but now I have the feeling that they tend to
overcomplicate the problem. (And I don't want to repeat myself, but on
s390x it seems to work this way just fine, if we want to mimic that.)
Especially without us knowing whether "doing a hypercall every X free
calls" is really a problem.

>>
>> We would have to play with how to enable/disable reporting and when to
>> not report because it's not worth it in the guest (e.g. low on memory).
>>
>>
>> Do you think something like this would be easy to change/implement and
>> measure?
> 
> I can do that as I figure out a real world guest workload using which
> the two approaches can be compared.




-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-16  9:40 ` David Hildenbrand
  2019-02-18 15:50   ` Nitesh Narayan Lal
@ 2019-02-18 16:49   ` Michael S. Tsirkin
  2019-02-18 16:59     ` David Hildenbrand
  1 sibling, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-18 16:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange, Alexander Duyck

On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> It would be worth a try. My feeling is that a synchronous report after
> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> s390x. (basically always enabled, nobody complains).

What slips under the radar on an arch like s390 might
raise issues for a popular arch like x86. My fear would be
if it's only a problem e.g. for realtime. Then you get
a condition that's very hard to trigger and affects
worst case latencies.

But really, what business has something that is supposedly
an optimization blocking a VCPU? We are just freeing up
lots of memory; why is it a good idea to slow that
process down?

-- 
MST


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 16:49   ` Michael S. Tsirkin
@ 2019-02-18 16:59     ` David Hildenbrand
  2019-02-18 17:31       ` Alexander Duyck
  2019-02-18 17:54       ` Michael S. Tsirkin
  0 siblings, 2 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-18 16:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange, Alexander Duyck

On 18.02.19 17:49, Michael S. Tsirkin wrote:
> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>> It would be worth a try. My feeling is that a synchronous report after
>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>> s390x. (basically always enabled, nobody complains).
> 
> What slips under the radar on an arch like s390 might
> raise issues for a popular arch like x86. My fear would be
> if it's only a problem e.g. for realtime. Then you get
> a condition that's very hard to trigger and affects
> worst case latencies.

Realtime should never use free page hinting. Just like it should never
use ballooning. Just like it should pin all pages in the hypervisor.

> 
> But really what business has something that is supposedly
> an optimization blocking a VCPU? We are just freeing up
> lots of memory why is it a good idea to slow that
> process down?

I first want to know that it is a problem before we declare it a
problem. I provided an example (s390x) where it does not seem to be a
problem. One hypercall ~every 512 frees. As simple as it can get.

Not trying to deny that it could be a problem on x86, but then I assume
it is only a problem in specific setups.

I would much rather prefer a simple solution that can eventually be
disabled in selected setups than a complicated solution that tries to fit
all possible setups. Realtime is one of the examples where such stuff is
to be disabled either way.

Optimization of space comes with a price (here: execution time).

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 16:59     ` David Hildenbrand
@ 2019-02-18 17:31       ` Alexander Duyck
  2019-02-18 17:41         ` David Hildenbrand
  2019-02-18 18:01         ` Michael S. Tsirkin
  2019-02-18 17:54       ` Michael S. Tsirkin
  1 sibling, 2 replies; 116+ messages in thread
From: Alexander Duyck @ 2019-02-18 17:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> > On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >> It would be worth a try. My feeling is that a synchronous report after
> >> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >> s390x. (basically always enabled, nobody complains).
> >
> > What slips under the radar on an arch like s390 might
> > raise issues for a popular arch like x86. My fear would be
> > if it's only a problem e.g. for realtime. Then you get
> > a condition that's very hard to trigger and affects
> > worst case latencies.
>
> Realtime should never use free page hinting. Just like it should never
> use ballooning. Just like it should pin all pages in the hypervisor.
>
> >
> > But really what business has something that is supposedly
> > an optimization blocking a VCPU? We are just freeing up
> > lots of memory why is it a good idea to slow that
> > process down?
>
> I first want to know that it is a problem before we declare it a
> problem. I provided an example (s390x) where it does not seem to be a
> problem. One hypercall ~every 512 frees. As simple as it can get.
>
> No trying to deny that it could be a problem on x86, but then I assume
> it is only a problem in specific setups.
>
> I would much rather prefer a simple solution that can eventually be
> disabled in selected setup than a complicated solution that tries to fit
> all possible setups. Realtime is one of the examples where such stuff is
> to be disabled either way.
>
> Optimization of space comes with a price (here: execution time).

One thing to keep in mind though is that if you are already having to
pull pages in and out of swap on the host in order to be able to provide
enough memory for the guests, the free page hinting should be a
significant win in terms of performance.

So far with my patch set that hints at the PMD level w/ THP enabled I
am not really seeing that much overhead for the hypercalls. The bigger
piece that is eating up CPU time is all the page faults and page
zeroing that is going on as we are cycling the memory in and out of
the guest. Some of that could probably be resolved by using MADV_FREE,
but if we are under actual memory pressure I suspect it would behave
similar to MADV_DONTNEED.


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 17:31       ` Alexander Duyck
@ 2019-02-18 17:41         ` David Hildenbrand
  2019-02-18 23:47           ` Alexander Duyck
  2019-02-18 18:01         ` Michael S. Tsirkin
  1 sibling, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-18 17:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 18.02.19 18:31, Alexander Duyck wrote:
> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>>>> It would be worth a try. My feeling is that a synchronous report after
>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>>>> s390x. (basically always enabled, nobody complains).
>>>
>>> What slips under the radar on an arch like s390 might
>>> raise issues for a popular arch like x86. My fear would be
>>> if it's only a problem e.g. for realtime. Then you get
>>> a condition that's very hard to trigger and affects
>>> worst case latencies.
>>
>> Realtime should never use free page hinting. Just like it should never
>> use ballooning. Just like it should pin all pages in the hypervisor.
>>
>>>
>>> But really what business has something that is supposedly
>>> an optimization blocking a VCPU? We are just freeing up
>>> lots of memory why is it a good idea to slow that
>>> process down?
>>
>> I first want to know that it is a problem before we declare it a
>> problem. I provided an example (s390x) where it does not seem to be a
>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>
>> No trying to deny that it could be a problem on x86, but then I assume
>> it is only a problem in specific setups.
>>
>> I would much rather prefer a simple solution that can eventually be
>> disabled in selected setup than a complicated solution that tries to fit
>> all possible setups. Realtime is one of the examples where such stuff is
>> to be disabled either way.
>>
>> Optimization of space comes with a price (here: execution time).
> 
> One thing to keep in mind though is that if you are already having to
> pull pages in and out of swap on the host in order be able to provide
> enough memory for the guests the free page hinting should be a
> significant win in terms of performance.

Indeed. And also, we are in a virtualized environment already; we can
get any kind of sudden hiccups. (Again, realtime has special
requirements on the setup.)

Side note: I like your approach because it is simple. I don't like your
approach because it cannot deal with fragmented memory. And that can
happen easily.

The idea I described here can similarly be an extension of your
approach, merging in the "batched reporting" Nitesh proposed, so we can
report on something < MAX_ORDER, similar to s390x. In the end it boils
down to reporting via hypercall vs. reporting via virtio. The main point
is that it is synchronous and batched. (And that we properly take care
of the race between host freeing and guest allocation.)

> 
> So far with my patch set that hints at the PMD level w/ THP enabled I
> am not really seeing that much overhead for the hypercalls. The bigger
> piece that is eating up CPU time is all the page faults and page
> zeroing that is going on as we are cycling the memory in and out of
> the guest. Some of that could probably be resolved by using MADV_FREE,
> but if we are under actual memory pressure I suspect it would behave
> similar to MADV_DONTNEED.
> 

MADV_FREE is certainly the better thing to do for hinting in my opinion.
It should result in even less overhead. Thanks for the comment about the
hypercall overhead.


-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 16:59     ` David Hildenbrand
  2019-02-18 17:31       ` Alexander Duyck
@ 2019-02-18 17:54       ` Michael S. Tsirkin
  2019-02-18 18:29         ` David Hildenbrand
  1 sibling, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-18 17:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange, Alexander Duyck

On Mon, Feb 18, 2019 at 05:59:06PM +0100, David Hildenbrand wrote:
> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> > On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >> It would be worth a try. My feeling is that a synchronous report after
> >> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >> s390x. (basically always enabled, nobody complains).
> > 
> > What slips under the radar on an arch like s390 might
> > raise issues for a popular arch like x86. My fear would be
> > if it's only a problem e.g. for realtime. Then you get
> > a condition that's very hard to trigger and affects
> > worst case latencies.
> 
> Realtime should never use free page hinting.

OK, maybe document this in the commit log. The RT project has
enough work as it is without needing to untangle
complex dependencies with other features.

> Just like it should never
> use ballooning.

Well, it's an aside, but why not ballooning? As long as the hypervisor does not touch the balloon,
and you don't touch the (weird, not really documented properly)
deflate-on-oom, you are fine.
Real time is violated when you reconfigure the balloon,
but after you are done the guest is real time again.
And management certainly knows that it did something
with the balloon at the exact same time there was a latency spike.


I think this might not work well right now, but generally
I think it should be fine. No?


> Just like it should pin all pages in the hypervisor.

BTW all this is absolutely interesting to fix.
But I agree wrt hinting being kind of like pinning.


> > 
> > But really what business has something that is supposedly
> > an optimization blocking a VCPU? We are just freeing up
> > lots of memory why is it a good idea to slow that
> > process down?
> 
> I first want to know that it is a problem before we declare it a
> problem. I provided an example (s390x) where it does not seem to be a
> problem. One hypercall ~every 512 frees. As simple as it can get.
> 
> No trying to deny that it could be a problem on x86, but then I assume
> it is only a problem in specific setups.

But which setups? How are we going to identify them?

> I would much rather prefer a simple solution that can eventually be
> disabled in selected setup than a complicated solution that tries to fit
> all possible setups.

Well, I am not sure just disabling it is reasonable. E.g. Alex shows
drastic boot time speedups. You won't be able to come to people later
and say: you need to disable this feature; yes, you will stop getting
packet loss once in a while, but you also won't be able to boot your VMs
quickly enough.

So I'm fine with a simple implementation but the interface needs to
allow the hypervisor to process hints in parallel while guest is
running.  We can then fix any issues on hypervisor without breaking
guests.


> Realtime is one of the examples where such stuff is
> to be disabled either way.

OK so we have identified realtime. Nice even though it wasn't documented
anywhere. Are there other workloads? What are they?


> Optimization of space comes with a price (here: execution time).

I am not sure I agree. If hinting patches just slowed everyone down, they
would be useless. Note how Alex showcased this by demonstrating
faster boot times.

Unlike regular ballooning, this doesn't do much to optimize space. There
are no promises, so the host must still have enough swap to fit guest
memory anyway.

All free page hinting does is reduce IO on the hypervisor.

So it's a tradeoff.

> -- 
> 
> Thanks,
> 
> David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 17:31       ` Alexander Duyck
  2019-02-18 17:41         ` David Hildenbrand
@ 2019-02-18 18:01         ` Michael S. Tsirkin
  1 sibling, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-18 18:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Feb 18, 2019 at 09:31:13AM -0800, Alexander Duyck wrote:
> > Optimization of space comes with a price (here: execution time).
> 
> One thing to keep in mind though is that if you are already having to
> pull pages in and out of swap on the host in order be able to provide
> enough memory for the guests the free page hinting should be a
> significant win in terms of performance.

Absolutely and I think that's the point of the hinting right?
To cut out swap/IO on the host.

-- 
MST


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 17:54       ` Michael S. Tsirkin
@ 2019-02-18 18:29         ` David Hildenbrand
  2019-02-18 19:16           ` Michael S. Tsirkin
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-18 18:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange, Alexander Duyck

On 18.02.19 18:54, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 05:59:06PM +0100, David Hildenbrand wrote:
>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>>>> It would be worth a try. My feeling is that a synchronous report after
>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>>>> s390x. (basically always enabled, nobody complains).
>>>
>>> What slips under the radar on an arch like s390 might
>>> raise issues for a popular arch like x86. My fear would be
>>> if it's only a problem e.g. for realtime. Then you get
>>> a condition that's very hard to trigger and affects
>>> worst case latencies.
>>
>> Realtime should never use free page hinting.
> 
> OK maybe document this in commit log. RT project has
> enough work as it is without need to untangle
> complex dependencies with other features.

We most certainly should!!

> 
>> Just like it should never
>> use ballooning.
> 
> Well its an aside but why not ballooning? As long as hypervisor does not touch the balloon,
> and you don't touch the (weird, not really documented properly)
> deflate on oom, you are fine.
> Real time is violated when you reconfigure balloon,
> but  after you are done guest is real time again.
> And management certainly knows it that it did something
> with balloon at the exact same time there was a latency spike.

Fair enough, this is a potential use case. But it goes hand in hand with
pinning/unpinning pages. So yes, while this would be possible - modify
the balloon in a "no real time" period - I doubt this is a real life
scenario. As always, I like to be taught differently :)

Similar to "start reporting on !RT activity" and "stop reporting on RT
activity"

> 
> 
> I think this might not work well right now, but generally
> I think it should be fine. No?
> 
> 
>> Just like it should pin all pages in the hypervisor.
> 
> BTW all this is absolutely interesting to fix.
> But I agree wrt hinting being kind of like pinning.

Yes, this is all interesting stuff :)

> 
> 
>>>
>>> But really what business has something that is supposedly
>>> an optimization blocking a VCPU? We are just freeing up
>>> lots of memory why is it a good idea to slow that
>>> process down?
>>
>> I first want to know that it is a problem before we declare it a
>> problem. I provided an example (s390x) where it does not seem to be a
>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>
>> No trying to deny that it could be a problem on x86, but then I assume
>> it is only a problem in specific setups.
> 
> But which setups? How are we going to identify them?

I guess it is simple (I should be careful with this word ;) ): as long as
you don't isolate + pin your CPUs in the hypervisor, you can expect any
kind of sudden hiccups. We're in a virtualized world. Real time is one
example.

Using kernel threads like Nitesh does right now? They can be scheduled
anytime by the hypervisor on the exact same CPU, unless you isolate +
pin in the hypervisor. So the same problem applies.

> 
>> I would much rather prefer a simple solution that can eventually be
>> disabled in selected setup than a complicated solution that tries to fit
>> all possible setups.
> 
> Well I am not sure just disabling it is reasonable.  E.g. Alex shows
> drastic boot time speedups.  You won't be able to come to people later
> and say oh you need to disable this feature yes you will stop getting
> packet loss once in a while but you also won't be able to boot your VMs
> quickly enough.

The guest is always free to disable once up. Yes, these are nice
details, but I consider these improvements we can work on later.

> 
> So I'm fine with a simple implementation but the interface needs to
> allow the hypervisor to process hints in parallel while guest is
> running.  We can then fix any issues on hypervisor without breaking
> guests.

Yes, I am fine with defining an interface that theoretically let's us
change the implementation in the guest later. I consider this even a
prerequisite. IMHO the interface shouldn't be different, it will be
exactly the same.

It is just a question of "who" calls the batch freeing and waits for it.
And as I outlined here, doing it without additional threads at least
spares us, for now, from having to think about dynamic data structures,
and from sometimes being unable to report "because the thread is still
busy reporting or wasn't scheduled yet".

> 
> 
>> Realtime is one of the examples where such stuff is
>> to be disabled either way.
> 
> OK so we have identified realtime. Nice even though it wasn't documented
> anywhere. Are there other workloads? What are they?

As stated above, I think these environments are easy to spot. As long as
you don't isolate and pin, surprises can happen anytime. Can you think
of others?

(this stuff really has to be documented)

> 
> 
>> Optimization of space comes with a price (here: execution time).
> 
> I am not sure I agree. If hinting patches just slowed everyone down they
> would be useless. Note how Alex show-cased this by demonstrating
> faster boot times.

Of course, like compressing the whole guest memory, things you might not
want to do ;) In the end, there has to be a net benefit.

> 
> Unlike regular ballooning, this doesn't do much to optimize space. There
> are no promises so host must still have enough swap to fit guest memory
> anyway.
> 
> All free page hinting does is reduce IO on the hypervisor.
> 
> So it's a tradeoff.

+1 to that.


The nice thing about this approach is that we can easily tweak "how many
to report in one shot" and "which sizes to report". We can play with it
fairly easily.

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 18:29         ` David Hildenbrand
@ 2019-02-18 19:16           ` Michael S. Tsirkin
  2019-02-18 19:35             ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-18 19:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange, Alexander Duyck

On Mon, Feb 18, 2019 at 07:29:44PM +0100, David Hildenbrand wrote:
> > 
> >>>
> >>> But really what business has something that is supposedly
> >>> an optimization blocking a VCPU? We are just freeing up
> >>> lots of memory why is it a good idea to slow that
> >>> process down?
> >>
> >> I first want to know that it is a problem before we declare it a
> >> problem. I provided an example (s390x) where it does not seem to be a
> >> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>
> >> No trying to deny that it could be a problem on x86, but then I assume
> >> it is only a problem in specific setups.
> > 
> > But which setups? How are we going to identify them?
> 
> I guess is simple (I should be carefuly with this word ;) ): As long as
> you don't isolate + pin your CPUs in the hypervisor, you can expect any
> kind of sudden hickups. We're in a virtualized world. Real time is one
> example.
> 
> Using kernel threads like Nitesh does right now? It can be scheduled
> anytime by the hypervisor on the exact same cpu. Unless you isolate +
> pin in the hypervor. So the same problem applies.

Right, but we know how to handle this. Many deployments already use tools
to detect host threads kicking VCPUs out.
Getting a VCPU blocked by a kfree call would be something new.


...

> > So I'm fine with a simple implementation but the interface needs to
> > allow the hypervisor to process hints in parallel while guest is
> > running.  We can then fix any issues on hypervisor without breaking
> > guests.
> 
> Yes, I am fine with defining an interface that theoretically let's us
> change the implementation in the guest later.
> I consider this even a
> prerequisite. IMHO the interface shouldn't be different, it will be
> exactly the same.
> 
> It is just "who" calls the batch freeing and waits for it. And as I
> outlined here, doing it without additional threads at least avoids us
> for now having to think about dynamic data structures and that we can
> sometimes not report "because the thread is still busy reporting or
> wasn't scheduled yet".

Sorry, I wasn't clear. I think we need the ability to change the
implementation in the *host* later. IOW, don't rely on the
host being synchronous.


-- 
MST


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 19:16           ` Michael S. Tsirkin
@ 2019-02-18 19:35             ` David Hildenbrand
  2019-02-18 19:47               ` Michael S. Tsirkin
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-18 19:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange, Alexander Duyck

On 18.02.19 20:16, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 07:29:44PM +0100, David Hildenbrand wrote:
>>>
>>>>>
>>>>> But really what business has something that is supposedly
>>>>> an optimization blocking a VCPU? We are just freeing up
>>>>> lots of memory why is it a good idea to slow that
>>>>> process down?
>>>>
>>>> I first want to know that it is a problem before we declare it a
>>>> problem. I provided an example (s390x) where it does not seem to be a
>>>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>>>
>>>> No trying to deny that it could be a problem on x86, but then I assume
>>>> it is only a problem in specific setups.
>>>
>>> But which setups? How are we going to identify them?
>>
>> I guess is simple (I should be carefuly with this word ;) ): As long as
>> you don't isolate + pin your CPUs in the hypervisor, you can expect any
>> kind of sudden hickups. We're in a virtualized world. Real time is one
>> example.
>>
>> Using kernel threads like Nitesh does right now? It can be scheduled
>> anytime by the hypervisor on the exact same cpu. Unless you isolate +
>> pin in the hypervor. So the same problem applies.
> 
> Right but we know how to handle this. Many deployments already use tools
> to detect host threads kicking VCPUs out.
> Getting VCPU blocked by a kfree call would be something new.
> 

Yes, and for s390x we already have some kfree's taking longer than
others. We have to identify when it is not okay.

> 
>>> So I'm fine with a simple implementation but the interface needs to
>>> allow the hypervisor to process hints in parallel while guest is
>>> running.  We can then fix any issues on hypervisor without breaking
>>> guests.
>>
>> Yes, I am fine with defining an interface that theoretically let's us
>> change the implementation in the guest later.
>> I consider this even a
>> prerequisite. IMHO the interface shouldn't be different, it will be
>> exactly the same.
>>
>> It is just "who" calls the batch freeing and waits for it. And as I
>> outlined here, doing it without additional threads at least avoids us
>> for now having to think about dynamic data structures and that we can
>> sometimes not report "because the thread is still busy reporting or
>> wasn't scheduled yet".
> 
> Sorry I wasn't clear. I think we need ability to change the
> implementation in the *host* later. IOW don't rely on
> host being synchronous.
> 
> 
I actually misread it :). In any case, there has to be a mechanism to
synchronize.

If we are going via a bare hypercall (like s390x, like what Alexander
proposes), it is going to be a synchronous interface either way. With just
a bare hypercall, there will not really be any blocking on the guest side.

Via virtio, I guess it is waiting for a response to a requests, right?

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 19:35             ` David Hildenbrand
@ 2019-02-18 19:47               ` Michael S. Tsirkin
  2019-02-18 20:04                 ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-18 19:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange, Alexander Duyck

On Mon, Feb 18, 2019 at 08:35:36PM +0100, David Hildenbrand wrote:
> On 18.02.19 20:16, Michael S. Tsirkin wrote:
> > On Mon, Feb 18, 2019 at 07:29:44PM +0100, David Hildenbrand wrote:
> >>>
> >>>>>
> >>>>> But really what business has something that is supposedly
> >>>>> an optimization blocking a VCPU? We are just freeing up
> >>>>> lots of memory why is it a good idea to slow that
> >>>>> process down?
> >>>>
> >>>> I first want to know that it is a problem before we declare it a
> >>>> problem. I provided an example (s390x) where it does not seem to be a
> >>>> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>>>
> >>>> No trying to deny that it could be a problem on x86, but then I assume
> >>>> it is only a problem in specific setups.
> >>>
> >>> But which setups? How are we going to identify them?
> >>
> >> I guess is simple (I should be carefuly with this word ;) ): As long as
> >> you don't isolate + pin your CPUs in the hypervisor, you can expect any
> >> kind of sudden hickups. We're in a virtualized world. Real time is one
> >> example.
> >>
> >> Using kernel threads like Nitesh does right now? It can be scheduled
> >> anytime by the hypervisor on the exact same cpu. Unless you isolate +
> >> pin in the hypervor. So the same problem applies.
> > 
> > Right but we know how to handle this. Many deployments already use tools
> > to detect host threads kicking VCPUs out.
> > Getting VCPU blocked by a kfree call would be something new.
> > 
> 
> Yes, and for s390x we already have some kfree's taking longer than
> others. We have to identify when it is not okay.

Right, even if the problem exists elsewhere, this does not make it go
away or ensure that someone will work to address it :)


> > 
> >>> So I'm fine with a simple implementation but the interface needs to
> >>> allow the hypervisor to process hints in parallel while guest is
> >>> running.  We can then fix any issues on hypervisor without breaking
> >>> guests.
> >>
> >> Yes, I am fine with defining an interface that theoretically let's us
> >> change the implementation in the guest later.
> >> I consider this even a
> >> prerequisite. IMHO the interface shouldn't be different, it will be
> >> exactly the same.
> >>
> >> It is just "who" calls the batch freeing and waits for it. And as I
> >> outlined here, doing it without additional threads at least avoids us
> >> for now having to think about dynamic data structures and that we can
> >> sometimes not report "because the thread is still busy reporting or
> >> wasn't scheduled yet".
> > 
> > Sorry I wasn't clear. I think we need ability to change the
> > implementation in the *host* later. IOW don't rely on
> > host being synchronous.
> > 
> > 
> I actually misread it :) . In any way, there has to be a mechanism to
> synchronize.
> 
> If we are going via a bare hypercall (like s390x, like what Alexander
> proposes), it is going to be a synchronous interface either way. Just a
> bare hypercall, there will not really be any blocking on the guest side.

It bothers me that we are now tied to interface being synchronous. We
won't be able to fix it if there's an issue as that would break guests.

> Via virtio, I guess it is waiting for a response to a requests, right?

For the buffer to be used, yes. And it could mean putting some pages
aside until hypervisor is done with them. Then you don't need timers or
tricks like this, you can get an interrupt and start using the memory.


> -- 
> 
> Thanks,
> 
> David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 19:47               ` Michael S. Tsirkin
@ 2019-02-18 20:04                 ` David Hildenbrand
  2019-02-18 20:31                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-18 20:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange, Alexander Duyck

>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>> guests.
>>>>
>>>> Yes, I am fine with defining an interface that theoretically let's us
>>>> change the implementation in the guest later.
>>>> I consider this even a
>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>> exactly the same.
>>>>
>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>> outlined here, doing it without additional threads at least avoids us
>>>> for now having to think about dynamic data structures and that we can
>>>> sometimes not report "because the thread is still busy reporting or
>>>> wasn't scheduled yet".
>>>
>>> Sorry I wasn't clear. I think we need ability to change the
>>> implementation in the *host* later. IOW don't rely on
>>> host being synchronous.
>>>
>>>
>> I actually misread it :) . In any way, there has to be a mechanism to
>> synchronize.
>>
>> If we are going via a bare hypercall (like s390x, like what Alexander
>> proposes), it is going to be a synchronous interface either way. Just a
>> bare hypercall, there will not really be any blocking on the guest side.
> 
> It bothers me that we are now tied to interface being synchronous. We
> won't be able to fix it if there's an issue as that would break guests.

I assume with "fix it" you mean "fix kfree taking longer on every X call"?

Yes, as I initially wrote, this mimics s390x. That might be good (we
know it has been working for years) and bad (we are inheriting the same
problem class, if it exists). And being synchronous is part of the
approach for now.

I tend to focus on the first part (we don't know anything besides it is
working) while you focus on the second part (there could be a potential
problem). Having a real problem at hand would be great, then we would
know what exactly we actually have to fix. But read below.

> 
>> Via virtio, I guess it is waiting for a response to a requests, right?
> 
> For the buffer to be used, yes. And it could mean putting some pages
> aside until hypervisor is done with them. Then you don't need timers or
> tricks like this, you can get an interrupt and start using the memory.

I am very open to such an approach as long as we can make it work and it
is not too complicated. (-> simple)

This would mean for example

1. Collect entries to be reported per VCPU in a buffer. Say magic number
256/512.

2. Once the buffer is full, do crazy "take pages out of the balloon
action" and report them to the hypervisor via virtio. Let the VCPU
continue. This will require some memory to store the request. Small
hiccup for the VCPU to kick off the reporting to the hypervisor.

3. On interrupt/response, go over the response and put the pages back to
the buddy.

(assuming that reporting a bulk of frees is better than reporting every
single free obviously)

This could allow nice things like "when OOM gets triggered, see if pages
are currently being reported and wait until they have been put back to
the buddy, return "new pages available", so in a real "low on memory"
scenario, no OOM killer would get involved. This could address the issue
Wei had with reporting when low on memory.

Is that something you have in mind? I assume we would have to allocate
memory when crafting the new requests. This is the only reason I tend to
prefer a synchronous interface for now. But if allocation is not a
problem, great.

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 20:04                 ` David Hildenbrand
@ 2019-02-18 20:31                   ` Michael S. Tsirkin
  2019-02-18 20:40                     ` Nitesh Narayan Lal
  2019-02-18 20:53                     ` David Hildenbrand
  0 siblings, 2 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-18 20:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange, Alexander Duyck

On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
> >>>>> So I'm fine with a simple implementation but the interface needs to
> >>>>> allow the hypervisor to process hints in parallel while guest is
> >>>>> running.  We can then fix any issues on hypervisor without breaking
> >>>>> guests.
> >>>>
> >>>> Yes, I am fine with defining an interface that theoretically let's us
> >>>> change the implementation in the guest later.
> >>>> I consider this even a
> >>>> prerequisite. IMHO the interface shouldn't be different, it will be
> >>>> exactly the same.
> >>>>
> >>>> It is just "who" calls the batch freeing and waits for it. And as I
> >>>> outlined here, doing it without additional threads at least avoids us
> >>>> for now having to think about dynamic data structures and that we can
> >>>> sometimes not report "because the thread is still busy reporting or
> >>>> wasn't scheduled yet".
> >>>
> >>> Sorry I wasn't clear. I think we need ability to change the
> >>> implementation in the *host* later. IOW don't rely on
> >>> host being synchronous.
> >>>
> >>>
> >> I actually misread it :) . In any way, there has to be a mechanism to
> >> synchronize.
> >>
> >> If we are going via a bare hypercall (like s390x, like what Alexander
> >> proposes), it is going to be a synchronous interface either way. Just a
> >> bare hypercall, there will not really be any blocking on the guest side.
> > 
> > It bothers me that we are now tied to interface being synchronous. We
> > won't be able to fix it if there's an issue as that would break guests.
> 
> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
> 
> Yes, as I initially wrote, this mimics s390x. That might be good (we
> know it has been working for years) and bad (we are inheriting the same
> problem class, if it exists). And being synchronous is part of the
> approach for now.

BTW on s390 are these hypercalls handled by Linux?

> I tend to focus on the first part (we don't know anything besides it is
> working) while you focus on the second part (there could be a potential
> problem). Having a real problem at hand would be great, then we would
> know what exactly we actually have to fix. But read below.

If we end up doing a hypercall per THP, maybe we could at least
not block with interrupts disabled? Poll in guest until
hypervisor reports it's done?  That would already be an
improvement IMHO. E.g. perf within guest will point you
in the right direction and towards disabling hinting.


> > 
> >> Via virtio, I guess it is waiting for a response to a requests, right?
> > 
> > For the buffer to be used, yes. And it could mean putting some pages
> > aside until hypervisor is done with them. Then you don't need timers or
> > tricks like this, you can get an interrupt and start using the memory.
> 
> I am very open to such an approach as long as we can make it work and it
> is not too complicated. (-> simple)
> 
> This would mean for example
> 
> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
> 256/512.
> 
> 2. Once the buffer is full, do crazy "take pages out of the balloon
> action" and report them to the hypervisor via virtio. Let the VCPU
> continue. This will require some memory to store the request. Small
> hickup for the VCPU to kick of the reporting to the hypervisor.
> 
> 3. On interrupt/response, go over the response and put the pages back to
> the buddy.
> 
> (assuming that reporting a bulk of frees is better than reporting every
> single free obviously)
> 
> This could allow nice things like "when OOM gets trigger, see if pages
> are currently being reported and wait until they have been put back to
> the buddy, return "new pages available", so in a real "low on memory"
> scenario, no OOM killer would get involved. This could address the issue
> Wei had with reporting when low on memory.
> 
> Is that something you have in mind?

Yes that seems more future proof I think.

> I assume we would have to allocate
> memory when crafting the new requests. This is the only reason I tend to
> prefer a synchronous interface for now. But if allocation is not a
> problem, great.

There are two main ways to avoid allocation:
1. do not add extra data on top of each chunk passed
2. add extra data but pre-allocate buffers for it

> -- 
> 
> Thanks,
> 
> David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 20:31                   ` Michael S. Tsirkin
@ 2019-02-18 20:40                     ` Nitesh Narayan Lal
  2019-02-18 21:04                       ` David Hildenbrand
  2019-02-18 20:53                     ` David Hildenbrand
  1 sibling, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-18 20:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange,
	Alexander Duyck, David Hildenbrand


On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>>>> guests.
>>>>>> Yes, I am fine with defining an interface that theoretically let's us
>>>>>> change the implementation in the guest later.
>>>>>> I consider this even a
>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>>>> exactly the same.
>>>>>>
>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>>>> outlined here, doing it without additional threads at least avoids us
>>>>>> for now having to think about dynamic data structures and that we can
>>>>>> sometimes not report "because the thread is still busy reporting or
>>>>>> wasn't scheduled yet".
>>>>> Sorry I wasn't clear. I think we need ability to change the
>>>>> implementation in the *host* later. IOW don't rely on
>>>>> host being synchronous.
>>>>>
>>>>>
>>>> I actually misread it :) . In any way, there has to be a mechanism to
>>>> synchronize.
>>>>
>>>> If we are going via a bare hypercall (like s390x, like what Alexander
>>>> proposes), it is going to be a synchronous interface either way. Just a
>>>> bare hypercall, there will not really be any blocking on the guest side.
>>> It bothers me that we are now tied to interface being synchronous. We
>>> won't be able to fix it if there's an issue as that would break guests.
>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>
>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>> know it has been working for years) and bad (we are inheriting the same
>> problem class, if it exists). And being synchronous is part of the
>> approach for now.
> BTW on s390 are these hypercalls handled by Linux?
>
>> I tend to focus on the first part (we don't know anything besides it is
>> working) while you focus on the second part (there could be a potential
>> problem). Having a real problem at hand would be great, then we would
>> know what exactly we actually have to fix. But read below.
> If we end up doing a hypercall per THP, maybe we could at least
> not block with interrupts disabled? Poll in guest until
> hypervisor reports its done?  That would already be an
> improvement IMHO. E.g. perf within guest will point you
> in the right direction and towards disabling hinting.
>
>
>>>> Via virtio, I guess it is waiting for a response to a requests, right?
>>> For the buffer to be used, yes. And it could mean putting some pages
>>> aside until hypervisor is done with them. Then you don't need timers or
>>> tricks like this, you can get an interrupt and start using the memory.
>> I am very open to such an approach as long as we can make it work and it
>> is not too complicated. (-> simple)
>>
>> This would mean for example
>>
>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>> 256/512.
>>
>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>> action" and report them to the hypervisor via virtio. Let the VCPU
>> continue. This will require some memory to store the request. Small
>> hickup for the VCPU to kick of the reporting to the hypervisor.
>>
>> 3. On interrupt/response, go over the response and put the pages back to
>> the buddy.
>>
>> (assuming that reporting a bulk of frees is better than reporting every
>> single free obviously)
>>
>> This could allow nice things like "when OOM gets trigger, see if pages
>> are currently being reported and wait until they have been put back to
>> the buddy, return "new pages available", so in a real "low on memory"
>> scenario, no OOM killer would get involved. This could address the issue
>> Wei had with reporting when low on memory.
>>
>> Is that something you have in mind?
> Yes that seems more future proof I think.
>
>> I assume we would have to allocate
>> memory when crafting the new requests. This is the only reason I tend to
>> prefer a synchronous interface for now. But if allocation is not a
>> problem, great.
> There are two main ways to avoid allocation:
> 1. do not add extra data on top of each chunk passed
If I am not wrong then this is close to what we have right now.
One issue I see right now is that I am polling while host is freeing the
memory.
In the next version I could tie the logic which returns pages to the
buddy and resets the per cpu array index value to 0 with the callback.
(i.e., it happens once we receive a response from the host)
Other change which I am testing right now is to only capture 'MAX_ORDER
- 1' pages.
> 2. add extra data but pre-allocate buffers for it
>
>> -- 
>>
>> Thanks,
>>
>> David / dhildenb
-- 
Regards
Nitesh



* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 20:31                   ` Michael S. Tsirkin
  2019-02-18 20:40                     ` Nitesh Narayan Lal
@ 2019-02-18 20:53                     ` David Hildenbrand
  1 sibling, 0 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-18 20:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk,
	dhildenb, aarcange, Alexander Duyck

On 18.02.19 21:31, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>>>> guests.
>>>>>>
>>>>>> Yes, I am fine with defining an interface that theoretically let's us
>>>>>> change the implementation in the guest later.
>>>>>> I consider this even a
>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>>>> exactly the same.
>>>>>>
>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>>>> outlined here, doing it without additional threads at least avoids us
>>>>>> for now having to think about dynamic data structures and that we can
>>>>>> sometimes not report "because the thread is still busy reporting or
>>>>>> wasn't scheduled yet".
>>>>>
>>>>> Sorry I wasn't clear. I think we need ability to change the
>>>>> implementation in the *host* later. IOW don't rely on
>>>>> host being synchronous.
>>>>>
>>>>>
>>>> I actually misread it :) . In any way, there has to be a mechanism to
>>>> synchronize.
>>>>
>>>> If we are going via a bare hypercall (like s390x, like what Alexander
>>>> proposes), it is going to be a synchronous interface either way. Just a
>>>> bare hypercall, there will not really be any blocking on the guest side.
>>>
>>> It bothers me that we are now tied to interface being synchronous. We
>>> won't be able to fix it if there's an issue as that would break guests.
>>
>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>
>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>> know it has been working for years) and bad (we are inheriting the same
>> problem class, if it exists). And being synchronous is part of the
>> approach for now.
> 
> BTW on s390 are these hypercalls handled by Linux?

I assume you mean in KVM - Yes! There is a hardware assist to handle the
"queuing of 512 pfns" but once the buffer is full, the actual hypercall
intercept will be triggered.

arch/s390/kvm/priv.c:handle_essa()

The interesting part is

down_read(&gmap->mm->mmap_sem);
for (i = 0; i < entries; ++i)
	__gmap_zap(gmap, cbrlo[i]);
up_read(&gmap->mm->mmap_sem);

cbrlo is the pfn array stored in the hypervisor.

> 
>> I tend to focus on the first part (we don't know anything besides it is
>> working) while you focus on the second part (there could be a potential
>> problem). Having a real problem at hand would be great, then we would
>> know what exactly we actually have to fix. But read below.
> 
> If we end up doing a hypercall per THP, maybe we could at least
> not block with interrupts disabled? Poll in guest until
> hypervisor reports its done?  That would already be an
> improvement IMHO. E.g. perf within guest will point you
> in the right direction and towards disabling hinting.

I think we always have the option to busy loop where we consider it more
helpful. On synchronous hypercalls, no waiting is necessary. Only on
asynchronous ones (which would most probably be virtio-based).

I don't think only reporting THPs will be future proof. So whatever we
come up with, it has to be able to deal with smaller granularities. Not
saying eventually page granularity, but at least some other orders. The
only solution to avoid overhead of many hypercalls is then to report
multiple ones in one shot.

>>>
>>>> Via virtio, I guess it is waiting for a response to a requests, right?
>>>
>>> For the buffer to be used, yes. And it could mean putting some pages
>>> aside until hypervisor is done with them. Then you don't need timers or
>>> tricks like this, you can get an interrupt and start using the memory.
>>
>> I am very open to such an approach as long as we can make it work and it
>> is not too complicated. (-> simple)
>>
>> This would mean for example
>>
>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>> 256/512.
>>
>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>> action" and report them to the hypervisor via virtio. Let the VCPU
>> continue. This will require some memory to store the request. Small
>> hickup for the VCPU to kick of the reporting to the hypervisor.
>>
>> 3. On interrupt/response, go over the response and put the pages back to
>> the buddy.
>>
>> (assuming that reporting a bulk of frees is better than reporting every
>> single free obviously)
>>
>> This could allow nice things like "when OOM gets trigger, see if pages
>> are currently being reported and wait until they have been put back to
>> the buddy, return "new pages available", so in a real "low on memory"
>> scenario, no OOM killer would get involved. This could address the issue
>> Wei had with reporting when low on memory.
>>
>> Is that something you have in mind?
> 
> Yes that seems more future proof I think.

And it would satisfy your request for an asynchronous interface. + we
would get rid of the kthread(s).

> 
>> I assume we would have to allocate
>> memory when crafting the new requests. This is the only reason I tend to
>> prefer a synchronous interface for now. But if allocation is not a
>> problem, great.
> 
> There are two main ways to avoid allocation:
> 1. do not add extra data on top of each chunk passed
> 2. add extra data but pre-allocate buffers for it
> 

It could theoretically happen that we want to free a page while the old
VCPU buffer is still being reported, so we would need a new buffer, I
assume. Busy waiting is an option (hmm), or we would have to skip that
page, but that is something I want to avoid. So allocating memory for
the request seems to be the cleanest approach.

But after all, as we are literally allocating buddy pages to report
temporarily, we can most probably also allocate memory. We will
have to look into the details.

So the options I see so far are

1. Do a synchronous hypercall, reporting a bunch of pages as described
initially in this thread. Release page to the buddy when returning from
the hypercall.

2. Do an asynchronous hypercall (allocating memory for the request),
reporting a bunch of pages. Release page to the buddy when on response
via interrupt.

Thanks for the helpful discussion Michael!

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 20:40                     ` Nitesh Narayan Lal
@ 2019-02-18 21:04                       ` David Hildenbrand
  2019-02-19  0:01                         ` Alexander Duyck
  2019-02-19 12:47                         ` Nitesh Narayan Lal
  0 siblings, 2 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-18 21:04 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Michael S. Tsirkin
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange,
	Alexander Duyck

On 18.02.19 21:40, Nitesh Narayan Lal wrote:
> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
>> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>>>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>>>>> guests.
>>>>>>> Yes, I am fine with defining an interface that theoretically let's us
>>>>>>> change the implementation in the guest later.
>>>>>>> I consider this even a
>>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>>>>> exactly the same.
>>>>>>>
>>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>>>>> outlined here, doing it without additional threads at least avoids us
>>>>>>> for now having to think about dynamic data structures and that we can
>>>>>>> sometimes not report "because the thread is still busy reporting or
>>>>>>> wasn't scheduled yet".
>>>>>> Sorry I wasn't clear. I think we need ability to change the
>>>>>> implementation in the *host* later. IOW don't rely on
>>>>>> host being synchronous.
>>>>>>
>>>>>>
>>>>> I actually misread it :) . In any way, there has to be a mechanism to
>>>>> synchronize.
>>>>>
>>>>> If we are going via a bare hypercall (like s390x, like what Alexander
>>>>> proposes), it is going to be a synchronous interface either way. Just a
>>>>> bare hypercall, there will not really be any blocking on the guest side.
>>>> It bothers me that we are now tied to interface being synchronous. We
>>>> won't be able to fix it if there's an issue as that would break guests.
>>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>>
>>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>>> know it has been working for years) and bad (we are inheriting the same
>>> problem class, if it exists). And being synchronous is part of the
>>> approach for now.
>> BTW on s390 are these hypercalls handled by Linux?
>>
>>> I tend to focus on the first part (we don't know anything besides it is
>>> working) while you focus on the second part (there could be a potential
>>> problem). Having a real problem at hand would be great, then we would
>>> know what exactly we actually have to fix. But read below.
>> If we end up doing a hypercall per THP, maybe we could at least
>> not block with interrupts disabled? Poll in guest until
>> hypervisor reports its done?  That would already be an
>> improvement IMHO. E.g. perf within guest will point you
>> in the right direction and towards disabling hinting.
>>
>>
>>>>> Via virtio, I guess it is waiting for a response to a requests, right?
>>>> For the buffer to be used, yes. And it could mean putting some pages
>>>> aside until hypervisor is done with them. Then you don't need timers or
>>>> tricks like this, you can get an interrupt and start using the memory.
>>> I am very open to such an approach as long as we can make it work and it
>>> is not too complicated. (-> simple)
>>>
>>> This would mean for example
>>>
>>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>>> 256/512.
>>>
>>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>>> action" and report them to the hypervisor via virtio. Let the VCPU
>>> continue. This will require some memory to store the request. Small
>>> hickup for the VCPU to kick of the reporting to the hypervisor.
>>>
>>> 3. On interrupt/response, go over the response and put the pages back to
>>> the buddy.
>>>
>>> (assuming that reporting a bulk of frees is better than reporting every
>>> single free obviously)
>>>
>>> This could allow nice things like "when OOM gets trigger, see if pages
>>> are currently being reported and wait until they have been put back to
>>> the buddy, return "new pages available", so in a real "low on memory"
>>> scenario, no OOM killer would get involved. This could address the issue
>>> Wei had with reporting when low on memory.
>>>
>>> Is that something you have in mind?
>> Yes that seems more future proof I think.
>>
>>> I assume we would have to allocate
>>> memory when crafting the new requests. This is the only reason I tend to
>>> prefer a synchronous interface for now. But if allocation is not a
>>> problem, great.
>> There are two main ways to avoid allocation:
>> 1. do not add extra data on top of each chunk passed
> If I am not wrong then this is close to what we have right now.

Yes, minus the kthread(s) and eventually with some sort of memory
allocation for the request. Once you're asynchronous via a notification
mechanism, there is no real need for a thread anymore, hopefully.

> One issue I see right now is that I am polling while host is freeing the
> memory.
> In the next version I could tie the logic which returns pages to the
> buddy and resets the per cpu array index value to 0 with the callback.
> (i.e.., it happens once we receive an response from the host)

The question is, what happens when freeing pages and the array is not
ready to be reused yet. In that case, you want to somehow continue
freeing pages without busy waiting or eventually not reporting pages.

The callback should put the pages back to the buddy and free the request
eventually to have a fully asynchronous mechanism.

> Other change which I am testing right now is to only capture 'MAX_ORDER
> - 1' pages.

I am not sure if this is an arbitrary number we came up with here. We
should really play with different orders to find a hot spot. I wouldn't
consider this high priority, though. Getting the whole concept right to
be able to deal with any magic number we come up with should be the ultimate
goal. (stuff that only works with huge pages I consider not future
proof, especially regarding fragmented guests which can happen easily)

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 17:41         ` David Hildenbrand
@ 2019-02-18 23:47           ` Alexander Duyck
  2019-02-19  2:45             ` Michael S. Tsirkin
                               ` (2 more replies)
  0 siblings, 3 replies; 116+ messages in thread
From: Alexander Duyck @ 2019-02-18 23:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.02.19 18:31, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> >>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >>>> It would be worth a try. My feeling is that a synchronous report after
> >>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >>>> s390x. (basically always enabled, nobody complains).
> >>>
> >>> What slips under the radar on an arch like s390 might
> >>> raise issues for a popular arch like x86. My fear would be
> >>> if it's only a problem e.g. for realtime. Then you get
> >>> a condition that's very hard to trigger and affects
> >>> worst case latencies.
> >>
> >> Realtime should never use free page hinting. Just like it should never
> >> use ballooning. Just like it should pin all pages in the hypervisor.
> >>
> >>>
> >>> But really what business has something that is supposedly
> >>> an optimization blocking a VCPU? We are just freeing up
> >>> lots of memory why is it a good idea to slow that
> >>> process down?
> >>
> >> I first want to know that it is a problem before we declare it a
> >> problem. I provided an example (s390x) where it does not seem to be a
> >> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>
> >> Not trying to deny that it could be a problem on x86, but then I assume
> >> it is only a problem in specific setups.
> >>
> >> I would much rather prefer a simple solution that can eventually be
> >> disabled in selected setups than a complicated solution that tries to fit
> >> all possible setups. Realtime is one of the examples where such stuff is
> >> to be disabled either way.
> >>
> >> Optimization of space comes with a price (here: execution time).
> >
> > One thing to keep in mind though is that if you are already having to
> > pull pages in and out of swap on the host in order to be able to provide
> > enough memory for the guests the free page hinting should be a
> > significant win in terms of performance.
>
> Indeed. And also we are in a virtualized environment already, we can
> have any kind of sudden hiccups. (again, realtime has special
> requirements on the setup)
>
> Side note: I like your approach because it is simple. I don't like your
> approach because it cannot deal with fragmented memory. And that can
> happen easily.
>
> The idea I described here can similarly be an extension of your
> approach, merging in a "batched reporting" Nitesh proposed, so we can
> report on something < MAX_ORDER, similar to s390x. In the end it boils
> down to reporting via hypercall vs. reporting via virtio. The main point
> is that it is synchronous and batched. (and that we properly take care
> of the race between host freeing and guest allocation)

I'd say the discussion is even simpler than that. My concern is more
synchronous versus asynchronous. I honestly think the cost for a
synchronous call is being overblown and we are likely to see the fault
and zeroing of pages cost more than the hypercall or virtio
transaction itself.

Also one reason why I am not a fan of working with anything less than
PMD order is because there have been issues in the past with false
memory leaks being created when hints were provided on THP pages that
essentially fragmented them. I guess khugepaged went through and
started trying to reassemble the huge pages and as a result there have
been apps that ended up consuming more memory than they would have
otherwise since they were using fragments of THP pages after doing an
MADV_DONTNEED on sections of the page.

> >
> > So far with my patch set that hints at the PMD level w/ THP enabled I
> > am not really seeing that much overhead for the hypercalls. The bigger
> > piece that is eating up CPU time is all the page faults and page
> > zeroing that is going on as we are cycling the memory in and out of
> > the guest. Some of that could probably be resolved by using MADV_FREE,
> > but if we are under actual memory pressure I suspect it would behave
> > similar to MADV_DONTNEED.
> >
>
> MADV_FREE is certainly the better thing to do for hinting in my opinion.
> It should result in even less overhead. Thanks for the comment about the
> hypercall overhead.

Yeah, no problem. The only thing I don't like about MADV_FREE is that
you have to have memory pressure before the pages really start getting
scrubbed, which is both a benefit and a drawback. Basically it defers
the freeing until you are under actual memory pressure, so when you hit
that case things start feeling much slower. That, and it limits your
allocations since the kernel doesn't recognize the pages as free until
it would have to start trying to push memory to swap.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 21:04                       ` David Hildenbrand
@ 2019-02-19  0:01                         ` Alexander Duyck
  2019-02-19  7:54                           ` David Hildenbrand
  2019-02-19 12:47                         ` Nitesh Narayan Lal
  1 sibling, 1 reply; 116+ messages in thread
From: Alexander Duyck @ 2019-02-19  0:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, Michael S. Tsirkin, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Feb 18, 2019 at 1:04 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
> > On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
> >> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
> >>>>>>>> So I'm fine with a simple implementation but the interface needs to
> >>>>>>>> allow the hypervisor to process hints in parallel while guest is
> >>>>>>>> running.  We can then fix any issues on hypervisor without breaking
> >>>>>>>> guests.
> >>>>>>> Yes, I am fine with defining an interface that theoretically lets us
> >>>>>>> change the implementation in the guest later.
> >>>>>>> I consider this even a
> >>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
> >>>>>>> exactly the same.
> >>>>>>>
> >>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
> >>>>>>> outlined here, doing it without additional threads at least avoids us
> >>>>>>> for now having to think about dynamic data structures and that we can
> >>>>>>> sometimes not report "because the thread is still busy reporting or
> >>>>>>> wasn't scheduled yet".
> >>>>>> Sorry I wasn't clear. I think we need ability to change the
> >>>>>> implementation in the *host* later. IOW don't rely on
> >>>>>> host being synchronous.
> >>>>>>
> >>>>>>
> >>>>> I actually misread it :) . In any way, there has to be a mechanism to
> >>>>> synchronize.
> >>>>>
> >>>>> If we are going via a bare hypercall (like s390x, like what Alexander
> >>>>> proposes), it is going to be a synchronous interface either way. Just a
> >>>>> bare hypercall, there will not really be any blocking on the guest side.
> >>>> It bothers me that we are now tied to interface being synchronous. We
> >>>> won't be able to fix it if there's an issue as that would break guests.
> >>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
> >>>
> >>> Yes, as I initially wrote, this mimics s390x. That might be good (we
> >>> know it has been working for years) and bad (we are inheriting the same
> >>> problem class, if it exists). And being synchronous is part of the
> >>> approach for now.
> >> BTW on s390 are these hypercalls handled by Linux?
> >>
> >>> I tend to focus on the first part (we don't know anything besides it is
> >>> working) while you focus on the second part (there could be a potential
> >>> problem). Having a real problem at hand would be great, then we would
> >>> know what exactly we actually have to fix. But read below.
> >> If we end up doing a hypercall per THP, maybe we could at least
> >> not block with interrupts disabled? Poll in guest until
> >> hypervisor reports it's done?  That would already be an
> >> improvement IMHO. E.g. perf within guest will point you
> >> in the right direction and towards disabling hinting.
> >>
> >>
> >>>>> Via virtio, I guess it is waiting for a response to a request, right?
> >>>> For the buffer to be used, yes. And it could mean putting some pages
> >>>> aside until hypervisor is done with them. Then you don't need timers or
> >>>> tricks like this, you can get an interrupt and start using the memory.
> >>> I am very open to such an approach as long as we can make it work and it
> >>> is not too complicated. (-> simple)
> >>>
> >>> This would mean for example
> >>>
> >>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
> >>> 256/512.
> >>>
> >>> 2. Once the buffer is full, do crazy "take pages out of the balloon
> >>> action" and report them to the hypervisor via virtio. Let the VCPU
> >>> continue. This will require some memory to store the request. Small
> >>> hiccup for the VCPU to kick off the reporting to the hypervisor.
> >>>
> >>> 3. On interrupt/response, go over the response and put the pages back to
> >>> the buddy.
> >>>
> >>> (assuming that reporting a bulk of frees is better than reporting every
> >>> single free obviously)
> >>>
> >>> This could allow nice things like "when OOM gets triggered, see if pages
> >>> are currently being reported and wait until they have been put back to
> >>> the buddy, return "new pages available", so in a real "low on memory"
> >>> scenario, no OOM killer would get involved. This could address the issue
> >>> Wei had with reporting when low on memory.
> >>>
> >>> Is that something you have in mind?
> >> Yes that seems more future proof I think.
> >>
> >>> I assume we would have to allocate
> >>> memory when crafting the new requests. This is the only reason I tend to
> >>> prefer a synchronous interface for now. But if allocation is not a
> >>> problem, great.
> >> There are two main ways to avoid allocation:
> >> 1. do not add extra data on top of each chunk passed
> > If I am not wrong then this is close to what we have right now.
>
> Yes, minus the kthread(s) and eventually with some sort of memory
> allocation for the request. Once you're asynchronous via a notification
> mechanism, there is no real need for a thread anymore, hopefully.
>
> > One issue I see right now is that I am polling while host is freeing the
> > memory.
> > In the next version I could tie the logic which returns pages to the
> > buddy and resets the per cpu array index value to 0 with the callback.
> > (i.e., it happens once we receive a response from the host)
>
> The question is, what happens when freeing pages and the array is not
> ready to be reused yet. In that case, you want to somehow continue
> freeing pages without busy waiting or eventually not reporting pages.
>
> The callback should put the pages back to the buddy and free the request
> eventually to have a fully asynchronous mechanism.
>
> > Other change which I am testing right now is to only capture 'MAX_ORDER
>
> I am not sure if this is an arbitrary number we came up with here. We
> should really play with different orders to find a hot spot. I wouldn't
> consider this high priority, though. Getting the whole concept right to
> be able to deal with any magic number we come up should be the ultimate
> goal. (stuff that only works with huge pages I consider not future
> proof, especially regarding fragmented guests which can happen easily)

This essentially just ends up being another trade-off of CPU versus
memory, though. Assuming we aren't using THP, we are going to take a
penalty in terms of performance but could then free individual pages of
order less than HUGETLB_PAGE_ORDER; the CPU utilization is going to be
much higher in general even without the hinting. I figure for x86 we
probably don't have too many options, since if I am not mistaken
MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.

As far as fragmentation my thought is that we may want to look into
adding support to the guest for prioritizing defragmentation on pages
lower than THP size. Then that way we could maintain the higher
overall performance with or without the hinting since shuffling lower
order pages around between guests would start to get expensive pretty
quick.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 23:47           ` Alexander Duyck
@ 2019-02-19  2:45             ` Michael S. Tsirkin
  2019-02-19  2:46             ` Andrea Arcangeli
  2019-02-19  8:06             ` David Hildenbrand
  2 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-19  2:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
> > > So far with my patch set that hints at the PMD level w/ THP enabled I
> > > am not really seeing that much overhead for the hypercalls. The bigger
> > > piece that is eating up CPU time is all the page faults and page
> > > zeroing that is going on as we are cycling the memory in and out of
> > > the guest. Some of that could probably be resolved by using MADV_FREE,
> > > but if we are under actual memory pressure I suspect it would behave
> > > similar to MADV_DONTNEED.
> > >
> >
> > MADV_FREE is certainly the better thing to do for hinting in my opinion.
> > It should result in even less overhead. Thanks for the comment about the
> > hypercall overhead.
> 
> Yeah, no problem. The only thing I don't like about MADV_FREE is that
> you have to have memory pressure before the pages really start getting
> scrubbed, which is both a benefit and a drawback. Basically it defers
> the freeing until you are under actual memory pressure so when you hit
> that case things start feeling much slower, that and it limits your
> allocations since the kernel doesn't recognize the pages as free until
> it would have to start trying to push memory to swap.

For sure, if someone *wants* to spend cycles freeing memory,
we could add a system call that does exactly that. There's
no reason to force that onto the same CPU while the VCPU
is stopped, though.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 23:47           ` Alexander Duyck
  2019-02-19  2:45             ` Michael S. Tsirkin
@ 2019-02-19  2:46             ` Andrea Arcangeli
  2019-02-19 12:52               ` Nitesh Narayan Lal
  2019-02-19 16:23               ` Alexander Duyck
  2019-02-19  8:06             ` David Hildenbrand
  2 siblings, 2 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2019-02-19  2:46 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, Michael S. Tsirkin, Nitesh Narayan Lal,
	kvm list, LKML, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang,
	Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk,
	dhildenb

Hello,

On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
> essentially fragmented them. I guess khugepaged went through and
> started trying to reassemble the huge pages and as a result there have
> been apps that ended up consuming more memory than they would have
> otherwise since they were using fragments of THP pages after doing an
> MADV_DONTNEED on sections of the page.

With relatively recent kernels, MADV_DONTNEED doesn't necessarily free
anything when it's applied to a THP subpage; it only splits the
pagetables and queues the THP for deferred splitting. If there's
memory pressure, a shrinker will be invoked and the queue is scanned,
and the THPs are physically split, but to be reassembled/collapsed
after a physical split a THP requires at least one young pte.

If this is particularly problematic for page hinting, this behavior,
where the MADV_DONTNEED can be undone by khugepaged (if some subpage is
being frequently accessed), can be turned off by setting
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none to
0. Then the THP will only be collapsed if all 512 subpages are mapped
(i.e. they've all been re-allocated by the guest).
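For reference, the knob lives in sysfs; a sketch of inspecting and changing it (needs root, and assumes THP/khugepaged are enabled on the running kernel):

```shell
# With the default (511 on a 4 KiB-page kernel), khugepaged may collapse
# a huge page even when most of its 512 PTEs are none, which can
# resurrect memory freed piecemeal with MADV_DONTNEED. Setting it to 0
# collapses only fully mapped ranges.
cat /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
echo 0 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
```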

Regardless of the max_ptes_none default, keeping the smaller guest
buddy orders as the last target for page hinting should be good for
performance.

> Yeah, no problem. The only thing I don't like about MADV_FREE is that
> you have to have memory pressure before the pages really start getting
> scrubbed, which is both a benefit and a drawback. Basically it defers
> the freeing until you are under actual memory pressure so when you hit
> that case things start feeling much slower, that and it limits your
> allocations since the kernel doesn't recognize the pages as free until
> it would have to start trying to push memory to swap.

The guest allocation behavior should not be influenced by MADV_FREE vs
MADV_DONTNEED, the guest can't see the difference anyway, so why
should it limit the allocations?

The benefit of MADV_FREE should be that when the same guest frees and
reallocates a huge amount of RAM (i.e. a guest app allocating and
freeing lots of RAM in a loop, not so uncommon), there will be no KVM
page fault during guest re-allocations. So in absence of memory
pressure in the host it should be a major win. Overall it sounds like
a good tradeoff compared to MADV_DONTNEED that forcefully invokes MMU
notifiers and forces host allocations and KVM page faults in order to
reallocate the same RAM in the same guest.

When there's memory pressure it's up to the host Linux VM to notice
there's plenty of MADV_FREE material to free at zero I/O cost before
starting swapping.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19  0:01                         ` Alexander Duyck
@ 2019-02-19  7:54                           ` David Hildenbrand
  2019-02-19 18:06                             ` Alexander Duyck
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-19  7:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Nitesh Narayan Lal, Michael S. Tsirkin, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 19.02.19 01:01, Alexander Duyck wrote:
> On Mon, Feb 18, 2019 at 1:04 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
>>> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
>>>> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>>>>>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>>>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>>>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>>>>>>> guests.
>>>>>>>>> Yes, I am fine with defining an interface that theoretically lets us
>>>>>>>>> change the implementation in the guest later.
>>>>>>>>> I consider this even a
>>>>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>>>>>>> exactly the same.
>>>>>>>>>
>>>>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>>>>>>> outlined here, doing it without additional threads at least avoids us
>>>>>>>>> for now having to think about dynamic data structures and that we can
>>>>>>>>> sometimes not report "because the thread is still busy reporting or
>>>>>>>>> wasn't scheduled yet".
>>>>>>>> Sorry I wasn't clear. I think we need ability to change the
>>>>>>>> implementation in the *host* later. IOW don't rely on
>>>>>>>> host being synchronous.
>>>>>>>>
>>>>>>>>
>>>>>>> I actually misread it :) . In any way, there has to be a mechanism to
>>>>>>> synchronize.
>>>>>>>
>>>>>>> If we are going via a bare hypercall (like s390x, like what Alexander
>>>>>>> proposes), it is going to be a synchronous interface either way. Just a
>>>>>>> bare hypercall, there will not really be any blocking on the guest side.
>>>>>> It bothers me that we are now tied to interface being synchronous. We
>>>>>> won't be able to fix it if there's an issue as that would break guests.
>>>>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>>>>
>>>>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>>>>> know it has been working for years) and bad (we are inheriting the same
>>>>> problem class, if it exists). And being synchronous is part of the
>>>>> approach for now.
>>>> BTW on s390 are these hypercalls handled by Linux?
>>>>
>>>>> I tend to focus on the first part (we don't know anything besides it is
>>>>> working) while you focus on the second part (there could be a potential
>>>>> problem). Having a real problem at hand would be great, then we would
>>>>> know what exactly we actually have to fix. But read below.
>>>> If we end up doing a hypercall per THP, maybe we could at least
>>>> not block with interrupts disabled? Poll in guest until
>>>> hypervisor reports it's done?  That would already be an
>>>> improvement IMHO. E.g. perf within guest will point you
>>>> in the right direction and towards disabling hinting.
>>>>
>>>>
>>>>>>> Via virtio, I guess it is waiting for a response to a request, right?
>>>>>> For the buffer to be used, yes. And it could mean putting some pages
>>>>>> aside until hypervisor is done with them. Then you don't need timers or
>>>>>> tricks like this, you can get an interrupt and start using the memory.
>>>>> I am very open to such an approach as long as we can make it work and it
>>>>> is not too complicated. (-> simple)
>>>>>
>>>>> This would mean for example
>>>>>
>>>>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>>>>> 256/512.
>>>>>
>>>>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>>>>> action" and report them to the hypervisor via virtio. Let the VCPU
>>>>> continue. This will require some memory to store the request. Small
>>>>> hiccup for the VCPU to kick off the reporting to the hypervisor.
>>>>>
>>>>> 3. On interrupt/response, go over the response and put the pages back to
>>>>> the buddy.
>>>>>
>>>>> (assuming that reporting a bulk of frees is better than reporting every
>>>>> single free obviously)
>>>>>
>>>>> This could allow nice things like "when OOM gets triggered, see if pages
>>>>> are currently being reported and wait until they have been put back to
>>>>> the buddy, return "new pages available", so in a real "low on memory"
>>>>> scenario, no OOM killer would get involved. This could address the issue
>>>>> Wei had with reporting when low on memory.
>>>>>
>>>>> Is that something you have in mind?
>>>> Yes that seems more future proof I think.
>>>>
>>>>> I assume we would have to allocate
>>>>> memory when crafting the new requests. This is the only reason I tend to
>>>>> prefer a synchronous interface for now. But if allocation is not a
>>>>> problem, great.
>>>> There are two main ways to avoid allocation:
>>>> 1. do not add extra data on top of each chunk passed
>>> If I am not wrong then this is close to what we have right now.
>>
>> Yes, minus the kthread(s) and eventually with some sort of memory
>> allocation for the request. Once you're asynchronous via a notification
>> mechanism, there is no real need for a thread anymore, hopefully.
>>
>>> One issue I see right now is that I am polling while host is freeing the
>>> memory.
>>> In the next version I could tie the logic which returns pages to the
>>> buddy and resets the per cpu array index value to 0 with the callback.
>>> (i.e., it happens once we receive a response from the host)
>>
>> The question is, what happens when freeing pages and the array is not
>> ready to be reused yet. In that case, you want to somehow continue
>> freeing pages without busy waiting or eventually not reporting pages.
>>
>> The callback should put the pages back to the buddy and free the request
>> eventually to have a fully asynchronous mechanism.
>>
>>> Other change which I am testing right now is to only capture 'MAX_ORDER
>>
>> I am not sure if this is an arbitrary number we came up with here. We
>> should really play with different orders to find a hot spot. I wouldn't
>> consider this high priority, though. Getting the whole concept right to
>> be able to deal with any magic number we come up should be the ultimate
>> goal. (stuff that only works with huge pages I consider not future
>> proof, especially regarding fragmented guests which can happen easily)
> 
> This essentially just ends up being another trade-off of CPU versus
> memory though. Assuming we aren't using THP we are going to take a
> penalty in terms of performance but could then free individual pages
> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
> much higher in general even without the hinting. I figure for x86 we
> probably don't have too many options since if I am not mistaken
> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.

THP is an implementation detail in the hypervisor. Yes, it is the common
case on x86. But it is e.g. not available on s390x yet. And we also want
this mechanism to work on s390x (e.g. for nested virtualization setups
as discussed).

If we e.g. report any granularity after merging was done in the buddy,
we could end up reporting everything from page size up to order MAX_ORDER - 1;
the hypervisor could ignore hints below a certain magic number, if it
makes its life easier.

> 
> As far as fragmentation my thought is that we may want to look into
> adding support to the guest for prioritizing defragmentation on pages
> lower than THP size. Then that way we could maintain the higher
> overall performance with or without the hinting since shuffling lower
> order pages around between guests would start to get expensive pretty
> quick.

My take would be, design an interface/mechanism that allows any kind of
granularity. You can then balance between CPU overhead and space shifting.

I feel like repeating myself, but on s390x hinting is done on page
granularity, and I have never heard somebody say "how can I turn it off,
this is slowing down my system too much.". All we know is that one
hypercall per free is most probably not acceptable. We really have to
play with the numbers.

I tend to like an asynchronous reporting approach as discussed in this
thread; we would have to see if Nitesh could get it implemented.

Thanks Alexander!

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 23:47           ` Alexander Duyck
  2019-02-19  2:45             ` Michael S. Tsirkin
  2019-02-19  2:46             ` Andrea Arcangeli
@ 2019-02-19  8:06             ` David Hildenbrand
  2019-02-19 14:40               ` Michael S. Tsirkin
  2 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-19  8:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 19.02.19 00:47, Alexander Duyck wrote:
> On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 18.02.19 18:31, Alexander Duyck wrote:
>>> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
>>>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>>>>>> It would be worth a try. My feeling is that a synchronous report after
>>>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>>>>>> s390x. (basically always enabled, nobody complains).
>>>>>
>>>>> What slips under the radar on an arch like s390 might
>>>>> raise issues for a popular arch like x86. My fear would be
>>>>> if it's only a problem e.g. for realtime. Then you get
>>>>> a condition that's very hard to trigger and affects
>>>>> worst case latencies.
>>>>
>>>> Realtime should never use free page hinting. Just like it should never
>>>> use ballooning. Just like it should pin all pages in the hypervisor.
>>>>
>>>>>
>>>>> But really what business has something that is supposedly
>>>>> an optimization blocking a VCPU? We are just freeing up
>>>>> lots of memory why is it a good idea to slow that
>>>>> process down?
>>>>
>>>> I first want to know that it is a problem before we declare it a
>>>> problem. I provided an example (s390x) where it does not seem to be a
>>>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>>>
>>>> Not trying to deny that it could be a problem on x86, but then I assume
>>>> it is only a problem in specific setups.
>>>>
>>>> I would much rather prefer a simple solution that can eventually be
>>>> disabled in selected setups than a complicated solution that tries to fit
>>>> all possible setups. Realtime is one of the examples where such stuff is
>>>> to be disabled either way.
>>>>
>>>> Optimization of space comes with a price (here: execution time).
>>>
>>> One thing to keep in mind though is that if you are already having to
>>> pull pages in and out of swap on the host in order to be able to provide
>>> enough memory for the guests the free page hinting should be a
>>> significant win in terms of performance.
>>
>> Indeed. And also we are in a virtualized environment already, we can
>> have any kind of sudden hiccups. (again, realtime has special
>> requirements on the setup)
>>
>> Side note: I like your approach because it is simple. I don't like your
>> approach because it cannot deal with fragmented memory. And that can
>> happen easily.
>>
>> The idea I described here can similarly be an extension of your
>> approach, merging in a "batched reporting" Nitesh proposed, so we can
>> report on something < MAX_ORDER, similar to s390x. In the end it boils
>> down to reporting via hypercall vs. reporting via virtio. The main point
>> is that it is synchronous and batched. (and that we properly take care
>> of the race between host freeing and guest allocation)
> 
> I'd say the discussion is even simpler than that. My concern is more
> synchronous versus asynchronous. I honestly think the cost for a
> synchronous call is being overblown and we are likely to see the fault
> and zeroing of pages cost more than the hypercall or virtio
> transaction itself.

The overhead of page faults and zeroing should be mitigated by
MADV_FREE, as Andrea correctly stated (thanks!). Then, the call overhead
(context switch) becomes relevant.

We have various discussions now :) And I think they are related.

synchronous versus asynchronous
batched vs. non-batched
MAX_ORDER - 1 vs. other/none magic number

1. synchronous call without batching on every kfree is bad. The
interface is fixed to big magic numbers, otherwise we end up having a
hypercall on every kfree. This is your approach.

2. asynchronous calls without batching would most probably have similar
problems with small granularities as we had when ballooning without
batching. Just overhead we can avoid.

3. synchronous and batched is what s390x does. It can deal with page
granularity. It is what I initially described in this sub-thread.

4. asynchronous and batched. This is the other approach we discussed
yesterday. If we can get it implemented, I would be interested in
performance numbers.

As far as I understood, Michael seems to favor something like 4 (and I
assume eventually 2 if it is similarly fast). I am a friend of either 3
or 4.

> 
> Also one reason why I am not a fan of working with anything less than
> PMD order is because there have been issues in the past with false
> memory leaks being created when hints were provided on THP pages that
> essentially fragmented them. I guess khugepaged went through and
> started trying to reassemble the huge pages and as a result there have
> been apps that ended up consuming more memory than they would have
> otherwise since they were using fragments of THP pages after doing an
> MADV_DONTNEED on sections of the page.

I understand your concerns, but we should not let bugs in the hypervisor
dictate the design. Bugs are there to be fixed. Interesting read,
though, thanks!

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-18 21:04                       ` David Hildenbrand
  2019-02-19  0:01                         ` Alexander Duyck
@ 2019-02-19 12:47                         ` Nitesh Narayan Lal
  2019-02-19 13:03                           ` David Hildenbrand
  1 sibling, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-19 12:47 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange,
	Alexander Duyck


[-- Attachment #1.1: Type: text/plain, Size: 6826 bytes --]


On 2/18/19 4:04 PM, David Hildenbrand wrote:
> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
>> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>>>>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>>>>>> guests.
>>>>>>>> Yes, I am fine with defining an interface that theoretically let's us
>>>>>>>> change the implementation in the guest later.
>>>>>>>> I consider this even a
>>>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>>>>>> exactly the same.
>>>>>>>>
>>>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>>>>>> outlined here, doing it without additional threads at least avoids us
>>>>>>>> for now having to think about dynamic data structures and that we can
>>>>>>>> sometimes not report "because the thread is still busy reporting or
>>>>>>>> wasn't scheduled yet".
>>>>>>> Sorry I wasn't clear. I think we need ability to change the
>>>>>>> implementation in the *host* later. IOW don't rely on
>>>>>>> host being synchronous.
>>>>>>>
>>>>>>>
>>>>>> I actually misread it :) . In any way, there has to be a mechanism to
>>>>>> synchronize.
>>>>>>
>>>>>> If we are going via a bare hypercall (like s390x, like what Alexander
>>>>>> proposes), it is going to be a synchronous interface either way. Just a
>>>>>> bare hypercall, there will not really be any blocking on the guest side.
>>>>> It bothers me that we are now tied to interface being synchronous. We
>>>>> won't be able to fix it if there's an issue as that would break guests.
>>>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>>>
>>>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>>>> know it has been working for years) and bad (we are inheriting the same
>>>> problem class, if it exists). And being synchronous is part of the
>>>> approach for now.
>>> BTW on s390 are these hypercalls handled by Linux?
>>>
>>>> I tend to focus on the first part (we don't know anything besides it is
>>>> working) while you focus on the second part (there could be a potential
>>>> problem). Having a real problem at hand would be great, then we would
>>>> know what exactly we actually have to fix. But read below.
>>> If we end up doing a hypercall per THP, maybe we could at least
>>> not block with interrupts disabled? Poll in guest until
>>> hypervisor reports its done?  That would already be an
>>> improvement IMHO. E.g. perf within guest will point you
>>> in the right direction and towards disabling hinting.
>>>
>>>
>>>>>> Via virtio, I guess it is waiting for a response to a requests, right?
>>>>> For the buffer to be used, yes. And it could mean putting some pages
>>>>> aside until hypervisor is done with them. Then you don't need timers or
>>>>> tricks like this, you can get an interrupt and start using the memory.
>>>> I am very open to such an approach as long as we can make it work and it
>>>> is not too complicated. (-> simple)
>>>>
>>>> This would mean for example
>>>>
>>>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>>>> 256/512.
>>>>
>>>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>>>> action" and report them to the hypervisor via virtio. Let the VCPU
>>>> continue. This will require some memory to store the request. Small
>>>> hickup for the VCPU to kick of the reporting to the hypervisor.
>>>>
>>>> 3. On interrupt/response, go over the response and put the pages back to
>>>> the buddy.
>>>>
>>>> (assuming that reporting a bulk of frees is better than reporting every
>>>> single free obviously)
>>>>
>>>> This could allow nice things like "when OOM gets trigger, see if pages
>>>> are currently being reported and wait until they have been put back to
>>>> the buddy, return "new pages available", so in a real "low on memory"
>>>> scenario, no OOM killer would get involved. This could address the issue
>>>> Wei had with reporting when low on memory.
>>>>
>>>> Is that something you have in mind?
>>> Yes that seems more future proof I think.
>>>
>>>> I assume we would have to allocate
>>>> memory when crafting the new requests. This is the only reason I tend to
>>>> prefer a synchronous interface for now. But if allocation is not a
>>>> problem, great.
>>> There are two main ways to avoid allocation:
>>> 1. do not add extra data on top of each chunk passed
>> If I am not wrong then this is close to what we have right now.
> Yes, minus the kthread(s) and eventually with some sort of memory
> allocation for the request. Once you're asynchronous via a notification
> mechanism, there is no real need for a thread anymore, hopefully.
Whether we should go with a kthread or without one, I would like to do
some performance comparisons before commenting on this.
>
>> One issue I see right now is that I am polling while host is freeing the
>> memory.
>> In the next version I could tie the logic which returns pages to the
>> buddy and resets the per cpu array index value to 0 with the callback.
>> (i.e., it happens once we receive a response from the host)
> The question is, what happens when freeing pages and the array is not
> ready to be reused yet. In that case, you want to somehow continue
> freeing pages without busy waiting or eventually not reporting pages.
This is what happens right now.
Having a kthread or not should not affect this behavior.
When the array is full, the current approach simply skips collecting the
free pages.
>
> The callback should put the pages back to the buddy and free the request
> eventually to have a fully asynchronous mechanism.
>
>> Other change which I am testing right now is to only capture 'MAX_ORDER
> I am not sure if this is an arbitrary number we came up with here. We
> should really play with different orders to find a hot spot. I wouldn't
> consider this high priority, though. Getting the whole concept right to
> be able to deal with any magic number we come up should be the ultimate
> goal. (stuff that only works with huge pages I consider not future
> proof, especially regarding fragmented guests which can happen easily)
It's quite possible that when we are only capturing MAX_ORDER - 1 and we
run a specific workload, we don't get any memory back until we re-run the
program and the buddy finally starts merging pages of order MAX_ORDER - 1.
This is why I think we may want to make this configurable at compile time
and keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.
>
-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19  2:46             ` Andrea Arcangeli
@ 2019-02-19 12:52               ` Nitesh Narayan Lal
  2019-02-19 16:23               ` Alexander Duyck
  1 sibling, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-19 12:52 UTC (permalink / raw)
  To: Andrea Arcangeli, Alexander Duyck
  Cc: David Hildenbrand, Michael S. Tsirkin, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb


[-- Attachment #1.1: Type: text/plain, Size: 2934 bytes --]


On 2/18/19 9:46 PM, Andrea Arcangeli wrote:
> Hello,
>
> On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
>> essentially fragmented them. I guess hugepaged went through and
>> started trying to reassemble the huge pages and as a result there have
>> been apps that ended up consuming more memory than they would have
>> otherwise since they were using fragments of THP pages after doing an
>> MADV_DONTNEED on sections of the page.
> With relatively recent kernels MADV_DONTNEED doesn't necessarily free
> anything when it's applied to a THP subpage, it only splits the
> pagetables and queues the THP for deferred splitting. If there's
> memory pressure a shrinker will be invoked and the queue is scanned
> and the THPs are physically splitted, but to be reassembled/collapsed
> after a physical split it requires at least one young pte.
>
> If this is particularly problematic for page hinting, this behavior
> where the MADV_DONTNEED can be undoed by khugepaged (if some subpage is
> being frequently accessed), can be turned off by setting
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none to
> 0. Then the THP will only be collapsed if all 512 subpages are mapped
> (i.e. they've all be re-allocated by the guest).
>
> Regardless of the max_ptes_none default, keeping the smaller guest
> buddy orders as the last target for page hinting should be good for
> performance.
>
>> Yeah, no problem. The only thing I don't like about MADV_FREE is that
>> you have to have memory pressure before the pages really start getting
>> scrubbed with is both a benefit and a drawback. Basically it defers
>> the freeing until you are under actual memory pressure so when you hit
>> that case things start feeling much slower, that and it limits your
>> allocations since the kernel doesn't recognize the pages as free until
>> it would have to start trying to push memory to swap.
> The guest allocation behavior should not be influenced by MADV_FREE vs
> MADV_DONTNEED, the guest can't see the difference anyway, so why
> should it limit the allocations?
>
> The benefit of MADV_FREE should be that when the same guest frees and
> reallocates an huge amount of RAM (i.e. guest app allocating and
> freeing lots of RAM in a loop, not so uncommon), there will be no KVM
> page fault during guest re-allocations. So in absence of memory
> pressure in the host it should be a major win. Overall it sounds like
> a good tradeoff compared to MADV_DONTNEED that forcefully invokes MMU
> notifiers and forces host allocations and KVM page faults in order to
> reallocate the same RAM in the same guest.
This does make sense.
Thanks for explaining this.
>
> When there's memory pressure it's up to the host Linux VM to notice
> there's plenty of MADV_FREE material to free at zero I/O cost before
> starting swapping.
>
> Thanks,
> Andrea
-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 12:47                         ` Nitesh Narayan Lal
@ 2019-02-19 13:03                           ` David Hildenbrand
  2019-02-19 14:17                             ` Nitesh Narayan Lal
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-19 13:03 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Michael S. Tsirkin
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange,
	Alexander Duyck

>>>> There are two main ways to avoid allocation:
>>>> 1. do not add extra data on top of each chunk passed
>>> If I am not wrong then this is close to what we have right now.
>> Yes, minus the kthread(s) and eventually with some sort of memory
>> allocation for the request. Once you're asynchronous via a notification
>> mechanism, there is no real need for a thread anymore, hopefully.
> Whether we should go with kthread or without it, I would like to do some
> performance comparison before commenting on this.
>>
>>> One issue I see right now is that I am polling while host is freeing the
>>> memory.
>>> In the next version I could tie the logic which returns pages to the
>>> buddy and resets the per cpu array index value to 0 with the callback.
>>> (i.e., it happens once we receive a response from the host)
>> The question is, what happens when freeing pages and the array is not
>> ready to be reused yet. In that case, you want to somehow continue
>> freeing pages without busy waiting or eventually not reporting pages.
> This is what happens right now.
> Having kthread or not should not effect this behavior.
> When the array is full the current approach simply skips collecting the
> free pages.

Well, it somehow does affect your implementation. If you have a kthread
you always have to synchronize against the VCPU: "Is the pcpu array
ready to be used again".

Once you do it asynchronously from your VCPU without another thread
being involved, such synchronization is not required. Simply prepare a
request and send it off. Reuse the pcpu array instantly. At least that's
the theory :)

If you have a guest bulk freeing a lot of memory, I guess temporarily
dropping free page hints could be counter-productive. It really depends
on how fast the thread gets scheduled and how long the hinting process
takes. Having another thread involved might add a lot of latency to
that formula. We'll have to measure, but my gut feeling is that once we
do stuff asynchronously, there is no need for a thread anymore.

>>
>> The callback should put the pages back to the buddy and free the request
>> eventually to have a fully asynchronous mechanism.
>>
>>> Other change which I am testing right now is to only capture 'MAX_ORDER
>> I am not sure if this is an arbitrary number we came up with here. We
>> should really play with different orders to find a hot spot. I wouldn't
>> consider this high priority, though. Getting the whole concept right to
>> be able to deal with any magic number we come up should be the ultimate
>> goal. (stuff that only works with huge pages I consider not future
>> proof, especially regarding fragmented guests which can happen easily)
> Its quite possible that when we are only capturing MAX_ORDER - 1 and run
> a specific workload we don't get any memory back until we re-run the
> program and buddy finally starts merging of pages of order MAX_ORDER -1.
> This is why I think we may make this configurable from compile time and
> keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.

Eventually, some pages will never get merged. Assume you have 1 page of
a MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via
kmalloc). You then skip reporting that chunk completely. Roughly
1mb/2mb/4mb wasted (depending on the arch). This stuff can add up.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 13:03                           ` David Hildenbrand
@ 2019-02-19 14:17                             ` Nitesh Narayan Lal
  2019-02-19 14:21                               ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-19 14:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange,
	Alexander Duyck, Michael S. Tsirkin


[-- Attachment #1.1: Type: text/plain, Size: 4334 bytes --]

On 2/19/19 8:03 AM, David Hildenbrand wrote:
>>>>> There are two main ways to avoid allocation:
>>>>> 1. do not add extra data on top of each chunk passed
>>>> If I am not wrong then this is close to what we have right now.
>>> Yes, minus the kthread(s) and eventually with some sort of memory
>>> allocation for the request. Once you're asynchronous via a notification
>>> mechanism, there is no real need for a thread anymore, hopefully.
>> Whether we should go with kthread or without it, I would like to do some
>> performance comparison before commenting on this.
>>>> One issue I see right now is that I am polling while host is freeing the
>>>> memory.
>>>> In the next version I could tie the logic which returns pages to the
>>>> buddy and resets the per cpu array index value to 0 with the callback.
>>>> (i.e., it happens once we receive a response from the host)
>>> The question is, what happens when freeing pages and the array is not
>>> ready to be reused yet. In that case, you want to somehow continue
>>> freeing pages without busy waiting or eventually not reporting pages.
>> This is what happens right now.
>> Having a kthread or not should not affect this behavior.
>> When the array is full the current approach simply skips collecting the
>> free pages.
> Well, it somehow does affect your implementation. If you have a kthread
> you always have to synchronize against the VCPU: "Is the pcpu array
> ready to be used again".
>
> Once you do it asynchronously from your VCPU without another thread
> being involved, such synchronization is not required. Simply prepare a
> request and send it off. Reuse the pcpu array instantly. At least that's
> the theory :)
>
> If you have a guest bulk freeing a lot of memory, I guess temporarily
> dropping free page hints could be counter-productive. It really depends
> on how fast the thread gets scheduled and how long the hinting process
> takes. Having another thread involved might add a lot of latency to
> that formula. We'll have to measure, but my gut feeling is that once we
> do stuff asynchronously, there is no need for a thread anymore.
This is true.
>
>>> The callback should put the pages back to the buddy and free the request
>>> eventually to have a fully asynchronous mechanism.
>>>
>>>> Other change which I am testing right now is to only capture 'MAX_ORDER
>>> I am not sure if this is an arbitrary number we came up with here. We
>>> should really play with different orders to find a hot spot. I wouldn't
>>> consider this high priority, though. Getting the whole concept right to
>>> be able to deal with any magic number we come up should be the ultimate
>>> goal. (stuff that only works with huge pages I consider not future
>>> proof, especially regarding fragmented guests which can happen easily)
>> It's quite possible that when we are only capturing MAX_ORDER - 1 and we
>> run a specific workload, we don't get any memory back until we re-run the
>> program and the buddy finally starts merging pages of order MAX_ORDER - 1.
>> This is why I think we may want to make this configurable at compile time
>> and keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.
> Eventually pages will never get merged. Assume you have 1 page of a
> MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via
> kmalloc). You skip reporting that chunk completely. Roughly 1mb/2mb/4mb
> wasted (depending on the arch). This stuff can sum up.

After the discussion, here are the changes I am planning to work on
next:
1. Get rid of the kthread and dynamically allocate a per-cpu array to
hold the isolated pages. As soon as the initial per-cpu array is
completely scanned, release it so that we don't end up blocking anything.
2. Continue capturing MAX_ORDER - 1 for now, and reduce the initial
per-cpu array size to 256. As we are doing asynchronous reporting, we
should be fine with a smaller array.
3. As soon as the host responds, release the pages back to the buddy
from the callback and free the request.

Benefits wrt the current implementation:
1. We will not eat up performance due to the kernel thread.
2. We will still be reporting asynchronously => no blocking.
3. Hopefully, we will be able to free more memory.
-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 14:17                             ` Nitesh Narayan Lal
@ 2019-02-19 14:21                               ` David Hildenbrand
  0 siblings, 0 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-19 14:21 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, pbonzini, lcapitulino, pagupta, wei.w.wang,
	yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange,
	Alexander Duyck, Michael S. Tsirkin

On 19.02.19 15:17, Nitesh Narayan Lal wrote:
> On 2/19/19 8:03 AM, David Hildenbrand wrote:
>>>>>> There are two main ways to avoid allocation:
>>>>>> 1. do not add extra data on top of each chunk passed
>>>>> If I am not wrong then this is close to what we have right now.
>>>> Yes, minus the kthread(s) and eventually with some sort of memory
>>>> allocation for the request. Once you're asynchronous via a notification
>>>> mechanism, there is no real need for a thread anymore, hopefully.
>>> Whether we should go with kthread or without it, I would like to do some
>>> performance comparison before commenting on this.
>>>>> One issue I see right now is that I am polling while host is freeing the
>>>>> memory.
>>>>> In the next version I could tie the logic which returns pages to the
>>>>> buddy and resets the per cpu array index value to 0 with the callback.
>>>>> (i.e., it happens once we receive a response from the host)
>>>> The question is, what happens when freeing pages and the array is not
>>>> ready to be reused yet. In that case, you want to somehow continue
>>>> freeing pages without busy waiting or eventually not reporting pages.
>>> This is what happens right now.
>>> Having a kthread or not should not affect this behavior.
>>> When the array is full the current approach simply skips collecting the
>>> free pages.
>> Well, it somehow does affect your implementation. If you have a kthread
>> you always have to synchronize against the VCPU: "Is the pcpu array
>> ready to be used again".
>>
>> Once you do it asynchronously from your VCPU without another thread
>> being involved, such synchronization is not required. Simply prepare a
>> request and send it off. Reuse the pcpu array instantly. At least that's
>> the theory :)
>>
>> If you have a guest bulk freeing a lot of memory, I guess temporarily
>> dropping free page hints could be counter-productive. It really depends
>> on how fast the thread gets scheduled and how long the hinting process
>> takes. Having another thread involved might add a lot of latency to
>> that formula. We'll have to measure, but my gut feeling is that once we
>> do stuff asynchronously, there is no need for a thread anymore.
> This is true.
>>
>>>> The callback should put the pages back to the buddy and free the request
>>>> eventually to have a fully asynchronous mechanism.
>>>>
>>>>> Other change which I am testing right now is to only capture 'MAX_ORDER
>>>> I am not sure if this is an arbitrary number we came up with here. We
>>>> should really play with different orders to find a hot spot. I wouldn't
>>>> consider this high priority, though. Getting the whole concept right to
>>>> be able to deal with any magic number we come up should be the ultimate
>>>> goal. (stuff that only works with huge pages I consider not future
>>>> proof, especially regarding fragmented guests which can happen easily)
>>> Its quite possible that when we are only capturing MAX_ORDER - 1 and run
>>> a specific workload we don't get any memory back until we re-run the
>>> program and buddy finally starts merging of pages of order MAX_ORDER -1.
>>> This is why I think we may make this configurable from compile time and
>>> keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.
>> Eventually pages will never get merged. Assume you have 1 page of a
>> MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via
>> kmalloc). You skip reporting that chunk completely. Roughly 1mb/2mb/4mb
>> wasted (depending on the arch). This stuff can sum up.
> 
> After the discussion, here are the changes I am planning to work on
> next:
> 1. Get rid of the kthread and dynamically allocate a per-cpu array to
> hold the isolated pages. As soon as the initial per-cpu array is
> completely scanned, release it so that we don't end up blocking anything.
> 2. Continue capturing MAX_ORDER - 1 for now, and reduce the initial
> per-cpu array size to 256. As we are doing asynchronous reporting, we
> should be fine with a smaller array.
> 3. As soon as the host responds, release the pages back to the buddy
> from the callback and free the request.
> 
> Benefits wrt the current implementation:
> 1. We will not eat up performance due to the kernel thread.
> 2. We will still be reporting asynchronously => no blocking.
> 3. Hopefully, we will be able to free more memory.
> 

+1 to that approach. We can fine-tune the numbers (array size, sizes to
report) easily later on. Let us know if you run into problems doing the
allocation for the request. If that is a blocker, all we are left with is
a synchronous approach, I guess. Let's cross fingers. :)

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19  8:06             ` David Hildenbrand
@ 2019-02-19 14:40               ` Michael S. Tsirkin
  2019-02-19 14:44                 ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-19 14:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Feb 19, 2019 at 09:06:01AM +0100, David Hildenbrand wrote:
> On 19.02.19 00:47, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 18.02.19 18:31, Alexander Duyck wrote:
> >>> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> >>>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >>>>>> It would be worth a try. My feeling is that a synchronous report after
> >>>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >>>>>> s390x. (basically always enabled, nobody complains).
> >>>>>
> >>>>> What slips under the radar on an arch like s390 might
> >>>>> raise issues for a popular arch like x86. My fear would be
> >>>>> if it's only a problem e.g. for realtime. Then you get
> >>>>> a condition that's very hard to trigger and affects
> >>>>> worst case latencies.
> >>>>
> >>>> Realtime should never use free page hinting. Just like it should never
> >>>> use ballooning. Just like it should pin all pages in the hypervisor.
> >>>>
> >>>>>
> >>>>> But really what business has something that is supposedly
> >>>>> an optimization blocking a VCPU? We are just freeing up
> >>>>> lots of memory why is it a good idea to slow that
> >>>>> process down?
> >>>>
> >>>> I first want to know that it is a problem before we declare it a
> >>>> problem. I provided an example (s390x) where it does not seem to be a
> >>>> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>>>
> >>>> No trying to deny that it could be a problem on x86, but then I assume
> >>>> it is only a problem in specific setups.
> >>>>
> >>>> I would much rather prefer a simple solution that can eventually be
> >>>> disabled in selected setup than a complicated solution that tries to fit
> >>>> all possible setups. Realtime is one of the examples where such stuff is
> >>>> to be disabled either way.
> >>>>
> >>>> Optimization of space comes with a price (here: execution time).
> >>>
> >>> One thing to keep in mind though is that if you are already having to
> >>> pull pages in and out of swap on the host in order be able to provide
> >>> enough memory for the guests the free page hinting should be a
> >>> significant win in terms of performance.
> >>
> >> Indeed. And also we are in a virtualized environment already, we can
> >> have any kind of sudden hickups. (again, realtime has special
> >> requirements on the setup)
> >>
> >> Side note: I like your approach because it is simple. I don't like your
> >> approach because it cannot deal with fragmented memory. And that can
> >> happen easily.
> >>
> >> The idea I described here can be similarly be an extension of your
> >> approach, merging in a "batched reporting" Nitesh proposed, so we can
> >> report on something < MAX_ORDER, similar to s390x. In the end it boils
> >> down to reporting via hypercall vs. reporting via virtio. The main point
> >> is that it is synchronous and batched. (and that we properly take care
> >> of the race between host freeing and guest allocation)
> > 
> > I'd say the discussion is even simpler then that. My concern is more
> > synchronous versus asynchronous. I honestly think the cost for a
> > synchronous call is being overblown and we are likely to see the fault
> > and zeroing of pages cost more than the hypercall or virtio
> > transaction itself.
> 
> The overhead of page faults and zeroing should be mitigated by
> MADV_FREE, as Andrea correctly stated (thanks!). Then, the call overhead
> (context switch) becomes relevant.
> 
> We have various discussions now :) And I think they are related.
> 
> synchronous versus asynchronous
> batched vs. non-batched
> MAX_ORDER - 1 vs. other/none magic number
> 
> 1. synchronous call without batching on every kfree is bad. The
> interface is fixed to big magic numbers, otherwise we end up having a
> hypercall on every kfree. This is your approach.
> 
> 2. asynchronous calls without batching would most probably have similar
> problems with small granularities as we had when ballooning without
> batching. Just overhead we can avoid.
> 
> 3. synchronous and batched is what s390x does. It can deal with page
> granularity. It is what I initially described in this sub-thread.
> 
> 4. asynchronous and batched. This is the other approach we discussed
> yesterday. If we can get it implemented, I would be interested in
> performance numbers.
> 
> As far as I understood, Michael seems to favor something like 4 (and I
> assume eventually 2 if it is similarly fast). I am a friend of either 3
> or 4.

Well, Linus said big granularity is important for Linux MM
and not to bother with hinting small sizes.

Alex said the cost of a hypercall is dwarfed by a page fault
after alloc. I would be curious whether async page faults
can help things somehow, though.

> > 
> > Also one reason why I am not a fan of working with anything less than
> > PMD order is because there have been issues in the past with false
> > memory leaks being created when hints were provided on THP pages that
> > essentially fragmented them. I guess hugepaged went through and
> > started trying to reassemble the huge pages and as a result there have
> > been apps that ended up consuming more memory than they would have
> > otherwise since they were using fragments of THP pages after doing an
> > MADV_DONTNEED on sections of the page.
> 
> I understand your concerns, but we should not let bugs in the hypervisor
> dictate the design. Bugs are there to be fixed. Interesting read,
> though, thanks!

Right, but if we break up a huge page we are then creating
more work for the hypervisor to reassemble it.

> -- 
> 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 14:40               ` Michael S. Tsirkin
@ 2019-02-19 14:44                 ` David Hildenbrand
  2019-02-19 14:45                   ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-19 14:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 19.02.19 15:40, Michael S. Tsirkin wrote:
> On Tue, Feb 19, 2019 at 09:06:01AM +0100, David Hildenbrand wrote:
>> On 19.02.19 00:47, Alexander Duyck wrote:
>>> On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 18.02.19 18:31, Alexander Duyck wrote:
>>>>> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
>>>>>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>>>>>>>> It would be worth a try. My feeling is that a synchronous report after
>>>>>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>>>>>>>> s390x. (basically always enabled, nobody complains).
>>>>>>>
>>>>>>> What slips under the radar on an arch like s390 might
>>>>>>> raise issues for a popular arch like x86. My fear would be
>>>>>>> if it's only a problem e.g. for realtime. Then you get
>>>>>>> a condition that's very hard to trigger and affects
>>>>>>> worst case latencies.
>>>>>>
>>>>>> Realtime should never use free page hinting. Just like it should never
>>>>>> use ballooning. Just like it should pin all pages in the hypervisor.
>>>>>>
>>>>>>>
>>>>>>> But really what business has something that is supposedly
>>>>>>> an optimization blocking a VCPU? We are just freeing up
>>>>>>> lots of memory why is it a good idea to slow that
>>>>>>> process down?
>>>>>>
>>>>>> I first want to know that it is a problem before we declare it a
>>>>>> problem. I provided an example (s390x) where it does not seem to be a
>>>>>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>>>>>
>>>>>> No trying to deny that it could be a problem on x86, but then I assume
>>>>>> it is only a problem in specific setups.
>>>>>>
>>>>>> I would much rather prefer a simple solution that can eventually be
>>>>>> disabled in selected setup than a complicated solution that tries to fit
>>>>>> all possible setups. Realtime is one of the examples where such stuff is
>>>>>> to be disabled either way.
>>>>>>
>>>>>> Optimization of space comes with a price (here: execution time).
>>>>>
>>>>> One thing to keep in mind though is that if you are already having to
>>>>> pull pages in and out of swap on the host in order be able to provide
>>>>> enough memory for the guests the free page hinting should be a
>>>>> significant win in terms of performance.
>>>>
>>>> Indeed. And also we are in a virtualized environment already, we can
>>>> have any kind of sudden hickups. (again, realtime has special
>>>> requirements on the setup)
>>>>
>>>> Side note: I like your approach because it is simple. I don't like your
>>>> approach because it cannot deal with fragmented memory. And that can
>>>> happen easily.
>>>>
>>>> The idea I described here can be similarly be an extension of your
>>>> approach, merging in a "batched reporting" Nitesh proposed, so we can
>>>> report on something < MAX_ORDER, similar to s390x. In the end it boils
>>>> down to reporting via hypercall vs. reporting via virtio. The main point
>>>> is that it is synchronous and batched. (and that we properly take care
>>>> of the race between host freeing and guest allocation)
>>>
>>> I'd say the discussion is even simpler then that. My concern is more
>>> synchronous versus asynchronous. I honestly think the cost for a
>>> synchronous call is being overblown and we are likely to see the fault
>>> and zeroing of pages cost more than the hypercall or virtio
>>> transaction itself.
>>
>> The overhead of page faults and zeroing should be mitigated by
>> MADV_FREE, as Andrea correctly stated (thanks!). Then, the call overhead
>> (context switch) becomes relevant.
>>
>> We have various discussions now :) And I think they are related.
>>
>> synchronous versus asynchronous
>> batched vs. non-batched
>> MAX_ORDER - 1 vs. other/none magic number
>>
>> 1. synchronous call without batching on every kfree is bad. The
>> interface is fixed to big magic numbers, otherwise we end up having a
>> hypercall on every kfree. This is your approach.
>>
>> 2. asynchronous calls without batching would most probably have similar
>> problems with small granularities as we had when ballooning without
>> batching. Just overhead we can avoid.
>>
>> 3. synchronous and batched is what s390x does. It can deal with page
>> granularity. It is what I initially described in this sub-thread.
>>
>> 4. asynchronous and batched. This is the other approach we discussed
>> yesterday. If we can get it implemented, I would be interested in
>> performance numbers.
>>
>> As far as I understood, Michael seems to favor something like 4 (and I
>> assume eventually 2 if it is similarly fast). I am a friend of either 3
>> or 4.
> 
> Well Linus said big granularity is important for linux MM
> and not to bother with hinting small sizes.
> 

For some reason I tend to also challenge the opinions of people way
smarter than me ;) Only the numbers can tell the true story later.
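As a point of reference, option 3 above ("synchronous and batched") fits in a few lines. The sketch below is a hypothetical user-space model only — the names are made up, and in a real guest the buffer would hang off arch_free_page() and report_batch() would be the hypercall:

```c
#include <stddef.h>

#define HINT_BATCH 512          /* one report per 512 frees, as on s390x */

static unsigned long hint_buf[HINT_BATCH];
static size_t hint_fill;
static int hint_reports;        /* counts simulated hypercalls */

/* Stand-in for the synchronous hypercall: the freeing CPU blocks
 * here until the hypervisor has processed the whole batch. */
static void report_batch(const unsigned long *pfns, size_t n)
{
    (void)pfns;
    (void)n;
    hint_reports++;
}

/* Called on every page free; only every HINT_BATCH-th call pays
 * the (simulated) hypercall cost. */
static void hint_free_page(unsigned long pfn)
{
    hint_buf[hint_fill++] = pfn;
    if (hint_fill == HINT_BATCH) {
        report_batch(hint_buf, hint_fill);
        hint_fill = 0;
    }
}
```

With a batch of 512, a burst of 1024 frees costs exactly two synchronous reports; the tunable trade-off is batch size versus how long a freed page waits before being hinted.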

> Alex said the cost of a hypercall is dwarfed by a pagefault
> after alloc. I would be curious whether async pagefault
> can help things somehow though.

Indeed.

> 
>>>
>>> Also one reason why I am not a fan of working with anything less than
>>> PMD order is because there have been issues in the past with false
>>> memory leaks being created when hints were provided on THP pages that
>> essentially fragmented them. I guess khugepaged went through and
>>> started trying to reassemble the huge pages and as a result there have
>>> been apps that ended up consuming more memory than they would have
>>> otherwise since they were using fragments of THP pages after doing an
>>> MADV_DONTNEED on sections of the page.
>>
>> I understand your concerns, but we should not let bugs in the hypervisor
>> dictate the design. Bugs are there to be fixed. Interesting read,
>> though, thanks!
> 
> Right but if we break up a huge page we are then creating
> more work for the hypervisor to reassemble it.

Yes, but the hypervisor can decide what to do. E.g. on s390x there are
no THP, so nothing to break up.

It is all very complicated :)

-- 

Thanks,

David / dhildenb

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 14:44                 ` David Hildenbrand
@ 2019-02-19 14:45                   ` David Hildenbrand
  0 siblings, 0 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-19 14:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

>>>>
>>>> Also one reason why I am not a fan of working with anything less than
>>>> PMD order is because there have been issues in the past with false
>>>> memory leaks being created when hints were provided on THP pages that
>>>> essentially fragmented them. I guess khugepaged went through and
>>>> started trying to reassemble the huge pages and as a result there have
>>>> been apps that ended up consuming more memory than they would have
>>>> otherwise since they were using fragments of THP pages after doing an
>>>> MADV_DONTNEED on sections of the page.
>>>
>>> I understand your concerns, but we should not let bugs in the hypervisor
>>> dictate the design. Bugs are there to be fixed. Interesting read,
>>> though, thanks!
>>
>> Right but if we break up a huge page we are then creating
>> more work for hypervisor to reassemble it.
> 
> Yes, but the hypervisor can decide what to do. E.g. on s390x there are
> no THP, so nothing to break up.

To clarify as that might be confusing: No THP in a KVM guest mapping for
now.

> 
> It is all very complicated :)
> 


-- 

Thanks,

David / dhildenb

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19  2:46             ` Andrea Arcangeli
  2019-02-19 12:52               ` Nitesh Narayan Lal
@ 2019-02-19 16:23               ` Alexander Duyck
  1 sibling, 0 replies; 116+ messages in thread
From: Alexander Duyck @ 2019-02-19 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Hildenbrand, Michael S. Tsirkin, Nitesh Narayan Lal,
	kvm list, LKML, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang,
	Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk,
	dhildenb

On Mon, Feb 18, 2019 at 6:46 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Hello,
>
> On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
> > essentially fragmented them. I guess khugepaged went through and
> > started trying to reassemble the huge pages and as a result there have
> > been apps that ended up consuming more memory than they would have
> > otherwise since they were using fragments of THP pages after doing an
> > MADV_DONTNEED on sections of the page.
>
> With relatively recent kernels MADV_DONTNEED doesn't necessarily free
> anything when it's applied to a THP subpage, it only splits the
> pagetables and queues the THP for deferred splitting. If there's
> memory pressure a shrinker will be invoked and the queue is scanned
> and the THPs are physically split, but to be reassembled/collapsed
> after a physical split they require at least one young pte.
>
> If this is particularly problematic for page hinting, this behavior
> where the MADV_DONTNEED can be undone by khugepaged (if some subpage is
> being frequently accessed), can be turned off by setting
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none to
> 0. Then the THP will only be collapsed if all 512 subpages are mapped
> (i.e. they've all been re-allocated by the guest).
>
> Regardless of the max_ptes_none default, keeping the smaller guest
> buddy orders as the last target for page hinting should be good for
> performance.

Okay, this is good to know. Thanks.
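For anyone wanting to try this, the knob Andrea describes lives in sysfs and can be set with a plain write (root required). A minimal sketch in C — the path is the standard one on THP-enabled kernels, but treat the snippet as illustrative only:

```c
#include <stdio.h>

/* Set khugepaged's max_ptes_none; 0 means a THP is only collapsed
 * when all 512 subpages are mapped again, so an MADV_DONTNEED on a
 * subpage is not silently undone. Returns 0 on success. */
static int set_max_ptes_none(unsigned int val)
{
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/"
                    "khugepaged/max_ptes_none", "w");
    if (!f)
        return -1;   /* no THP support, or not running as root */
    int ok = fprintf(f, "%u\n", val) > 0;
    fclose(f);
    return ok ? 0 : -1;
}
```

Equivalent to `echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none` from a root shell.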

> > Yeah, no problem. The only thing I don't like about MADV_FREE is that
> > you have to have memory pressure before the pages really start getting
> > scrubbed, which is both a benefit and a drawback. Basically it defers
> > the freeing until you are under actual memory pressure so when you hit
> > that case things start feeling much slower, that and it limits your
> > allocations since the kernel doesn't recognize the pages as free until
> > it would have to start trying to push memory to swap.
>
> The guest allocation behavior should not be influenced by MADV_FREE vs
> MADV_DONTNEED, the guest can't see the difference anyway, so why
> should it limit the allocations?

Actually I was talking about the host. So if I have a guest that is
using MADV_FREE what I have to do is create an allocation that would
force us to have to access swap and that in turn ends up triggering
the freeing of the pages that were moved to the "Inactive(file)" list
by the MADV_FREE call.

The only reason it came up is that one of my test systems had a small
swap so I ended up having to do multiple allocations and frees in swap
sized increments to free up memory from a large guest that wasn't in
use.

> The benefit of MADV_FREE should be that when the same guest frees and
> reallocates a huge amount of RAM (i.e. guest app allocating and
> freeing lots of RAM in a loop, not so uncommon), there will be no KVM
> page fault during guest re-allocations. So in absence of memory
> pressure in the host it should be a major win. Overall it sounds like
> a good tradeoff compared to MADV_DONTNEED that forcefully invokes MMU
> notifiers and forces host allocations and KVM page faults in order to
> reallocate the same RAM in the same guest.

Right, and I do see that behavior.

> When there's memory pressure it's up to the host Linux VM to notice
> there's plenty of MADV_FREE material to free at zero I/O cost before
> starting swapping.

Right.

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19  7:54                           ` David Hildenbrand
@ 2019-02-19 18:06                             ` Alexander Duyck
  2019-02-19 18:31                               ` David Hildenbrand
  2019-02-19 19:58                               ` Michael S. Tsirkin
  0 siblings, 2 replies; 116+ messages in thread
From: Alexander Duyck @ 2019-02-19 18:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, Michael S. Tsirkin, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Feb 18, 2019 at 11:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.02.19 01:01, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 1:04 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
> >>> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
> >>>> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
> >>>>>>>>>> So I'm fine with a simple implementation but the interface needs to
> >>>>>>>>>> allow the hypervisor to process hints in parallel while guest is
> >>>>>>>>>> running.  We can then fix any issues on hypervisor without breaking
> >>>>>>>>>> guests.
> >>>>>>>>> Yes, I am fine with defining an interface that theoretically let's us
> >>>>>>>>> change the implementation in the guest later.
> >>>>>>>>> I consider this even a
> >>>>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
> >>>>>>>>> exactly the same.
> >>>>>>>>>
> >>>>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
> >>>>>>>>> outlined here, doing it without additional threads at least avoids us
> >>>>>>>>> for now having to think about dynamic data structures and that we can
> >>>>>>>>> sometimes not report "because the thread is still busy reporting or
> >>>>>>>>> wasn't scheduled yet".
> >>>>>>>> Sorry I wasn't clear. I think we need ability to change the
> >>>>>>>> implementation in the *host* later. IOW don't rely on
> >>>>>>>> host being synchronous.
> >>>>>>>>
> >>>>>>>>
> >>>>>>> I actually misread it :) . In any way, there has to be a mechanism to
> >>>>>>> synchronize.
> >>>>>>>
> >>>>>>> If we are going via a bare hypercall (like s390x, like what Alexander
> >>>>>>> proposes), it is going to be a synchronous interface either way. Just a
> >>>>>>> bare hypercall, there will not really be any blocking on the guest side.
> >>>>>> It bothers me that we are now tied to interface being synchronous. We
> >>>>>> won't be able to fix it if there's an issue as that would break guests.
> >>>>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
> >>>>>
> >>>>> Yes, as I initially wrote, this mimics s390x. That might be good (we
> >>>>> know it has been working for years) and bad (we are inheriting the same
> >>>>> problem class, if it exists). And being synchronous is part of the
> >>>>> approach for now.
> >>>> BTW on s390 are these hypercalls handled by Linux?
> >>>>
> >>>>> I tend to focus on the first part (we don't know anything besides it is
> >>>>> working) while you focus on the second part (there could be a potential
> >>>>> problem). Having a real problem at hand would be great, then we would
> >>>>> know what exactly we actually have to fix. But read below.
> >>>> If we end up doing a hypercall per THP, maybe we could at least
> >>>> not block with interrupts disabled? Poll in guest until
> >>>> hypervisor reports its done?  That would already be an
> >>>> improvement IMHO. E.g. perf within guest will point you
> >>>> in the right direction and towards disabling hinting.
> >>>>
> >>>>
> >>>>>>> Via virtio, I guess it is waiting for a response to a requests, right?
> >>>>>> For the buffer to be used, yes. And it could mean putting some pages
> >>>>>> aside until hypervisor is done with them. Then you don't need timers or
> >>>>>> tricks like this, you can get an interrupt and start using the memory.
> >>>>> I am very open to such an approach as long as we can make it work and it
> >>>>> is not too complicated. (-> simple)
> >>>>>
> >>>>> This would mean for example
> >>>>>
> >>>>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
> >>>>> 256/512.
> >>>>>
> >>>>> 2. Once the buffer is full, do crazy "take pages out of the balloon
> >>>>> action" and report them to the hypervisor via virtio. Let the VCPU
> >>>>> continue. This will require some memory to store the request. Small
> >>>>> hickup for the VCPU to kick of the reporting to the hypervisor.
> >>>>>
> >>>>> 3. On interrupt/response, go over the response and put the pages back to
> >>>>> the buddy.
> >>>>>
> >>>>> (assuming that reporting a bulk of frees is better than reporting every
> >>>>> single free obviously)
> >>>>>
> >>>>> This could allow nice things like "when OOM gets trigger, see if pages
> >>>>> are currently being reported and wait until they have been put back to
> >>>>> the buddy, return "new pages available", so in a real "low on memory"
> >>>>> scenario, no OOM killer would get involved. This could address the issue
> >>>>> Wei had with reporting when low on memory.
> >>>>>
> >>>>> Is that something you have in mind?
> >>>> Yes that seems more future proof I think.
> >>>>
> >>>>> I assume we would have to allocate
> >>>>> memory when crafting the new requests. This is the only reason I tend to
> >>>>> prefer a synchronous interface for now. But if allocation is not a
> >>>>> problem, great.
> >>>> There are two main ways to avoid allocation:
> >>>> 1. do not add extra data on top of each chunk passed
> >>> If I am not wrong then this is close to what we have right now.
> >>
> >> Yes, minus the kthread(s) and eventually with some sort of memory
> >> allocation for the request. Once you're asynchronous via a notification
> >> mechanisnm, there is no real need for a thread anymore, hopefully.
> >>
> >>> One issue I see right now is that I am polling while host is freeing the
> >>> memory.
> >>> In the next version I could tie the logic which returns pages to the
> >>> buddy and resets the per cpu array index value to 0 with the callback.
> >>> (i.e.., it happens once we receive an response from the host)
> >>
> >> The question is, what happens when freeing pages and the array is not
> >> ready to be reused yet. In that case, you want to somehow continue
> >> freeing pages without busy waiting or eventually not reporting pages.
> >>
> >> The callback should put the pages back to the buddy and free the request
> >> eventually to have a fully asynchronous mechanism.
> >>
> >>> Other change which I am testing right now is to only capture 'MAX_ORDER
> >>
> >> I am not sure if this is an arbitrary number we came up with here. We
> >> should really play with different orders to find a hot spot. I wouldn't
> >> consider this high priority, though. Getting the whole concept right to
> >> be able to deal with any magic number we come up should be the ultimate
> >> goal. (stuff that only works with huge pages I consider not future
> >> proof, especially regarding fragmented guests which can happen easily)
> >
> > This essentially just ends up being another trade-off of CPU versus
> > memory though. Assuming we aren't using THP we are going to take a
> > penalty in terms of performance but could then free individual pages
> > less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
> > much higher in general even without the hinting. I figure for x86 we
> > probably don't have too many options since if I am not mistaken
> > MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.
>
> THP is an implementation detail in the hypervisor. Yes, it is the common
> case on x86. But it is e.g. not available on s390x yet. And we also want
> this mechanism to work on s390x (e.g. for nested virtualization setups
> as discussed).
>
> If we e.g. report any granularity after merging was done in the buddy,
> we could end up reporting everything from page size up to MAX_ORDER - 1,
> the hypervisor could ignore hints below a certain magic number, if it
> makes its life easier.

For each architecture we can do a separate implementation of what to
hint on. We already do that for bare metal so why would we have guests
do the same type of hinting in the virtualization case when there are
fundamental differences in page size and features in each
architecture?

This is another reason why I think the hypercall approach is a better
idea since each architecture is likely going to want to handle things
differently and it would be a pain to try and sort that all out in a
virtio driver.

> >
> > As far as fragmentation my thought is that we may want to look into
> > adding support to the guest for prioritizing defragmentation on pages
> > lower than THP size. Then that way we could maintain the higher
> > overall performance with or without the hinting since shuffling lower
> > order pages around between guests would start to get expensive pretty
> > quick.
>
> My take would be, design an interface/mechanism that allows any kind of
> granularity. You can then balance between cpu overhead and space shifting.

The problem with using "any kind of granularity" is that in the case
of memory we are already having problems with 4K pages being deemed
too small of a granularity to be useful for anything and making
operations too expensive.

I'm open to using other page orders for other architectures. Nothing
says we have to stick with THP sized pages for all architectures. I
have just been focused on x86 and this seems like the best fit for the
balance between CPU and freeing of memory for now on that
architecture.

> I feel like repeating myself, but on s390x hinting is done on page
> granularity, and I have never heard somebody say "how can I turn it off,
> this is slowing down my system too much.". All we know is that one
> hypercall per free is most probably not acceptable. We really have to
> play with the numbers.

My thought was we could look at doing different implementations for
other architectures such as s390 and powerPC. Odds are the
implementations would be similar but have slight differences where
appropriate such as what order we should start hinting on, or if we
bypass the hypercall/virtio-balloon for a host native approach if
available.

> I tend to like an asynchronous reporting approach as discussed in this
> thread, we would have to see if Nitesh could get it implemented.

I agree it would be great if it could work. However I have concerns
given that work on this patch set dates back to 2017, major issues
such as working around device assignment have yet to be addressed, and
it seems like most of the effort is being focused on things that in my
opinion are being over-engineered for little to no benefit.

I really think that simpler would be much better in terms of design in
this case.

Thanks.

- Alex

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 18:06                             ` Alexander Duyck
@ 2019-02-19 18:31                               ` David Hildenbrand
  2019-02-19 21:57                                 ` Alexander Duyck
  2019-02-19 19:58                               ` Michael S. Tsirkin
  1 sibling, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-19 18:31 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Nitesh Narayan Lal, Michael S. Tsirkin, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

>>> This essentially just ends up being another trade-off of CPU versus
>>> memory though. Assuming we aren't using THP we are going to take a
>>> penalty in terms of performance but could then free individual pages
>>> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
>>> much higher in general even without the hinting. I figure for x86 we
>>> probably don't have too many options since if I am not mistaken
>>> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.
>>
>> THP is an implementation detail in the hypervisor. Yes, it is the common
>> case on x86. But it is e.g. not available on s390x yet. And we also want
>> this mechanism to work on s390x (e.g. for nested virtualization setups
>> as discussed).
>>
>> If we e.g. report any granularity after merging was done in the buddy,
>> we could end up reporting everything from page size up to MAX_ORDER - 1,
>> the hypervisor could ignore hints below a certain magic number, if it
>> makes its life easier.
> 
> For each architecture we can do a separate implementation of what to
> hint on. We already do that for bare metal so why would we have guests
> do the same type of hinting in the virtualization case when there are
> fundamental differences in page size and features in each
> architecture?
> 
> This is another reason why I think the hypercall approach is a better
> idea since each architecture is likely going to want to handle things
> differently and it would be a pain to try and sort that all out in a
> virtio driver.

I can't follow. We are talking about something as simple as a minimum
page granularity here that can easily be configured. Nothing that
screams for different implementations. But I get your point, we could
tune for different architectures.

> 
>>>
>>> As far as fragmentation my thought is that we may want to look into
>>> adding support to the guest for prioritizing defragmentation on pages
>>> lower than THP size. Then that way we could maintain the higher
>>> overall performance with or without the hinting since shuffling lower
>>> order pages around between guests would start to get expensive pretty
>>> quick.
>>
>> My take would be, design an interface/mechanism that allows any kind of
>> granularity. You can then balance between cpu overhead and space shifting.
> 
> The problem with using "any kind of granularity" is that in the case
> of memory we are already having problems with 4K pages being deemed
> too small of a granularity to be useful for anything and making
> operations too expensive.

No, sorry, s390x does it. And via batch reporting it could work. Not
saying we should do page granularity, but "to be useful for anything" is
just wrong.

> 
> I'm open to using other page orders for other architectures. Nothing
> says we have to stick with THP sized pages for all architectures. I
> have just been focused on x86 and this seems like the best fit for the
> balance between CPU and freeing of memory for now on that
> architecture.
> 
>> I feel like repeating myself, but on s390x hinting is done on page
>> granularity, and I have never heard somebody say "how can I turn it off,
>> this is slowing down my system too much.". All we know is that one
>> hypercall per free is most probably not acceptable. We really have to
>> play with the numbers.
> 
> My thought was we could look at doing different implementations for
> other architectures such as s390 and powerPC. Odds are the
> implementations would be similar but have slight differences where
> appropriate such as what order we should start hinting on, or if we
> bypass the hypercall/virtio-balloon for a host native approach if
> available.
> 
>> I tend to like an asynchronous reporting approach as discussed in this
>> thread, we would have to see if Nitesh could get it implemented.
> 
> I agree it would be great if it could work. However I have concerns
> given that work on this patch set dates back to 2017, major issues
> such as working around device assignment have yet to be addressed, and
> it seems like most of the effort is being focused on things that in my
> opinion are being over-engineered for little to no benefit.

I can understand that you are trying to push your solution. I would do
the same. Again, I don't like a pure synchronous approach that works
one element at a time. Period. Other people might have other opinions.
This is mine - luckily I don't have anything to say here :)

MST also voted for an asynchronous solution if we can make it work.
Nitesh has made significant improvements since 2017. Complicated stuff
needs time. No need to rush. People have been talking about free page
hinting since 2006. I talked to various people that experimented with
bitmap based solutions two years ago.

So much to that, if you think your solution is the way to go, please
follow up on it. Nitesh seems to have decided to look into the
asynchronous approach you also called "great if it could work". As long
as we don't run into elementary blockers there, to me it all looks like
we are making progress, which is good. If we find out asynchronous
doesn't work, synchronous is the only alternative.

And just so you don't get me wrong: Thanks for looking and working on
this. And thanks for sharing your opinions and insights! However making
a decision about going your way at this point does not seem reasonable
to me. We have plenty of time.

-- 

Thanks,

David / dhildenb

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 18:06                             ` Alexander Duyck
  2019-02-19 18:31                               ` David Hildenbrand
@ 2019-02-19 19:58                               ` Michael S. Tsirkin
  2019-02-19 20:02                                 ` David Hildenbrand
  1 sibling, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-19 19:58 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
> > I tend to like an asynchronous reporting approach as discussed in this
> > thread, we would have to see if Nitesh could get it implemented.
> 
> I agree it would be great if it could work. However I have concerns
> given that work on this patch set dates back to 2017, major issues
> such as working around device assignment have yet to be addressed,

BTW for device assignment to work, your idea of sending
data directly to kvm won't work, will it?
You need to update userspace so it can update VFIO right?
Another blocker for assignment is the ability to make holes
in an existing mapping - supported by hardware but
not by IOMMU drivers.

All the issues are shared with balloon btw, so that
could be another reason to use the balloon.

> and
> it seems like most of the effort is being focused on things that in my
> opinion are being over-engineered for little to no benefit.
> 
> I really think that simpler would be much better in terms of design in
> this case.
> 
> Thanks.
> 
> - Alex

I personally learned a lot from Wei's work on hinting.


All this doesn't mean your approach is not the right one.

-- 
MST

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 19:58                               ` Michael S. Tsirkin
@ 2019-02-19 20:02                                 ` David Hildenbrand
  2019-02-19 20:17                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-19 20:02 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alexander Duyck
  Cc: Nitesh Narayan Lal, kvm list, LKML, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On 19.02.19 20:58, Michael S. Tsirkin wrote:
> On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
>>> I tend to like an asynchronous reporting approach as discussed in this
>>> thread, we would have to see if Nitesh could get it implemented.
>>
>> I agree it would be great if it could work. However I have concerns
>> given that work on this patch set dates back to 2017, major issues
>> such as working around device assignment have yet to be addressed,
> 
> BTW for device assignment to work, your idea of sending
> data directly to kvm won't work, will it?
> You need to update userspace so it can update VFIO right?
> Another blocker for assignment is ability to make holes
> an an existing mapping - supported by hardware but
> not by IOMMU drivers.

I had the exact same thought and then realized that we decided to block
the balloon in user space until we figured out how to handle this properly.

I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
finding pinned pages, but I doubt it. Most probably we'll have to
disable hinting for device assignments as well.

> 
> All the issues are shared with balloon btw, so that
> could be another reason to use the balloon.

Yes, smells like it.

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 20:02                                 ` David Hildenbrand
@ 2019-02-19 20:17                                   ` Michael S. Tsirkin
  2019-02-19 20:21                                     ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-19 20:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Feb 19, 2019 at 09:02:52PM +0100, David Hildenbrand wrote:
> On 19.02.19 20:58, Michael S. Tsirkin wrote:
> > On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
> >>> I tend to like an asynchronous reporting approach as discussed in this
> >>> thread, we would have to see if Nitesh could get it implemented.
> >>
> >> I agree it would be great if it could work. However I have concerns
> >> given that work on this patch set dates back to 2017, major issues
> >> such as working around device assignment have yet to be addressed,
> > 
> > BTW for device assignment to work, your idea of sending
> > data directly to kvm won't work, will it?
> > You need to update userspace so it can update VFIO right?
> > Another blocker for assignment is ability to make holes
> > an an existing mapping - supported by hardware but
> > not by IOMMU drivers.
> 
> I had the exact same thought and then realized that we decided to block
> the balloon in user space until we figured out how to handle this properly.
> 
> I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
> finding pinned pages, but I doubt it. Most probably we'll have to
> disable hinting for device assignments as well.

OK but let's recognize it as a bug not a feature.

-- 
MST


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 20:17                                   ` Michael S. Tsirkin
@ 2019-02-19 20:21                                     ` David Hildenbrand
  2019-02-19 20:35                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2019-02-19 20:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 19.02.19 21:17, Michael S. Tsirkin wrote:
> On Tue, Feb 19, 2019 at 09:02:52PM +0100, David Hildenbrand wrote:
>> On 19.02.19 20:58, Michael S. Tsirkin wrote:
>>> On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
>>>>> I tend to like an asynchronous reporting approach as discussed in this
>>>>> thread, we would have to see if Nitesh could get it implemented.
>>>>
>>>> I agree it would be great if it could work. However I have concerns
>>>> given that work on this patch set dates back to 2017, major issues
>>>> such as working around device assignment have yet to be addressed,
>>>
>>> BTW for device assignment to work, your idea of sending
>>> data directly to kvm won't work, will it?
>>> You need to update userspace so it can update VFIO right?
>>> Another blocker for assignment is ability to make holes
>>> an an existing mapping - supported by hardware but
>>> not by IOMMU drivers.
>>
>> I had the exact same thought and then realized that we decided to block
>> the balloon in user space until we figured out how to handle this properly.
>>
>> I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
>> finding pinned pages, but I doubt it. Most probably we'll have to
>> disable hinting for device assignments as well.
> 
> OK but let's recognize it as a bug not a feature.
> 

Yes, btw interesting read: https://lwn.net/Articles/198380/

"Pages which have been locked into memory pose an extra challenge here -
they can be part of the page cache, but they still shouldn't be taken
away by the host system. So such pages cannot be marked as "volatile."
The problem is that figuring out if a page is locked is harder than it
might seem; it can involve scanning a list of virtual memory area (VMA)
structures, which is slow. So the hinting patches add a new flag to the
address_space structure to note that somebody has locked pages from that
address space in memory."

I assume locked here actually means pinned.

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 20:21                                     ` David Hildenbrand
@ 2019-02-19 20:35                                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-19 20:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Feb 19, 2019 at 09:21:20PM +0100, David Hildenbrand wrote:
> On 19.02.19 21:17, Michael S. Tsirkin wrote:
> > On Tue, Feb 19, 2019 at 09:02:52PM +0100, David Hildenbrand wrote:
> >> On 19.02.19 20:58, Michael S. Tsirkin wrote:
> >>> On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
> >>>>> I tend to like an asynchronous reporting approach as discussed in this
> >>>>> thread, we would have to see if Nitesh could get it implemented.
> >>>>
> >>>> I agree it would be great if it could work. However I have concerns
> >>>> given that work on this patch set dates back to 2017, major issues
> >>>> such as working around device assignment have yet to be addressed,
> >>>
> >>> BTW for device assignment to work, your idea of sending
> >>> data directly to kvm won't work, will it?
> >>> You need to update userspace so it can update VFIO right?
> >>> Another blocker for assignment is ability to make holes
> >>> an an existing mapping - supported by hardware but
> >>> not by IOMMU drivers.
> >>
> >> I had the exact same thought and then realized that we decided to block
> >> the balloon in user space until we figured out how to handle this properly.
> >>
> >> I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
> >> finding pinned pages, but I doubt it. Most probably we'll have to
> >> disable hinting for device assignments as well.
> > 
> > OK but let's recognize it as a bug not a feature.
> > 
> 
> Yes, btw interesting read: https://lwn.net/Articles/198380/

There's also slideware from Rik van Riel circa 2011. His idea
was tagging free pages in guest memory; freeing a page on the
host involves:
- drop the EPT PTE
- check that the page is still free
- unpin the page

This way you can hint without exits at all if you like.
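
The ordering in that scheme is the interesting part: the mapping is torn
down *before* re-checking the free tag, which closes the race with a
concurrent guest allocation. A toy model in C (all names illustrative, not
kernel or KVM APIs):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-page state, not real kernel structures. */
struct page_state {
	bool guest_free;	/* guest-side: page is tagged free   */
	bool ept_mapped;	/* host-side: EPT PTE is present     */
	bool pinned;		/* host-side: backing page is pinned */
};

/*
 * Attempt an exit-less hint: drop the mapping first, then re-check
 * that the guest still considers the page free. Only then is it
 * safe to unpin; otherwise restore the PTE and give up.
 */
static bool try_hint(struct page_state *pg)
{
	pg->ept_mapped = false;		/* 1. drop the EPT PTE       */
	if (pg->guest_free) {		/* 2. page still free?       */
		pg->pinned = false;	/* 3. safe to unpin          */
		return true;
	}
	pg->ept_mapped = true;		/* lost the race: reallocated */
	return false;
}
```

If the guest reallocates between steps 1 and 2, the access simply faults
the mapping back in, so no hypercall or exit is needed on the free path.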

> 
> "Pages which have been locked into memory pose an extra challenge here -
> they can be part of the page cache, but they still shouldn't be taken
> away by the host system. So such pages cannot be marked as "volatile."
> The problem is that figuring out if a page is locked is harder than it
> might seem; it can involve scanning a list of virtual memory area (VMA)
> structures, which is slow. So the hinting patches add a new flag to the
> address_space structure to note that somebody has locked pages from that
> address space in memory."
> 
> I assume locked here actually means pinned.

Locked seems to mean mlock there.

This seems to also resemble Xen's tmem a bit.


> -- 
> 
> Thanks,
> 
> David / dhildenb


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 18:31                               ` David Hildenbrand
@ 2019-02-19 21:57                                 ` Alexander Duyck
  2019-02-19 22:17                                   ` Michael S. Tsirkin
  2019-02-19 22:36                                   ` David Hildenbrand
  0 siblings, 2 replies; 116+ messages in thread
From: Alexander Duyck @ 2019-02-19 21:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, Michael S. Tsirkin, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Feb 19, 2019 at 10:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>> This essentially just ends up being another trade-off of CPU versus
> >>> memory though. Assuming we aren't using THP we are going to take a
> >>> penalty in terms of performance but could then free individual pages
> >>> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
> >>> much higher in general even without the hinting. I figure for x86 we
> >>> probably don't have too many options since if I am not mistaken
> >>> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.
> >>
> >> THP is an implementation detail in the hypervisor. Yes, it is the common
> >> case on x86. But it is e.g. not available on s390x yet. And we also want
> >> this mechanism to work on s390x (e.g. for nested virtualization setups
> >> as discussed).
> >>
> >> If we e.g. report any granularity after merging was done in the buddy,
> >> we could end up reporting everything from page size up to MAX_SIZE - 1,
> >> the hypervisor could ignore hints below a certain magic number, if it
> >> makes its life easier.
> >
> > For each architecture we can do a separate implementation of what to
> > hint on. We already do that for bare metal so why would we have guests
> > do the same type of hinting in the virtualization case when there are
> > fundamental differences in page size and features in each
> > architecture?
> >
> > This is another reason why I think the hypercall approach is a better
> > idea since each architecture is likely going to want to handle things
> > differently and it would be a pain to try and sort that all out in a
> > virtio driver.
>
> I can't follow. We are talking about something as simple as a minimum
> page granularity here that can easily be configured. Nothing that
> screams for different implementations. But I get your point, we could
> tune for different architectures.

I was thinking about the guest side of things. Basically if we need to
define different page orders for different architectures then we start
needing to do architecture specific includes. Then if we throw in
stuff like the fact that the first level of KVM can make use of the
host style hints then that is another thing that will be a difference
in the different architectures. I'm just worried this stuff is going
to start adding up to a bunch of "#ifdef" cruft if we are trying to do
this as a virtio driver.

> >
> >>>
> >>> As far as fragmentation my thought is that we may want to look into
> >>> adding support to the guest for prioritizing defragmentation on pages
> >>> lower than THP size. Then that way we could maintain the higher
> >>> overall performance with or without the hinting since shuffling lower
> >>> order pages around between guests would start to get expensive pretty
> >>> quick.
> >>
> >> My take would be, design an interface/mechanism that allows any kind of
> >> granularity. You can than balance between cpu overead and space shifting.
> >
> > The problem with using "any kind of granularity" is that in the case
> > of memory we are already having problems with 4K pages being deemed
> > too small of a granularity to be useful for anything and making
> > operations too expensive.
>
> No, sorry, s390x does it. And via batch reporting it could work. Not
> saying we should do page granularity, but "to be useful for anything" is
> just wrong.

Yeah, I was engaging in a bit of hyperbole. I have had a headache this
morning so I am a bit cranky.

So I am assuming the batching is the reason why you also have an
arch_alloc_page then for s390, so that you can abort the hint if a
page is reallocated before the hint is processed then? I just want to
confirm so that my understanding of this is correct.

If that is the case I would be much happier with an asynchronous page
hint setup as this doesn't deprive the guest of memory while waiting
on the hint. The current logic in the patches from Nitesh has the
pages unavailable to the guest while waiting on the hint and that has
me somewhat concerned as it is going to hurt cache locality as it will
guarantee that we cannot reuse the same page if we are doing a cycle
of alloc and free for the same page size.

> >
> > I'm open to using other page orders for other architectures. Nothing
> > says we have to stick with THP sized pages for all architectures. I
> > have just been focused on x86 and this seems like the best fit for the
> > balance between CPU and freeing of memory for now on that
> > architecture.
> >
> >> I feel like repeating myself, but on s390x hinting is done on page
> >> granularity, and I have never heard somebody say "how can I turn it off,
> >> this is slowing down my system too much.". All we know is that one
> >> hypercall per free is most probably not acceptable. We really have to
> >> play with the numbers.
> >
> > My thought was we could look at doing different implementations for
> > other architectures such as s390 and powerPC. Odds are the
> > implementations would be similar but have slight differences where
> > appropriate such as what order we should start hinting on, or if we
> > bypass the hypercall/virtio-balloon for a host native approach if
> > available.
> >
> >> I tend to like an asynchronous reporting approach as discussed in this
> >> thread, we would have to see if Nitesh could get it implemented.
> >
> > I agree it would be great if it could work. However I have concerns
> > given that work on this patch set dates back to 2017, major issues
> > such as working around device assignment have yet to be addressed, and
> > it seems like most of the effort is being focused on things that in my
> > opinion are being over-engineered for little to no benefit.
>
> I can understand that you are trying to push your solution. I would do
> the same. Again, I don't like a pure synchronous approach that works on
> one-element-at-a-time. Period. Other people might have other opinions.
> This is mine - luckily I don't have anything to say here :)
>
> MST also voted for an asynchronous solution if we can make it work.
> Nitesh made significant improvements since the 2017. Complicated stuff
> needs time. No need to rush. People have been talking about free page
> hinting since 2006. I talked to various people that experimented with
> bitmap based solutions two years ago.

Now that I think I have a better understanding of how the s390x is
handling this I'm beginning to come around to the idea of an
asynchronous setup. The one thing that has been bugging me about the
asynchronous approach is the fact that the pages are not available to
the guest while waiting on the hint to be completed. If we can do
something like an arch_alloc_page and that would abort the hint and
allow us to keep the page available while waiting on the hint that
would be my preferred way of handling this.

> So much to that, if you think your solution is the way to go, please
> follow up on it. Nitesh seems to have decided to look into the
> asynchronous approach you also called "great if it could work". As long
> as we don't run into elementary blockers there, to me it all looks like
> we are making progress, which is good. If we find out asynchronous
> doesn't work, synchronous is the only alternative.

I plan to follow up in the next week or so.

> And just so you don't get me wrong: Thanks for looking and working on
> this. And thanks for sharing your opinions and insights! However making
> a decision about going your way at this point does not seem reasonable
> to me. We have plenty of time.

I appreciate the feedback. Sorry if I seemed a bit short. As I
mentioned I've had a headache most of the morning which hasn't really
helped my mood.

Thanks.

- Alex


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 21:57                                 ` Alexander Duyck
@ 2019-02-19 22:17                                   ` Michael S. Tsirkin
  2019-02-19 22:36                                   ` David Hildenbrand
  1 sibling, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2019-02-19 22:17 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Feb 19, 2019 at 01:57:14PM -0800, Alexander Duyck wrote:
> On Tue, Feb 19, 2019 at 10:32 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > >>> This essentially just ends up being another trade-off of CPU versus
> > >>> memory though. Assuming we aren't using THP we are going to take a
> > >>> penalty in terms of performance but could then free individual pages
> > >>> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
> > >>> much higher in general even without the hinting. I figure for x86 we
> > >>> probably don't have too many options since if I am not mistaken
> > >>> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.
> > >>
> > >> THP is an implementation detail in the hypervisor. Yes, it is the common
> > >> case on x86. But it is e.g. not available on s390x yet. And we also want
> > >> this mechanism to work on s390x (e.g. for nested virtualization setups
> > >> as discussed).
> > >>
> > >> If we e.g. report any granularity after merging was done in the buddy,
> > >> we could end up reporting everything from page size up to MAX_SIZE - 1,
> > >> the hypervisor could ignore hints below a certain magic number, if it
> > >> makes its life easier.
> > >
> > > For each architecture we can do a separate implementation of what to
> > > hint on. We already do that for bare metal so why would we have guests
> > > do the same type of hinting in the virtualization case when there are
> > > fundamental differences in page size and features in each
> > > architecture?
> > >
> > > This is another reason why I think the hypercall approach is a better
> > > idea since each architecture is likely going to want to handle things
> > > differently and it would be a pain to try and sort that all out in a
> > > virtio driver.
> >
> > I can't follow. We are talking about something as simple as a minimum
> > page granularity here that can easily be configured. Nothing that
> > screams for different implementations. But I get your point, we could
> > tune for different architectures.
> 
> I was thinking about the guest side of things. Basically if we need to
> define different page orders for different architectures then we start
> needing to do architecture specific includes. Then if we throw in
> stuff like the fact that the first level of KVM can make use of the
> host style hints then that is another thing that will be a difference
> int he different architectures.

Sorry didn't catch this one. What are host style hints?

> I'm just worried this stuff is going
> to start adding up to a bunch of "#ifdef" cruft if we are trying to do
> this as a virtio driver.

I agree we want to avoid that.

And by comparison, if it's up to the host, or if it's tied to logic within
the guest (such as MAX_PAGE_ORDER as suggested by Linus) as opposed to the
CPU architecture, then virtio is easier as you can re-use config space and
feature bits to negotiate host/guest capabilities. Negotiating that via
hypercalls would mean adding a lot of new hypercalls.
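
As a hypothetical sketch of what that negotiation could look like (the
feature bit, config field, and names below are invented for illustration,
not part of any published virtio spec): one feature bit plus one config
field lets the host state the smallest order it cares about, and the guest
simply skips smaller hints.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bit and config layout, for illustration only. */
#define VIRTIO_HINT_F_MIN_ORDER	(1ULL << 0)

struct hint_config {
	uint32_t min_order;	/* valid if the feature was negotiated */
};

/* The guest hints only at orders >= the effective minimum. */
static uint32_t effective_min_order(uint64_t negotiated_features,
				    const struct hint_config *cfg,
				    uint32_t guest_default)
{
	if (negotiated_features & VIRTIO_HINT_F_MIN_ORDER)
		return cfg->min_order > guest_default ? cfg->min_order
						      : guest_default;
	return guest_default;	/* legacy host: the guest decides */
}
```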

I CC'd Wei Wang who implemented host-driven hints in the balloon right
now. Wei I wonder - could you try changing from MAX_PAGE_ORDER to
HUGETLB_PAGE_ORDER? Does this affect performance for you at all? Thanks!


> > >
> > >>>
> > >>> As far as fragmentation my thought is that we may want to look into
> > >>> adding support to the guest for prioritizing defragmentation on pages
> > >>> lower than THP size. Then that way we could maintain the higher
> > >>> overall performance with or without the hinting since shuffling lower
> > >>> order pages around between guests would start to get expensive pretty
> > >>> quick.
> > >>
> > >> My take would be, design an interface/mechanism that allows any kind of
> > >> granularity. You can than balance between cpu overead and space shifting.
> > >
> > > The problem with using "any kind of granularity" is that in the case
> > > of memory we are already having problems with 4K pages being deemed
> > > too small of a granularity to be useful for anything and making
> > > operations too expensive.
> >
> > No, sorry, s390x does it. And via batch reporting it could work. Not
> > saying we should do page granularity, but "to be useful for anything" is
> > just wrong.
> 
> Yeah, I was engaging in a bit of hyperbole. I have had a headache this
> morning so I am a bit cranky.
> 
> So I am assuming the batching is the reason why you also have a
> arch_alloc_page then for the s390 so that you can abort the hint if a
> page is reallocated before the hint is processed then? I just want to
> confirm so that my understanding of this is correct.
> 
> If that is the case I would be much happier with an asynchronous page
> hint setup as this doesn't deprive the guest of memory while waiting
> on the hint. The current logic in the patches from Nitesh has the
> pages unavailable to the guest while waiting on the hint and that has
> me somewhat concerned as it is going to hurt cache locality as it will
> guarantee that we cannot reuse the same page if we are doing a cycle
> of alloc and free for the same page size.
> > >
> > > I'm open to using other page orders for other architectures. Nothing
> > > says we have to stick with THP sized pages for all architectures. I
> > > have just been focused on x86 and this seems like the best fit for the
> > > balance between CPU and freeing of memory for now on that
> > > architecture.
> > >
> > >> I feel like repeating myself, but on s390x hinting is done on page
> > >> granularity, and I have never heard somebody say "how can I turn it off,
> > >> this is slowing down my system too much.". All we know is that one
> > >> hypercall per free is most probably not acceptable. We really have to
> > >> play with the numbers.
> > >
> > > My thought was we could look at doing different implementations for
> > > other architectures such as s390 and powerPC. Odds are the
> > > implementations would be similar but have slight differences where
> > > appropriate such as what order we should start hinting on, or if we
> > > bypass the hypercall/virtio-balloon for a host native approach if
> > > available.
> > >
> > >> I tend to like an asynchronous reporting approach as discussed in this
> > >> thread, we would have to see if Nitesh could get it implemented.
> > >
> > > I agree it would be great if it could work. However I have concerns
> > > given that work on this patch set dates back to 2017, major issues
> > > such as working around device assignment have yet to be addressed, and
> > > it seems like most of the effort is being focused on things that in my
> > > opinion are being over-engineered for little to no benefit.
> >
> > I can understand that you are trying to push your solution. I would do
> > the same. Again, I don't like a pure synchronous approach that works on
> > one-element-at-a-time. Period. Other people might have other opinions.
> > This is mine - luckily I don't have anything to say here :)
> >
> > MST also voted for an asynchronous solution if we can make it work.
> > Nitesh made significant improvements since the 2017. Complicated stuff
> > needs time. No need to rush. People have been talking about free page
> > hinting since 2006. I talked to various people that experimented with
> > bitmap based solutions two years ago.
> 
> Now that I think I have a better understanding of how the s390x is
> handling this I'm beginning to come around to the idea of an
> asynchronous setup. The one thing that has been bugging me about the
> asynchronous approach is the fact that the pages are not available to
> the guest while waiting on the hint to be completed. If we can do
> something like an arch_alloc_page and that would abort the hint and
> allow us to keep the page available while waiting on the hint that
> would be my preferred way of handling this.
> 
> > So much to that, if you think your solution is the way to go, please
> > follow up on it. Nitesh seems to have decided to look into the
> > asynchronous approach you also called "great if it could work". As long
> > as we don't run into elementary blockers there, to me it all looks like
> > we are making progress, which is good. If we find out asynchronous
> > doesn't work, synchronous is the only alternative.
> 
> I plan to follow up in the next week or so.
> 
> > And just so you don't get me wrong: Thanks for looking and working on
> > this. And thanks for sharing your opinions and insights! However making
> > a decision about going your way at this point does not seem reasonable
> > to me. We have plenty of time.
> 
> I appreciate the feedback. Sorry if I seemed a bit short. As I
> mentioned I've had a headache most of the morning which hasn't really
> helped my mood.
> 
> Thanks.
> 
> - Alex


-- 
MST


* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-19 21:57                                 ` Alexander Duyck
  2019-02-19 22:17                                   ` Michael S. Tsirkin
@ 2019-02-19 22:36                                   ` David Hildenbrand
  1 sibling, 0 replies; 116+ messages in thread
From: David Hildenbrand @ 2019-02-19 22:36 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Nitesh Narayan Lal, Michael S. Tsirkin, kvm list, LKML,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli


>> I can't follow. We are talking about something as simple as a minimum
>> page granularity here that can easily be configured. Nothing that
>> screams for different implementations. But I get your point, we could
>> tune for different architectures.
> 
> I was thinking about the guest side of things. Basically if we need to
> define different page orders for different architectures then we start
> needing to do architecture specific includes. Then if we throw in
> stuff like the fact that the first level of KVM can make use of the
> host style hints then that is another thing that will be a difference
> int he different architectures. I'm just worried this stuff is going
> to start adding up to a bunch of "#ifdef" cruft if we are trying to do
> this as a virtio driver.

I agree that something like that is to be avoided. As MST pointed out,
feature bits and config space are a nice way to solve that at least on
the virtio side.

> 
>>>
>>>>>
>>>>> As far as fragmentation my thought is that we may want to look into
>>>>> adding support to the guest for prioritizing defragmentation on pages
>>>>> lower than THP size. Then that way we could maintain the higher
>>>>> overall performance with or without the hinting since shuffling lower
>>>>> order pages around between guests would start to get expensive pretty
>>>>> quick.
>>>>
>>>> My take would be, design an interface/mechanism that allows any kind of
>>>> granularity. You can than balance between cpu overead and space shifting.
>>>
>>> The problem with using "any kind of granularity" is that in the case
>>> of memory we are already having problems with 4K pages being deemed
>>> too small of a granularity to be useful for anything and making
>>> operations too expensive.
>>
>> No, sorry, s390x does it. And via batch reporting it could work. Not
>> saying we should do page granularity, but "to be useful for anything" is
>> just wrong.
> 
> Yeah, I was engaging in a bit of hyperbole. I have had a headache this
> morning so I am a bit cranky.

No worries, I am very happy about this discussion. :)

> 
> So I am assuming the batching is the reason why you also have a
> arch_alloc_page then for the s390 so that you can abort the hint if a
> page is reallocated before the hint is processed then? I just want to
> confirm so that my understanding of this is correct.

s390x is very special as it actually communicates the page state to the
hypervisor via page table bits in the guest->host mapping. The reporting
is then done batched (and synchronous) via a list of PFNs. But via the
special page table bits (along with the ESSA instruction), requests for
single pages can be canceled any time. So allocation paths are *only*
blocked by a special page lock also part of the page table bits in the
guest->host mapping.

I guess now you are completely confused. The main point is: the actual
reporting to _perform_ the free is synchronous + batched. Canceling is
possible at any time via a special per-page synchronization mechanism.
(That's why pages are not taken out of the buddy: canceling is easy.)
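
A toy model of that s390x-style scheme (names illustrative, not the real
ESSA machinery): the hint state lives next to the page, the batch only
collects candidates, and the allocation path cancels a pending hint with a
single store, so pages never leave the allocator.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative per-page hint state. */
enum pstate { P_USED, P_FREE_PENDING, P_HINTED };

struct toy_page {
	enum pstate state;
};

/* Allocation path: cancel a pending hint with one store. */
static void toy_alloc(struct toy_page *pg)
{
	pg->state = P_USED;
}

/*
 * Batched, synchronous report: only pages still pending are handed
 * to the host; anything canceled in the meantime is skipped.
 */
static size_t toy_report(struct toy_page *batch, size_t n)
{
	size_t reported = 0;

	for (size_t i = 0; i < n; i++) {
		if (batch[i].state == P_FREE_PENDING) {
			batch[i].state = P_HINTED;
			reported++;
		}
	}
	return reported;
}
```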

> 
> If that is the case I would be much happier with an asynchronous page
> hint setup as this doesn't deprive the guest of memory while waiting
> on the hint. The current logic in the patches from Nitesh has the
> pages unavailable to the guest while waiting on the hint and that has
> me somewhat concerned as it is going to hurt cache locality as it will
> guarantee that we cannot reuse the same page if we are doing a cycle
> of alloc and free for the same page size.

My view on things on the current plan and your question:

1. We queue up *potential* hints on kfree per VCPU. We might only queue
after merging in the buddy and if we exceed a certain page order.

2. When that list is full, we *try to* take these pages out of the
buddy. We then trigger asynchronous reporting for the ones where we
succeeded.

3. Once reporting returns, we put back the pages to the buddy.

Between 1 and 2, any page can be reallocated. 2. will simply ignore the
page if it is no longer in the buddy. Pages are removed from the buddy
only between 2 and 3. So hopefully a very short time.
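
The three steps above can be sketched as a toy model in plain C (all names
are illustrative; this is not the proposed kernel code): queuing takes no
page out of the allocator, and anything reallocated between steps 1 and 2
is silently skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_MAX 4	/* stands in for the per-VCPU capacity */

/* Illustrative stand-ins for real pages and the per-VCPU queue. */
struct toy_page {
	bool in_buddy;
	bool reported;
};

struct hint_queue {
	struct toy_page *pages[QUEUE_MAX];
	size_t n;
};

/* Step 1: queue a *potential* hint on free - no page is taken yet. */
static bool queue_hint(struct hint_queue *q, struct toy_page *pg)
{
	if (q->n == QUEUE_MAX)
		return false;
	q->pages[q->n++] = pg;
	return true;
}

/*
 * Steps 2 + 3: isolate the pages still in the buddy, "report" them,
 * then put them back. Reallocated pages are skipped, so they were
 * never unavailable to the guest.
 */
static size_t flush_hints(struct hint_queue *q)
{
	size_t reported = 0;

	for (size_t i = 0; i < q->n; i++) {
		struct toy_page *pg = q->pages[i];

		if (!pg->in_buddy)	/* reallocated meanwhile */
			continue;
		pg->in_buddy = false;	/* step 2: isolate       */
		pg->reported = true;	/*         async report  */
		pg->in_buddy = true;	/* step 3: back to buddy */
		reported++;
	}
	q->n = 0;
	return reported;
}
```

Pages are only out of the buddy inside flush_hints(), which is the "very
short time" mentioned above.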

Especially, the queuing might mitigate cache locality issues. And there
are the pcpu lists that - as far as I remember - won't be touched as
these pages are not "official buddy pages" (yet). But this is an
interesting point to keep in mind.

> 
>>>
>>> I'm open to using other page orders for other architectures. Nothing
>>> says we have to stick with THP sized pages for all architectures. I
>>> have just been focused on x86 and this seems like the best fit for the
>>> balance between CPU and freeing of memory for now on that
>>> architecture.
>>>
>>>> I feel like repeating myself, but on s390x hinting is done on page
>>>> granularity, and I have never heard somebody say "how can I turn it off,
>>>> this is slowing down my system too much.". All we know is that one
>>>> hypercall per free is most probably not acceptable. We really have to
>>>> play with the numbers.
>>>
>>> My thought was we could look at doing different implementations for
>>> other architectures such as s390 and powerPC. Odds are the
>>> implementations would be similar but have slight differences where
>>> appropriate such as what order we should start hinting on, or if we
>>> bypass the hypercall/virtio-balloon for a host native approach if
>>> available.
>>>
>>>> I tend to like an asynchronous reporting approach as discussed in this
>>>> thread, we would have to see if Nitesh could get it implemented.
>>>
>>> I agree it would be great if it could work. However I have concerns
>>> given that work on this patch set dates back to 2017, major issues
>>> such as working around device assignment have yet to be addressed, and
>>> it seems like most of the effort is being focused on things that in my
>>> opinion are being over-engineered for little to no benefit.
>>
>> I can understand that you are trying to push your solution. I would do
>> the same. Again, I don't like a pure synchronous approach that works on
>> one-element-at-a-time. Period. Other people might have other opinions.
>> This is mine - luckily I don't have anything to say here :)
>>
>> MST also voted for an asynchronous solution if we can make it work.
Nitesh has made significant improvements since 2017. Complicated stuff
>> needs time. No need to rush. People have been talking about free page
>> hinting since 2006. I talked to various people that experimented with
>> bitmap based solutions two years ago.
> 
> Now that I think I have a better understanding of how the s390x is
> handling this I'm beginning to come around to the idea of an
> asynchronous setup. The one thing that has been bugging me about the
> asynchronous approach is the fact that the pages are not available to
> the guest while waiting on the hint to be completed. If we can do
> something like an arch_alloc_page and that would abort the hint and
> allow us to keep the page available while waiting on the hint that
> would be my preferred way of handling this.

We'll have to see if any kind of abort is easily possible and even
necessary. s390x has the advantage that this synchronization (e.g. for
aborting a free) is built into the architecture (guest->host page
tables). I guess we cannot tell how much of an issue this actually is
before Nitesh has some kind of prototype.

> 
>> So much to that, if you think your solution is the way to go, please
>> follow up on it. Nitesh seems to have decided to look into the
>> asynchronous approach you also called "great if it could work". As long
>> as we don't run into elementary blockers there, to me it all looks like
>> we are making progress, which is good. If we find out asynchronous
>> doesn't work, synchronous is the only alternative.
> 
> I plan to follow up in the next week or so.
> 
>> And just so you don't get me wrong: Thanks for looking and working on
>> this. And thanks for sharing your opinions and insights! However making
>> a decision about going your way at this point does not seem reasonable
>> to me. We have plenty of time.
> 
> I appreciate the feedback. Sorry if I seemed a bit short. As I
> mentioned I've had a headache most of the morning which hasn't really
> helped my mood.

Absolutely no problem Alex, thanks!


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
                   ` (10 preceding siblings ...)
  2019-02-16  9:40 ` David Hildenbrand
@ 2019-02-23  0:02 ` Alexander Duyck
  2019-02-25 13:01   ` Nitesh Narayan Lal
  11 siblings, 1 reply; 116+ messages in thread
From: Alexander Duyck @ 2019-02-23  0:02 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang,
	Yang Zhang, Rik van Riel, David Hildenbrand, Michael S. Tsirkin,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On Mon, Feb 4, 2019 at 1:47 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables guests with no page cache to rapidly free memory to, and reclaim memory from, the host.
>
> Benefit:
> With this patch-series, in our test-case, executed on a single system and single NUMA node with 15GB memory, we were able to successfully launch at least 5 guests
> when page hinting was enabled and 3 without it. (Detailed explanation of the test procedure is provided at the bottom).
>
> Changelog in V8:
> In this patch-series, the earlier approach [1] which was used to capture and scan the pages freed by the guest has been changed. The new approach is briefly described below:
>
> The patch-set still leverages the existing arch_free_page() to add this functionality. It maintains a per-CPU array which is used to store the pages freed by the guest. The maximum number of entries it can hold is defined by MAX_FGPT_ENTRIES (1000). When the array is completely filled, it is scanned and only the pages which are still available in the buddy are kept. This process continues until the array is filled with pages which are part of the buddy free list, after which a per-CPU kernel thread is woken up.
> This kernel thread rescans the per-CPU array for any re-allocation; if a page has not been reallocated and is present in the buddy, it attempts to isolate it from the buddy. If it is successfully isolated, the page is added to another per-CPU array. Once the entire scanning process is complete, all the isolated pages are reported to the host through the existing virtio-balloon driver.
>
> Known Issues:
>         * Fixed array size: The problem with having a fixed/hardcoded array size arises when the size of the guest varies. For example, when the guest grows and starts making large allocations, the fixed size limits this solution's ability to capture all the freed pages, resulting in less guest free memory being reported to the host.
>
> Known code re-work:
>         * Plan to re-use Wei's work, which communicates the poison value to the host.
>         * The nomenclature used in virtio-balloon needs to be changed so that the code can easily be distinguished from Wei's Free Page Hint code.
>         * Sorting based on zonenum, to avoid repetitive zone locks for the same zone.
>
> Other required work:
>         * Run other benchmarks to evaluate the performance/impact of this approach.
>
> Test case:
> Setup:
> Memory-15837 MB
> Guest Memory Size-5 GB
> Swap-Disabled
> Test Program-Simple program which allocates 4GB memory via malloc, touches it via memset and exits.
> Use case-Number of guests that can be launched completely including the successful execution of the test program.
> Procedure:
> The first guest is launched and once its console is up, the test allocation program is executed with a 4 GB memory request (due to this the guest occupies almost 4-5 GB of memory in the host on a system without page hinting). Once this program exits, another guest is launched in the host and the same process is followed. We continue launching guests until one gets killed due to a low memory condition in the host.
>
> Result:
> Without Hinting-3 Guests
> With Hinting-5 to 7 Guests(Based on the amount of memory freed/captured).
>
> [1] https://www.spinics.net/lists/kvm/msg170113.html

So I tried reproducing your test and I am not having much luck.
According to the sysctl in the guest I am seeing
"vm.guest-page-hinting = 1", which is supposed to indicate that
hinting is enabled in both QEMU and the guest, right? I'm just wanting
to verify that this is the case before I start doing any debugging.

I'm assuming you never really ran any multi-threaded tests on a
multi-CPU guest, did you? With the patches applied I am seeing
stability issues. If I enable a VM with multiple CPUs and run
something like the page_fault1 test from the will-it-scale suite I am
seeing multiple traces being generated by the guest kernel and it
ultimately just hangs.

I have included the traces below. There end up being three specific
issues: a detected double free, an RCU stall, and then repeated
complaints about a soft lockup.

Thanks.

- Alex

-- This looks like a page complaining about a double add when added to
the LRU --
[   50.479635] list_add double add: new=fffff64480000008,
prev=ffffa000fffd50c0, next=fffff64480000008.
[   50.481066] ------------[ cut here ]------------
[   50.481753] kernel BUG at lib/list_debug.c:31!
[   50.482448] invalid opcode: 0000 [#1] SMP PTI
[   50.483108] CPU: 1 PID: 852 Comm: hinting/1 Not tainted
5.0.0-rc7-next-20190219-baseline+ #50
[   50.486362] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[   50.487881] RIP: 0010:__list_add_valid+0x4b/0x70
[   50.488623] Code: 00 00 c3 48 89 c1 48 c7 c7 d8 70 10 9e 31 c0 e8
4f db c8 ff 0f 0b 48 89 c1 48 89 fe 31 c0 48 c7 c7 88 71 10 9e e8 39
db c8 ff <0f> 0b 48 89 d1 48 c7 c7 30 71 10 9e 48 89 f2 48 89 c6 31 c0
e8 20
[   50.492626] RSP: 0018:ffffb9a8c3b4bdf0 EFLAGS: 00010246
[   50.494189] RAX: 0000000000000058 RBX: ffffa000fffd50c0 RCX: 0000000000000000
[   50.496308] RDX: 0000000000000000 RSI: ffffa000df85e6c8 RDI: ffffa000df85e6c8
[   50.497876] RBP: ffffa000fffd50c0 R08: 0000000000000273 R09: 0000000000000005
[   50.498981] R10: 0000000000000000 R11: ffffb9a8c3b4bb70 R12: fffff64480000008
[   50.500077] R13: fffff64480000008 R14: fffff64480000000 R15: ffffa000fffd5000
[   50.501184] FS:  0000000000000000(0000) GS:ffffa000df840000(0000)
knlGS:0000000000000000
[   50.502432] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.503325] CR2: 00007ffff6e47000 CR3: 000000080f76c002 CR4: 0000000000160ee0
[   50.504431] Call Trace:
[   50.505464]  free_one_page+0x2b5/0x470
[   50.506070]  hyperlist_ready+0xa9/0xc0
[   50.506662]  hinting_fn+0x1db/0x3c0
[   50.507220]  smpboot_thread_fn+0x10e/0x160
[   50.507868]  kthread+0xf8/0x130
[   50.508371]  ? sort_range+0x20/0x20
[   50.508934]  ? kthread_bind+0x10/0x10
[   50.509520]  ret_from_fork+0x35/0x40
[   50.510098] Modules linked in: ip6t_rpfilter ip6t_REJECT
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle
ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
iptable_security ebtable_filter ebtables ip6table_filter ip6_tables
sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
kvm_intel kvm ppdev irqbypass parport_pc joydev virtio_balloon
pcc_cpufreq i2c_piix4 pcspkr parport xfs libcrc32c cirrus
drm_kms_helper ttm drm e1000 crc32c_intel virtio_blk ata_generic
floppy serio_raw pata_acpi qemu_fw_cfg
[   50.519202] ---[ end trace 141fe2acdf2e3818 ]---
[   50.519935] RIP: 0010:__list_add_valid+0x4b/0x70
[   50.520675] Code: 00 00 c3 48 89 c1 48 c7 c7 d8 70 10 9e 31 c0 e8
4f db c8 ff 0f 0b 48 89 c1 48 89 fe 31 c0 48 c7 c7 88 71 10 9e e8 39
db c8 ff <0f> 0b 48 89 d1 48 c7 c7 30 71 10 9e 48 89 f2 48 89 c6 31 c0
e8 20
[   50.523570] RSP: 0018:ffffb9a8c3b4bdf0 EFLAGS: 00010246
[   50.524399] RAX: 0000000000000058 RBX: ffffa000fffd50c0 RCX: 0000000000000000
[   50.525516] RDX: 0000000000000000 RSI: ffffa000df85e6c8 RDI: ffffa000df85e6c8
[   50.526634] RBP: ffffa000fffd50c0 R08: 0000000000000273 R09: 0000000000000005
[   50.527754] R10: 0000000000000000 R11: ffffb9a8c3b4bb70 R12: fffff64480000008
[   50.528872] R13: fffff64480000008 R14: fffff64480000000 R15: ffffa000fffd5000
[   50.530004] FS:  0000000000000000(0000) GS:ffffa000df840000(0000)
knlGS:0000000000000000
[   50.531276] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.532189] CR2: 00007ffff6e47000 CR3: 000000080f76c002 CR4: 0000000000160ee0

-- This appears to be a deadlock on the zone lock --
[  156.436784] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  156.439195] rcu: 0-...0: (0 ticks this GP)
idle=6ca/1/0x4000000000000000 softirq=10718/10718 fqs=2546
[  156.440810] rcu: 1-...0: (1 GPs behind)
idle=8f2/1/0x4000000000000000 softirq=8233/8235 fqs=2547
[  156.442320] rcu: 2-...0: (0 ticks this GP)
idle=ae2/1/0x4000000000000002 softirq=6779/6779 fqs=2547
[  156.443910] rcu: 3-...0: (0 ticks this GP)
idle=456/1/0x4000000000000000 softirq=1616/1616 fqs=2547
[  156.445454] rcu: (detected by 14, t=60109 jiffies, g=17493, q=31)
[  156.446545] Sending NMI from CPU 14 to CPUs 0:
[  156.448330] NMI backtrace for cpu 0
[  156.448331] CPU: 0 PID: 1308 Comm: page_fault1_pro Tainted: G
D           5.0.0-rc7-next-20190219-baseline+ #50
[  156.448331] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  156.448332] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
[  156.448332] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
75 3e
[  156.448333] RSP: 0000:ffffb9a8c3e83c10 EFLAGS: 00000002
[  156.448339] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
[  156.448340] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
[  156.448340] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
[  156.448341] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
[  156.448341] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
[  156.448342] FS:  00007ffff7fec440(0000) GS:ffffa000df800000(0000)
knlGS:0000000000000000
[  156.448342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.448342] CR2: 00007fffefe2d000 CR3: 0000000695904004 CR4: 0000000000160ef0
[  156.448343] Call Trace:
[  156.448343]  get_page_from_freelist+0x50f/0x1280
[  156.448343]  ? get_page_from_freelist+0xa44/0x1280
[  156.448344]  __alloc_pages_nodemask+0x141/0x2e0
[  156.448344]  alloc_pages_vma+0x73/0x180
[  156.448344]  __handle_mm_fault+0xd59/0x14e0
[  156.448345]  handle_mm_fault+0xfa/0x210
[  156.448345]  __do_page_fault+0x207/0x4c0
[  156.448345]  do_page_fault+0x32/0x140
[  156.448346]  ? async_page_fault+0x8/0x30
[  156.448346]  async_page_fault+0x1e/0x30
[  156.448346] RIP: 0033:0x401840
[  156.448347] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
ff ff
[  156.448347] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
[  156.448348] RAX: 00007fffeee48000 RBX: 00007ffff7ff7000 RCX: 0000000000fe5000
[  156.448348] RDX: 00007fffefe2d000 RSI: 0000000008000000 RDI: 0000000000000000
[  156.448349] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
[  156.448349] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
[  156.448349] R13: 0000000000000000 R14: 0000000000610710 R15: 0000000000000005
[  156.448355] Sending NMI from CPU 14 to CPUs 1:
[  156.489676] NMI backtrace for cpu 1
[  156.489677] CPU: 1 PID: 1309 Comm: page_fault1_pro Tainted: G
D           5.0.0-rc7-next-20190219-baseline+ #50
[  156.489677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  156.489678] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
[  156.489678] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
75 3e
[  156.489679] RSP: 0000:ffffb9a8c3b4bc10 EFLAGS: 00000002
[  156.489679] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
[  156.489680] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
[  156.489680] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
[  156.489680] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
[  156.489681] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
[  156.489681] FS:  00007ffff7fec440(0000) GS:ffffa000df840000(0000)
knlGS:0000000000000000
[  156.489682] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.489682] CR2: 00007ffff4608000 CR3: 000000081ddf6003 CR4: 0000000000160ee0
[  156.489682] Call Trace:
[  156.489683]  get_page_from_freelist+0x50f/0x1280
[  156.489683]  ? get_page_from_freelist+0xa44/0x1280
[  156.489683]  __alloc_pages_nodemask+0x141/0x2e0
[  156.489683]  alloc_pages_vma+0x73/0x180
[  156.489684]  __handle_mm_fault+0xd59/0x14e0
[  156.489684]  handle_mm_fault+0xfa/0x210
[  156.489684]  __do_page_fault+0x207/0x4c0
[  156.489685]  do_page_fault+0x32/0x140
[  156.489685]  ? async_page_fault+0x8/0x30
[  156.489685]  async_page_fault+0x1e/0x30
[  156.489686] RIP: 0033:0x401840
[  156.489686] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
ff ff
[  156.489687] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
[  156.489687] RAX: 00007fffeee48000 RBX: 00007ffff7ff7080 RCX: 00000000057c0000
[  156.489692] RDX: 00007ffff4608000 RSI: 0000000008000000 RDI: 0000000000000000
[  156.489693] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
[  156.489693] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
[  156.489694] R13: 0000000000000000 R14: 000000000060f870 R15: 0000000000000005
[  156.489696] Sending NMI from CPU 14 to CPUs 2:
[  156.530601] NMI backtrace for cpu 2
[  156.530602] CPU: 2 PID: 858 Comm: hinting/2 Tainted: G      D
    5.0.0-rc7-next-20190219-baseline+ #50
[  156.530602] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  156.530603] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
[  156.530603] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
75 3e
[  156.530604] RSP: 0018:ffffa000df883e38 EFLAGS: 00000002
[  156.530604] RAX: 0000000000000001 RBX: fffff644a05a0ec8 RCX: dead000000000200
[  156.530605] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
[  156.530605] RBP: ffffa000df8af340 R08: ffffa000da2b2000 R09: 0000000000000100
[  156.530606] R10: 0000000000000004 R11: 0000000000000005 R12: fffff6449fb5fb08
[  156.530606] R13: ffffa000fffd5d00 R14: 0000000000000001 R15: 0000000000000001
[  156.530606] FS:  0000000000000000(0000) GS:ffffa000df880000(0000)
knlGS:0000000000000000
[  156.530607] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.530607] CR2: 00007ffff6e47000 CR3: 0000000813b34003 CR4: 0000000000160ee0
[  156.530607] Call Trace:
[  156.530608]  <IRQ>
[  156.530608]  free_pcppages_bulk+0x1af/0x6d0
[  156.530608]  free_unref_page+0x54/0x70
[  156.530608]  tlb_remove_table_rcu+0x23/0x40
[  156.530609]  rcu_core+0x2b0/0x470
[  156.530609]  __do_softirq+0xde/0x2bf
[  156.530609]  irq_exit+0xd5/0xe0
[  156.530610]  smp_apic_timer_interrupt+0x74/0x140
[  156.530610]  apic_timer_interrupt+0xf/0x20
[  156.530610]  </IRQ>
[  156.530611] RIP: 0010:_raw_spin_lock+0x10/0x20
[  156.530611] Code: b8 01 00 00 00 c3 48 8b 3c 24 be 00 02 00 00 e8
f6 cf 77 ff 31 c0 c3 0f 1f 00 0f 1f 44 00 00 31 c0 ba 01 00 00 00 f0
0f b1 17 <0f> 94 c2 84 d2 74 02 f3 c3 89 c6 e9 d0 e8 7c ff 0f 1f 44 00
00 65
[  156.530612] RSP: 0018:ffffb9a8c3bf3df0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[  156.530612] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[  156.530613] RDX: 0000000000000001 RSI: fffff6449fd4aec0 RDI: ffffa000fffd6240
[  156.530613] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000002
[  156.530613] R10: 0000000000000000 R11: 0000000000003bf3 R12: 00000000007f52bb
[  156.530614] R13: 00000000007ecca4 R14: fffff6449fd4aec0 R15: ffffa000fffd5d00
[  156.530614]  free_one_page+0x32/0x470
[  156.530614]  ? __switch_to_asm+0x40/0x70
[  156.530615]  hyperlist_ready+0xa9/0xc0
[  156.530615]  hinting_fn+0x1db/0x3c0
[  156.530615]  smpboot_thread_fn+0x10e/0x160
[  156.530616]  kthread+0xf8/0x130
[  156.530616]  ? sort_range+0x20/0x20
[  156.530616]  ? kthread_bind+0x10/0x10
[  156.530616]  ret_from_fork+0x35/0x40
[  156.530619] Sending NMI from CPU 14 to CPUs 3:
[  156.577112] NMI backtrace for cpu 3
[  156.577113] CPU: 3 PID: 1311 Comm: page_fault1_pro Tainted: G
D           5.0.0-rc7-next-20190219-baseline+ #50
[  156.577113] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  156.577114] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
[  156.577114] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
75 3e
[  156.577115] RSP: 0000:ffffb9a8c407fc10 EFLAGS: 00000002
[  156.577115] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
[  156.577116] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
[  156.577116] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
[  156.577121] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
[  156.577122] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
[  156.577122] FS:  00007ffff7fec440(0000) GS:ffffa000df8c0000(0000)
knlGS:0000000000000000
[  156.577122] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.577123] CR2: 00007ffff398a000 CR3: 000000081aa00003 CR4: 0000000000160ee0
[  156.577123] Call Trace:
[  156.577123]  get_page_from_freelist+0x50f/0x1280
[  156.577124]  ? get_page_from_freelist+0xa44/0x1280
[  156.577124]  ? try_charge+0x637/0x860
[  156.577124]  __alloc_pages_nodemask+0x141/0x2e0
[  156.577125]  alloc_pages_vma+0x73/0x180
[  156.577125]  __handle_mm_fault+0xd59/0x14e0
[  156.577125]  handle_mm_fault+0xfa/0x210
[  156.577126]  __do_page_fault+0x207/0x4c0
[  156.577126]  do_page_fault+0x32/0x140
[  156.577126]  ? async_page_fault+0x8/0x30
[  156.577127]  async_page_fault+0x1e/0x30
[  156.577127] RIP: 0033:0x401840
[  156.577128] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
ff ff
[  156.577128] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
[  156.577129] RAX: 00007fffeee48000 RBX: 00007ffff7ff7180 RCX: 0000000004b42000
[  156.577129] RDX: 00007ffff398a000 RSI: 0000000008000000 RDI: 0000000000000000
[  156.577130] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
[  156.577130] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
[  156.577130] R13: 0000000000000000 R14: 000000000060db00 R15: 0000000000000005

-- After the above two it starts spitting this one out every 10 - 30
seconds or so --
[  183.788386] watchdog: BUG: soft lockup - CPU#14 stuck for 23s!
[kworker/14:1:121]
[  183.790003] Modules linked in: ip6t_rpfilter ip6t_REJECT
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle
ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
iptable_security ebtable_filter ebtables ip6table_filter ip6_tables
sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
kvm_intel kvm ppdev irqbypass parport_pc joydev virtio_balloon
pcc_cpufreq i2c_piix4 pcspkr parport xfs libcrc32c cirrus
drm_kms_helper ttm drm e1000 crc32c_intel virtio_blk ata_generic
floppy serio_raw pata_acpi qemu_fw_cfg
[  183.799984] CPU: 14 PID: 121 Comm: kworker/14:1 Tainted: G      D
        5.0.0-rc7-next-20190219-baseline+ #50
[  183.801674] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  183.803078] Workqueue: events netstamp_clear
[  183.803873] RIP: 0010:smp_call_function_many+0x206/0x260
[  183.804847] Code: e8 0f 97 7c 00 3b 05 bd d1 1e 01 0f 83 7c fe ff
ff 48 63 d0 48 8b 4d 00 48 03 0c d5 80 28 18 9e 8b 51 18 83 e2 01 74
0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 0f b6 4c 24 0c 48 83 c4 10 89
ef 5b
[  183.808273] RSP: 0018:ffffb9a8c35a3d38 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff13
[  183.809662] RAX: 0000000000000000 RBX: ffffa000dfba9d88 RCX: ffffa000df8301c0
[  183.810971] RDX: 0000000000000001 RSI: 0000000000000100 RDI: ffffa000dfba9d88
[  183.812268] RBP: ffffa000dfba9d80 R08: 0000000000000000 R09: 0000000000003fff
[  183.813582] R10: 0000000000000000 R11: 000000000000000f R12: ffffffff9d02f690
[  183.814884] R13: 0000000000000000 R14: ffffa000dfba9da8 R15: 0000000000000100
[  183.816195] FS:  0000000000000000(0000) GS:ffffa000dfb80000(0000)
knlGS:0000000000000000
[  183.817673] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  183.818729] CR2: 00007ffff704b080 CR3: 0000000814c48001 CR4: 0000000000160ee0
[  183.820038] Call Trace:
[  183.820510]  ? netif_receive_skb_list+0x68/0x4a0
[  183.821367]  ? poke_int3_handler+0x40/0x40
[  183.822126]  ? netif_receive_skb_list+0x69/0x4a0
[  183.822975]  on_each_cpu+0x28/0x60
[  183.823611]  ? netif_receive_skb_list+0x68/0x4a0
[  183.824467]  text_poke_bp+0x68/0xe0
[  183.825126]  ? netif_receive_skb_list+0x68/0x4a0
[  183.825983]  __jump_label_transform+0x101/0x140
[  183.826829]  arch_jump_label_transform+0x26/0x40
[  183.827687]  __jump_label_update+0x56/0xc0
[  183.828456]  static_key_enable_cpuslocked+0x57/0x80
[  183.829358]  static_key_enable+0x16/0x20
[  183.830085]  process_one_work+0x16c/0x380
[  183.830831]  worker_thread+0x49/0x3e0
[  183.831516]  kthread+0xf8/0x130
[  183.832106]  ? rescuer_thread+0x340/0x340
[  183.832848]  ? kthread_bind+0x10/0x10
[  183.833532]  ret_from_fork+0x35/0x40

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
  2019-02-23  0:02 ` Alexander Duyck
@ 2019-02-25 13:01   ` Nitesh Narayan Lal
  0 siblings, 0 replies; 116+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-25 13:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang,
	Yang Zhang, Rik van Riel, David Hildenbrand, Michael S. Tsirkin,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli


[-- Attachment #1.1: Type: text/plain, Size: 23284 bytes --]

On 2/22/19 7:02 PM, Alexander Duyck wrote:
> On Mon, Feb 4, 2019 at 1:47 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables guests with no page cache to rapidly free memory to, and reclaim memory from, the host.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system and single NUMA node with 15GB memory, we were able to successfully launch at least 5 guests
>> when page hinting was enabled and 3 without it. (Detailed explanation of the test procedure is provided at the bottom).
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and scan the pages freed by the guest has been changed. The new approach is briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this functionality. It maintains a per-CPU array which is used to store the pages freed by the guest. The maximum number of entries it can hold is defined by MAX_FGPT_ENTRIES (1000). When the array is completely filled, it is scanned and only the pages which are still available in the buddy are kept. This process continues until the array is filled with pages which are part of the buddy free list, after which a per-CPU kernel thread is woken up.
>> This kernel thread rescans the per-CPU array for any re-allocation; if a page has not been reallocated and is present in the buddy, it attempts to isolate it from the buddy. If it is successfully isolated, the page is added to another per-CPU array. Once the entire scanning process is complete, all the isolated pages are reported to the host through the existing virtio-balloon driver.
>>
>> Known Issues:
>>         * Fixed array size: The problem with having a fixed/hardcoded array size arises when the size of the guest varies. For example, when the guest grows and starts making large allocations, the fixed size limits this solution's ability to capture all the freed pages, resulting in less guest free memory being reported to the host.
>>
>> Known code re-work:
>>         * Plan to re-use Wei's work, which communicates the poison value to the host.
>>         * The nomenclature used in virtio-balloon needs to be changed so that the code can easily be distinguished from Wei's Free Page Hint code.
>>         * Sorting based on zonenum, to avoid repetitive zone locks for the same zone.
>>
>> Other required work:
>>         * Run other benchmarks to evaluate the performance/impact of this approach.
>>
>> Test case:
>> Setup:
>> Memory-15837 MB
>> Guest Memory Size-5 GB
>> Swap-Disabled
>> Test Program-Simple program which allocates 4GB memory via malloc, touches it via memset and exits.
>> Use case-Number of guests that can be launched completely including the successful execution of the test program.
>> Procedure:
>> The first guest is launched and once its console is up, the test allocation program is executed with a 4 GB memory request (due to this the guest occupies almost 4-5 GB of memory in the host on a system without page hinting). Once this program exits, another guest is launched in the host and the same process is followed. We continue launching guests until one gets killed due to a low memory condition in the host.
>>
>> Result:
>> Without Hinting-3 Guests
>> With Hinting-5 to 7 Guests(Based on the amount of memory freed/captured).
>>
>> [1] https://www.spinics.net/lists/kvm/msg170113.html
> So I tried reproducing your test and I am not having much luck.
> According to the sysctl in the guest I am seeing
> "vm.guest-page-hinting = 1", which is supposed to indicate that
> hinting is enabled in both QEMU and the guest, right?
That is correct. If your guest has the balloon driver enabled, it will
also enable the hinting.
> I'm just wanting
> to verify that this is the case before I start doing any debugging.
>
> I'm assuming you never really ran any multi-threaded tests on a
> multi-CPU guest, did you?
This is correct. I forgot to mention this as another todo item for me in
the cover email.
I will test multiple vCPUs once I finalize the design changes I am
currently working on.
Thanks for pointing this out.
>  With the patches applied I am seeing
> stability issues. If I enable a VM with multiple CPUs and run
> something like the page_fault1 test from the will-it-scale suite I am
> seeing multiple traces being generated by the guest kernel and it
> ultimately just hangs.
Once I am done with the changes I am currently working on, I will
look into this as well.
>
> I have included the traces below. There end up being three specific
> issues: a detected double free, an RCU stall, and then repeated
> complaints about a soft lockup.
>
> Thanks.
>
> - Alex
>
> -- This looks like a page complaining about a double add when added to
> the LRU --
> [   50.479635] list_add double add: new=fffff64480000008,
> prev=ffffa000fffd50c0, next=fffff64480000008.
> [   50.481066] ------------[ cut here ]------------
> [   50.481753] kernel BUG at lib/list_debug.c:31!
> [   50.482448] invalid opcode: 0000 [#1] SMP PTI
> [   50.483108] CPU: 1 PID: 852 Comm: hinting/1 Not tainted
> 5.0.0-rc7-next-20190219-baseline+ #50
> [   50.486362] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [   50.487881] RIP: 0010:__list_add_valid+0x4b/0x70
> [   50.488623] Code: 00 00 c3 48 89 c1 48 c7 c7 d8 70 10 9e 31 c0 e8
> 4f db c8 ff 0f 0b 48 89 c1 48 89 fe 31 c0 48 c7 c7 88 71 10 9e e8 39
> db c8 ff <0f> 0b 48 89 d1 48 c7 c7 30 71 10 9e 48 89 f2 48 89 c6 31 c0
> e8 20
> [   50.492626] RSP: 0018:ffffb9a8c3b4bdf0 EFLAGS: 00010246
> [   50.494189] RAX: 0000000000000058 RBX: ffffa000fffd50c0 RCX: 0000000000000000
> [   50.496308] RDX: 0000000000000000 RSI: ffffa000df85e6c8 RDI: ffffa000df85e6c8
> [   50.497876] RBP: ffffa000fffd50c0 R08: 0000000000000273 R09: 0000000000000005
> [   50.498981] R10: 0000000000000000 R11: ffffb9a8c3b4bb70 R12: fffff64480000008
> [   50.500077] R13: fffff64480000008 R14: fffff64480000000 R15: ffffa000fffd5000
> [   50.501184] FS:  0000000000000000(0000) GS:ffffa000df840000(0000)
> knlGS:0000000000000000
> [   50.502432] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   50.503325] CR2: 00007ffff6e47000 CR3: 000000080f76c002 CR4: 0000000000160ee0
> [   50.504431] Call Trace:
> [   50.505464]  free_one_page+0x2b5/0x470
> [   50.506070]  hyperlist_ready+0xa9/0xc0
> [   50.506662]  hinting_fn+0x1db/0x3c0
> [   50.507220]  smpboot_thread_fn+0x10e/0x160
> [   50.507868]  kthread+0xf8/0x130
> [   50.508371]  ? sort_range+0x20/0x20
> [   50.508934]  ? kthread_bind+0x10/0x10
> [   50.509520]  ret_from_fork+0x35/0x40
> [   50.510098] Modules linked in: ip6t_rpfilter ip6t_REJECT
> nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
> ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle
> ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
> iptable_security ebtable_filter ebtables ip6table_filter ip6_tables
> sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> kvm_intel kvm ppdev irqbypass parport_pc joydev virtio_balloon
> pcc_cpufreq i2c_piix4 pcspkr parport xfs libcrc32c cirrus
> drm_kms_helper ttm drm e1000 crc32c_intel virtio_blk ata_generic
> floppy serio_raw pata_acpi qemu_fw_cfg
> [   50.519202] ---[ end trace 141fe2acdf2e3818 ]---
> [   50.519935] RIP: 0010:__list_add_valid+0x4b/0x70
> [   50.520675] Code: 00 00 c3 48 89 c1 48 c7 c7 d8 70 10 9e 31 c0 e8
> 4f db c8 ff 0f 0b 48 89 c1 48 89 fe 31 c0 48 c7 c7 88 71 10 9e e8 39
> db c8 ff <0f> 0b 48 89 d1 48 c7 c7 30 71 10 9e 48 89 f2 48 89 c6 31 c0
> e8 20
> [   50.523570] RSP: 0018:ffffb9a8c3b4bdf0 EFLAGS: 00010246
> [   50.524399] RAX: 0000000000000058 RBX: ffffa000fffd50c0 RCX: 0000000000000000
> [   50.525516] RDX: 0000000000000000 RSI: ffffa000df85e6c8 RDI: ffffa000df85e6c8
> [   50.526634] RBP: ffffa000fffd50c0 R08: 0000000000000273 R09: 0000000000000005
> [   50.527754] R10: 0000000000000000 R11: ffffb9a8c3b4bb70 R12: fffff64480000008
> [   50.528872] R13: fffff64480000008 R14: fffff64480000000 R15: ffffa000fffd5000
> [   50.530004] FS:  0000000000000000(0000) GS:ffffa000df840000(0000)
> knlGS:0000000000000000
> [   50.531276] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   50.532189] CR2: 00007ffff6e47000 CR3: 000000080f76c002 CR4: 0000000000160ee0
>
> -- This appears to be a deadlock on the zone lock --
> [  156.436784] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [  156.439195] rcu: 0-...0: (0 ticks this GP)
> idle=6ca/1/0x4000000000000000 softirq=10718/10718 fqs=2546
> [  156.440810] rcu: 1-...0: (1 GPs behind)
> idle=8f2/1/0x4000000000000000 softirq=8233/8235 fqs=2547
> [  156.442320] rcu: 2-...0: (0 ticks this GP)
> idle=ae2/1/0x4000000000000002 softirq=6779/6779 fqs=2547
> [  156.443910] rcu: 3-...0: (0 ticks this GP)
> idle=456/1/0x4000000000000000 softirq=1616/1616 fqs=2547
> [  156.445454] rcu: (detected by 14, t=60109 jiffies, g=17493, q=31)
> [  156.446545] Sending NMI from CPU 14 to CPUs 0:
> [  156.448330] NMI backtrace for cpu 0
> [  156.448331] CPU: 0 PID: 1308 Comm: page_fault1_pro Tainted: G
> D           5.0.0-rc7-next-20190219-baseline+ #50
> [  156.448331] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  156.448332] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
> [  156.448332] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
> 44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
> c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
> 75 3e
> [  156.448333] RSP: 0000:ffffb9a8c3e83c10 EFLAGS: 00000002
> [  156.448339] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
> [  156.448340] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
> [  156.448340] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
> [  156.448341] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
> [  156.448341] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
> [  156.448342] FS:  00007ffff7fec440(0000) GS:ffffa000df800000(0000)
> knlGS:0000000000000000
> [  156.448342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  156.448342] CR2: 00007fffefe2d000 CR3: 0000000695904004 CR4: 0000000000160ef0
> [  156.448343] Call Trace:
> [  156.448343]  get_page_from_freelist+0x50f/0x1280
> [  156.448343]  ? get_page_from_freelist+0xa44/0x1280
> [  156.448344]  __alloc_pages_nodemask+0x141/0x2e0
> [  156.448344]  alloc_pages_vma+0x73/0x180
> [  156.448344]  __handle_mm_fault+0xd59/0x14e0
> [  156.448345]  handle_mm_fault+0xfa/0x210
> [  156.448345]  __do_page_fault+0x207/0x4c0
> [  156.448345]  do_page_fault+0x32/0x140
> [  156.448346]  ? async_page_fault+0x8/0x30
> [  156.448346]  async_page_fault+0x1e/0x30
> [  156.448346] RIP: 0033:0x401840
> [  156.448347] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
> 00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
> 48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
> ff ff
> [  156.448347] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
> [  156.448348] RAX: 00007fffeee48000 RBX: 00007ffff7ff7000 RCX: 0000000000fe5000
> [  156.448348] RDX: 00007fffefe2d000 RSI: 0000000008000000 RDI: 0000000000000000
> [  156.448349] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
> [  156.448349] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
> [  156.448349] R13: 0000000000000000 R14: 0000000000610710 R15: 0000000000000005
> [  156.448355] Sending NMI from CPU 14 to CPUs 1:
> [  156.489676] NMI backtrace for cpu 1
> [  156.489677] CPU: 1 PID: 1309 Comm: page_fault1_pro Tainted: G
> D           5.0.0-rc7-next-20190219-baseline+ #50
> [  156.489677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  156.489678] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
> [  156.489678] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
> 44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
> c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
> 75 3e
> [  156.489679] RSP: 0000:ffffb9a8c3b4bc10 EFLAGS: 00000002
> [  156.489679] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
> [  156.489680] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
> [  156.489680] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
> [  156.489680] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
> [  156.489681] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
> [  156.489681] FS:  00007ffff7fec440(0000) GS:ffffa000df840000(0000)
> knlGS:0000000000000000
> [  156.489682] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  156.489682] CR2: 00007ffff4608000 CR3: 000000081ddf6003 CR4: 0000000000160ee0
> [  156.489682] Call Trace:
> [  156.489683]  get_page_from_freelist+0x50f/0x1280
> [  156.489683]  ? get_page_from_freelist+0xa44/0x1280
> [  156.489683]  __alloc_pages_nodemask+0x141/0x2e0
> [  156.489683]  alloc_pages_vma+0x73/0x180
> [  156.489684]  __handle_mm_fault+0xd59/0x14e0
> [  156.489684]  handle_mm_fault+0xfa/0x210
> [  156.489684]  __do_page_fault+0x207/0x4c0
> [  156.489685]  do_page_fault+0x32/0x140
> [  156.489685]  ? async_page_fault+0x8/0x30
> [  156.489685]  async_page_fault+0x1e/0x30
> [  156.489686] RIP: 0033:0x401840
> [  156.489686] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
> 00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
> 48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
> ff ff
> [  156.489687] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
> [  156.489687] RAX: 00007fffeee48000 RBX: 00007ffff7ff7080 RCX: 00000000057c0000
> [  156.489692] RDX: 00007ffff4608000 RSI: 0000000008000000 RDI: 0000000000000000
> [  156.489693] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
> [  156.489693] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
> [  156.489694] R13: 0000000000000000 R14: 000000000060f870 R15: 0000000000000005
> [  156.489696] Sending NMI from CPU 14 to CPUs 2:
> [  156.530601] NMI backtrace for cpu 2
> [  156.530602] CPU: 2 PID: 858 Comm: hinting/2 Tainted: G      D
>     5.0.0-rc7-next-20190219-baseline+ #50
> [  156.530602] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  156.530603] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
> [  156.530603] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
> 44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
> c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
> 75 3e
> [  156.530604] RSP: 0018:ffffa000df883e38 EFLAGS: 00000002
> [  156.530604] RAX: 0000000000000001 RBX: fffff644a05a0ec8 RCX: dead000000000200
> [  156.530605] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
> [  156.530605] RBP: ffffa000df8af340 R08: ffffa000da2b2000 R09: 0000000000000100
> [  156.530606] R10: 0000000000000004 R11: 0000000000000005 R12: fffff6449fb5fb08
> [  156.530606] R13: ffffa000fffd5d00 R14: 0000000000000001 R15: 0000000000000001
> [  156.530606] FS:  0000000000000000(0000) GS:ffffa000df880000(0000)
> knlGS:0000000000000000
> [  156.530607] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  156.530607] CR2: 00007ffff6e47000 CR3: 0000000813b34003 CR4: 0000000000160ee0
> [  156.530607] Call Trace:
> [  156.530608]  <IRQ>
> [  156.530608]  free_pcppages_bulk+0x1af/0x6d0
> [  156.530608]  free_unref_page+0x54/0x70
> [  156.530608]  tlb_remove_table_rcu+0x23/0x40
> [  156.530609]  rcu_core+0x2b0/0x470
> [  156.530609]  __do_softirq+0xde/0x2bf
> [  156.530609]  irq_exit+0xd5/0xe0
> [  156.530610]  smp_apic_timer_interrupt+0x74/0x140
> [  156.530610]  apic_timer_interrupt+0xf/0x20
> [  156.530610]  </IRQ>
> [  156.530611] RIP: 0010:_raw_spin_lock+0x10/0x20
> [  156.530611] Code: b8 01 00 00 00 c3 48 8b 3c 24 be 00 02 00 00 e8
> f6 cf 77 ff 31 c0 c3 0f 1f 00 0f 1f 44 00 00 31 c0 ba 01 00 00 00 f0
> 0f b1 17 <0f> 94 c2 84 d2 74 02 f3 c3 89 c6 e9 d0 e8 7c ff 0f 1f 44 00
> 00 65
> [  156.530612] RSP: 0018:ffffb9a8c3bf3df0 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffff13
> [  156.530612] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
> [  156.530613] RDX: 0000000000000001 RSI: fffff6449fd4aec0 RDI: ffffa000fffd6240
> [  156.530613] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000002
> [  156.530613] R10: 0000000000000000 R11: 0000000000003bf3 R12: 00000000007f52bb
> [  156.530614] R13: 00000000007ecca4 R14: fffff6449fd4aec0 R15: ffffa000fffd5d00
> [  156.530614]  free_one_page+0x32/0x470
> [  156.530614]  ? __switch_to_asm+0x40/0x70
> [  156.530615]  hyperlist_ready+0xa9/0xc0
> [  156.530615]  hinting_fn+0x1db/0x3c0
> [  156.530615]  smpboot_thread_fn+0x10e/0x160
> [  156.530616]  kthread+0xf8/0x130
> [  156.530616]  ? sort_range+0x20/0x20
> [  156.530616]  ? kthread_bind+0x10/0x10
> [  156.530616]  ret_from_fork+0x35/0x40
> [  156.530619] Sending NMI from CPU 14 to CPUs 3:
> [  156.577112] NMI backtrace for cpu 3
> [  156.577113] CPU: 3 PID: 1311 Comm: page_fault1_pro Tainted: G
> D           5.0.0-rc7-next-20190219-baseline+ #50
> [  156.577113] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  156.577114] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
> [  156.577114] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
> 44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
> c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
> 75 3e
> [  156.577115] RSP: 0000:ffffb9a8c407fc10 EFLAGS: 00000002
> [  156.577115] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
> [  156.577116] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
> [  156.577116] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
> [  156.577121] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
> [  156.577122] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
> [  156.577122] FS:  00007ffff7fec440(0000) GS:ffffa000df8c0000(0000)
> knlGS:0000000000000000
> [  156.577122] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  156.577123] CR2: 00007ffff398a000 CR3: 000000081aa00003 CR4: 0000000000160ee0
> [  156.577123] Call Trace:
> [  156.577123]  get_page_from_freelist+0x50f/0x1280
> [  156.577124]  ? get_page_from_freelist+0xa44/0x1280
> [  156.577124]  ? try_charge+0x637/0x860
> [  156.577124]  __alloc_pages_nodemask+0x141/0x2e0
> [  156.577125]  alloc_pages_vma+0x73/0x180
> [  156.577125]  __handle_mm_fault+0xd59/0x14e0
> [  156.577125]  handle_mm_fault+0xfa/0x210
> [  156.577126]  __do_page_fault+0x207/0x4c0
> [  156.577126]  do_page_fault+0x32/0x140
> [  156.577126]  ? async_page_fault+0x8/0x30
> [  156.577127]  async_page_fault+0x1e/0x30
> [  156.577127] RIP: 0033:0x401840
> [  156.577128] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
> 00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
> 48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
> ff ff
> [  156.577128] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
> [  156.577129] RAX: 00007fffeee48000 RBX: 00007ffff7ff7180 RCX: 0000000004b42000
> [  156.577129] RDX: 00007ffff398a000 RSI: 0000000008000000 RDI: 0000000000000000
> [  156.577130] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
> [  156.577130] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
> [  156.577130] R13: 0000000000000000 R14: 000000000060db00 R15: 0000000000000005
>
> -- After the above two it starts spitting this one out every 10 - 30
> seconds or so --
> [  183.788386] watchdog: BUG: soft lockup - CPU#14 stuck for 23s!
> [kworker/14:1:121]
> [  183.790003] Modules linked in: ip6t_rpfilter ip6t_REJECT
> nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
> ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle
> ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
> iptable_security ebtable_filter ebtables ip6table_filter ip6_tables
> sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> kvm_intel kvm ppdev irqbypass parport_pc joydev virtio_balloon
> pcc_cpufreq i2c_piix4 pcspkr parport xfs libcrc32c cirrus
> drm_kms_helper ttm drm e1000 crc32c_intel virtio_blk ata_generic
> floppy serio_raw pata_acpi qemu_fw_cfg
> [  183.799984] CPU: 14 PID: 121 Comm: kworker/14:1 Tainted: G      D
>         5.0.0-rc7-next-20190219-baseline+ #50
> [  183.801674] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  183.803078] Workqueue: events netstamp_clear
> [  183.803873] RIP: 0010:smp_call_function_many+0x206/0x260
> [  183.804847] Code: e8 0f 97 7c 00 3b 05 bd d1 1e 01 0f 83 7c fe ff
> ff 48 63 d0 48 8b 4d 00 48 03 0c d5 80 28 18 9e 8b 51 18 83 e2 01 74
> 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 0f b6 4c 24 0c 48 83 c4 10 89
> ef 5b
> [  183.808273] RSP: 0018:ffffb9a8c35a3d38 EFLAGS: 00000202 ORIG_RAX:
> ffffffffffffff13
> [  183.809662] RAX: 0000000000000000 RBX: ffffa000dfba9d88 RCX: ffffa000df8301c0
> [  183.810971] RDX: 0000000000000001 RSI: 0000000000000100 RDI: ffffa000dfba9d88
> [  183.812268] RBP: ffffa000dfba9d80 R08: 0000000000000000 R09: 0000000000003fff
> [  183.813582] R10: 0000000000000000 R11: 000000000000000f R12: ffffffff9d02f690
> [  183.814884] R13: 0000000000000000 R14: ffffa000dfba9da8 R15: 0000000000000100
> [  183.816195] FS:  0000000000000000(0000) GS:ffffa000dfb80000(0000)
> knlGS:0000000000000000
> [  183.817673] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  183.818729] CR2: 00007ffff704b080 CR3: 0000000814c48001 CR4: 0000000000160ee0
> [  183.820038] Call Trace:
> [  183.820510]  ? netif_receive_skb_list+0x68/0x4a0
> [  183.821367]  ? poke_int3_handler+0x40/0x40
> [  183.822126]  ? netif_receive_skb_list+0x69/0x4a0
> [  183.822975]  on_each_cpu+0x28/0x60
> [  183.823611]  ? netif_receive_skb_list+0x68/0x4a0
> [  183.824467]  text_poke_bp+0x68/0xe0
> [  183.825126]  ? netif_receive_skb_list+0x68/0x4a0
> [  183.825983]  __jump_label_transform+0x101/0x140
> [  183.826829]  arch_jump_label_transform+0x26/0x40
> [  183.827687]  __jump_label_update+0x56/0xc0
> [  183.828456]  static_key_enable_cpuslocked+0x57/0x80
> [  183.829358]  static_key_enable+0x16/0x20
> [  183.830085]  process_one_work+0x16c/0x380
> [  183.830831]  worker_thread+0x49/0x3e0
> [  183.831516]  kthread+0xf8/0x130
> [  183.832106]  ? rescuer_thread+0x340/0x340
> [  183.832848]  ? kthread_bind+0x10/0x10
> [  183.833532]  ret_from_fork+0x35/0x40
-- 
Regards
Nitesh





Thread overview: 116+ messages
2019-02-04 20:18 [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Nitesh Narayan Lal
2019-02-04 20:18 ` [RFC][Patch v8 1/7] KVM: Support for guest free page hinting Nitesh Narayan Lal
2019-02-05  4:14   ` Michael S. Tsirkin
2019-02-05 13:06     ` Nitesh Narayan Lal
2019-02-05 16:27       ` Michael S. Tsirkin
2019-02-05 16:34         ` Nitesh Narayan Lal
2019-02-04 20:18 ` [RFC][Patch v8 2/7] KVM: Enabling guest free page hinting via static key Nitesh Narayan Lal
2019-02-08 18:07   ` Alexander Duyck
2019-02-08 18:22     ` Nitesh Narayan Lal
2019-02-04 20:18 ` [RFC][Patch v8 3/7] KVM: Guest free page hinting functional skeleton Nitesh Narayan Lal
2019-02-04 20:18 ` [RFC][Patch v8 4/7] KVM: Disabling page poisoning to prevent corruption Nitesh Narayan Lal
2019-02-07 17:23   ` Alexander Duyck
2019-02-07 17:56     ` Nitesh Narayan Lal
2019-02-07 18:24       ` Alexander Duyck
2019-02-07 19:14         ` Michael S. Tsirkin
2019-02-07 21:08   ` Michael S. Tsirkin
2019-02-04 20:18 ` [RFC][Patch v8 5/7] virtio: Enables to add a single descriptor to the host Nitesh Narayan Lal
2019-02-05 20:49   ` Michael S. Tsirkin
2019-02-06 12:56     ` Nitesh Narayan Lal
2019-02-06 13:15       ` Luiz Capitulino
2019-02-06 13:24         ` Nitesh Narayan Lal
2019-02-06 13:29           ` Luiz Capitulino
2019-02-06 14:05             ` Nitesh Narayan Lal
2019-02-06 18:03       ` Michael S. Tsirkin
2019-02-06 18:19         ` Nitesh Narayan Lal
2019-02-04 20:18 ` [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages Nitesh Narayan Lal
2019-02-05 20:45   ` Michael S. Tsirkin
2019-02-05 21:54     ` Nitesh Narayan Lal
2019-02-05 21:55       ` Michael S. Tsirkin
2019-02-07 17:43         ` Alexander Duyck
2019-02-07 19:01           ` Michael S. Tsirkin
2019-02-07 20:50           ` Nitesh Narayan Lal
2019-02-08 17:58             ` Alexander Duyck
2019-02-08 20:41               ` Nitesh Narayan Lal
2019-02-08 21:38                 ` Michael S. Tsirkin
2019-02-08 22:05                   ` Alexander Duyck
2019-02-10  0:38                     ` Michael S. Tsirkin
2019-02-11  9:28                       ` David Hildenbrand
2019-02-12  5:16                         ` Michael S. Tsirkin
2019-02-12 17:10                       ` Nitesh Narayan Lal
2019-02-08 21:35               ` Michael S. Tsirkin
2019-02-04 20:18 ` [RFC][Patch v8 7/7] KVM: Adding tracepoints for guest page hinting Nitesh Narayan Lal
2019-02-04 20:20 ` [RFC][QEMU PATCH] KVM: Support for guest free " Nitesh Narayan Lal
2019-02-12  9:03 ` [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Wang, Wei W
2019-02-12  9:24   ` David Hildenbrand
2019-02-12 17:24     ` Nitesh Narayan Lal
2019-02-12 19:34       ` David Hildenbrand
2019-02-13  8:55     ` Wang, Wei W
2019-02-13  9:19       ` David Hildenbrand
2019-02-13 12:17         ` Nitesh Narayan Lal
2019-02-13 17:09           ` Michael S. Tsirkin
2019-02-13 17:22             ` Nitesh Narayan Lal
     [not found]               ` <286AC319A985734F985F78AFA26841F73DF6F1C3@shsmsx102.ccr.corp.intel.com>
2019-02-14  9:34                 ` David Hildenbrand
2019-02-13 17:16         ` Michael S. Tsirkin
2019-02-13 17:59           ` David Hildenbrand
2019-02-13 19:08             ` Michael S. Tsirkin
2019-02-14  9:08         ` Wang, Wei W
2019-02-14 10:00           ` David Hildenbrand
2019-02-14 10:44             ` David Hildenbrand
2019-02-15  9:15             ` Wang, Wei W
2019-02-15  9:33               ` David Hildenbrand
2019-02-13  9:00 ` Wang, Wei W
2019-02-13 12:06   ` Nitesh Narayan Lal
2019-02-14  8:48     ` Wang, Wei W
2019-02-14  9:42       ` David Hildenbrand
2019-02-15  9:05         ` Wang, Wei W
2019-02-15  9:41           ` David Hildenbrand
2019-02-18  2:36             ` Wei Wang
2019-02-18  2:39               ` Wei Wang
2019-02-15 12:40           ` Nitesh Narayan Lal
2019-02-14 13:00       ` Nitesh Narayan Lal
2019-02-16  9:40 ` David Hildenbrand
2019-02-18 15:50   ` Nitesh Narayan Lal
2019-02-18 16:02     ` David Hildenbrand
2019-02-18 16:49   ` Michael S. Tsirkin
2019-02-18 16:59     ` David Hildenbrand
2019-02-18 17:31       ` Alexander Duyck
2019-02-18 17:41         ` David Hildenbrand
2019-02-18 23:47           ` Alexander Duyck
2019-02-19  2:45             ` Michael S. Tsirkin
2019-02-19  2:46             ` Andrea Arcangeli
2019-02-19 12:52               ` Nitesh Narayan Lal
2019-02-19 16:23               ` Alexander Duyck
2019-02-19  8:06             ` David Hildenbrand
2019-02-19 14:40               ` Michael S. Tsirkin
2019-02-19 14:44                 ` David Hildenbrand
2019-02-19 14:45                   ` David Hildenbrand
2019-02-18 18:01         ` Michael S. Tsirkin
2019-02-18 17:54       ` Michael S. Tsirkin
2019-02-18 18:29         ` David Hildenbrand
2019-02-18 19:16           ` Michael S. Tsirkin
2019-02-18 19:35             ` David Hildenbrand
2019-02-18 19:47               ` Michael S. Tsirkin
2019-02-18 20:04                 ` David Hildenbrand
2019-02-18 20:31                   ` Michael S. Tsirkin
2019-02-18 20:40                     ` Nitesh Narayan Lal
2019-02-18 21:04                       ` David Hildenbrand
2019-02-19  0:01                         ` Alexander Duyck
2019-02-19  7:54                           ` David Hildenbrand
2019-02-19 18:06                             ` Alexander Duyck
2019-02-19 18:31                               ` David Hildenbrand
2019-02-19 21:57                                 ` Alexander Duyck
2019-02-19 22:17                                   ` Michael S. Tsirkin
2019-02-19 22:36                                   ` David Hildenbrand
2019-02-19 19:58                               ` Michael S. Tsirkin
2019-02-19 20:02                                 ` David Hildenbrand
2019-02-19 20:17                                   ` Michael S. Tsirkin
2019-02-19 20:21                                     ` David Hildenbrand
2019-02-19 20:35                                       ` Michael S. Tsirkin
2019-02-19 12:47                         ` Nitesh Narayan Lal
2019-02-19 13:03                           ` David Hildenbrand
2019-02-19 14:17                             ` Nitesh Narayan Lal
2019-02-19 14:21                               ` David Hildenbrand
2019-02-18 20:53                     ` David Hildenbrand
2019-02-23  0:02 ` Alexander Duyck
2019-02-25 13:01   ` Nitesh Narayan Lal
