* [Patch v4 0/6] KVM: Guest page hinting
@ 2017-11-03 20:30 nilal
  2017-11-03 20:30 ` [Patch v4 1/6] KVM: Support for guest " nilal
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: nilal @ 2017-11-03 20:30 UTC (permalink / raw)
  To: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk

The following patch set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables guests with DAX (no page cache) to rapidly free memory to the host and to reclaim it back when needed.
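
The overall flow, as implemented by the patches below (function names match
the individual patches):

	guest: arch_free_page()                records freed PFNs in a per-CPU list
	         arch_free_page_slowpath()     list full: scan and merge ranges
	           hyperlist_ready()           threshold crossed
	             request_hypercall()       kick the host via hinting_vq
	host:  virtio_balloon_page_hinting()   QEMU pops the hinted address
	         page_hinting_request()        gpa -> hva, qemu_madvise(MADV_FREE)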

Changelog in v4:
	- Fixed undefined references to arch_free_page and arch_alloc_page when KVM is built as a module
	- Batched guest free pages can now be transmitted to the host/QEMU with a single kick
	- Guest page hinting support can now be enabled or disabled at run time through a sysctl variable
	- Completed basic functional testing of this support

Virtio interface changes are picked up from Wei's patch set for virtio-balloon enhancement [3]. (Wei, how would you like me to credit you in the final patch?)

Performance:
	Test criteria: Kernel Build
	Command: make clean;make defconfig;time make
	With Hinting:
		real: 21m24.680s
		user: 16m3.362s
		sys : 2m19.027s
	Without Hinting:
		real: 21m18.062s
		user: 16m13.969s
		sys : 1m17.884s

	Test criteria: Stress Test
	Command: time stress --io 2 --cpu 2 --vm 2 --vm-bytes 1024M --timeout 100s -v
	With Hinting:
		real: 1m40.726s
		user: 1m23.449s
		sys : 0m5.576s
	Without Hinting:
		real: 1m40.378s
		user: 1m21.292s
		sys : 0m4.972s

[1] http://www.spinics.net/lists/kvm/msg153666.html
[2] https://www.spinics.net/lists/kvm/msg155471.html
[3] http://www.spinics.net/lists/kvm/msg152734.html
[4] https://www.spinics.net/lists/kvm/msg156658.html

* [Patch v4 1/6] KVM: Support for guest page hinting
  2017-11-03 20:30 [Patch v4 0/6] KVM: Guest page hinting nilal
@ 2017-11-03 20:30 ` nilal
  2017-11-13 17:59   ` Michael S. Tsirkin
  2017-11-03 20:30 ` [Patch v4 2/6] KVM: Guest page hinting functionality nilal
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: nilal @ 2017-11-03 20:30 UTC (permalink / raw)
  To: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk

From: Nitesh Narayan Lal <nilal@redhat.com>

This patch includes the following:
1. The basic skeleton for the support
2. Enablement of the x86 platform to use it

Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
---
 arch/x86/Kbuild         |  2 +-
 arch/x86/kvm/Makefile   |  2 ++
 include/linux/gfp.h     |  7 +++++++
 virt/kvm/Kconfig        |  4 ++++
 virt/kvm/page_hinting.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 60 insertions(+), 1 deletion(-)
 create mode 100644 virt/kvm/page_hinting.c

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index 0038a2d..7d39d7d 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -2,7 +2,7 @@ obj-y += entry/
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
-obj-$(CONFIG_KVM) += kvm/
+obj-$(subst m,y,$(CONFIG_KVM)) += kvm/
 
 # Xen paravirtualization support
 obj-$(CONFIG_XEN) += xen/
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 09d4b17..d8a3800 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -15,6 +15,8 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
 			   hyperv.o page_track.o debugfs.o
 
+obj-$(CONFIG_KVM_FREE_PAGE_HINTING)	+= $(KVM)/page_hinting.o
+
 kvm-intel-y		+= vmx.o pmu_intel.o
 kvm-amd-y		+= svm.o pmu_amd.o
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f780718..a74371f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -452,6 +452,13 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
 }
 
+#ifdef	CONFIG_KVM_FREE_PAGE_HINTING
+#define HAVE_ARCH_ALLOC_PAGE
+#define HAVE_ARCH_FREE_PAGE
+void arch_free_page(struct page *page, int order);
+void arch_alloc_page(struct page *page, int order);
+#endif
+
 #ifndef HAVE_ARCH_FREE_PAGE
 static inline void arch_free_page(struct page *page, int order) { }
 #endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index b0cc1a3..936c71d 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -50,3 +50,7 @@ config KVM_COMPAT
 
 config HAVE_KVM_IRQ_BYPASS
        bool
+
+config KVM_FREE_PAGE_HINTING
+       def_bool y
+       depends on KVM
diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
new file mode 100644
index 0000000..39d2b1d
--- /dev/null
+++ b/virt/kvm/page_hinting.c
@@ -0,0 +1,46 @@
+#include <linux/gfp.h>
+#include <linux/mm.h>
+#include <linux/page_ref.h>
+#include <linux/kvm_host.h>
+#include <linux/sort.h>
+
+#include <trace/events/kvm.h>
+
+#define MAX_FGPT_ENTRIES	1000
+#define HYPERLIST_THRESHOLD	500
+/*
+ * struct kvm_free_pages - Tracks the pages which are freed by the guest.
+ * @pfn	- page frame number for the page which is to be freed
+ * @pages - number of pages which are supposed to be freed.
+ * A global array object is used to hold the list of pfn and number of pages
+ * which are freed by the guest. This list may also have fragmented pages so
+ * defragmentation is a must prior to the hypercall.
+ */
+struct kvm_free_pages {
+	unsigned long pfn;
+	unsigned int pages;
+};
+
+/*
+ * hypervisor_pages - It is a dummy structure passed with the hypercall.
+ * @pfn - page frame number for the page which is to be freed.
+ * @pages - number of pages which are supposed to be freed.
+ * A global array object is used to hold the list of pfn and pages and is
+ * passed as part of the hypercall.
+ */
+struct hypervisor_pages {
+	unsigned long pfn;
+	unsigned int pages;
+};
+
+DEFINE_PER_CPU(struct kvm_free_pages [MAX_FGPT_ENTRIES], kvm_pt);
+DEFINE_PER_CPU(int, kvm_pt_idx);
+struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
+
+void arch_alloc_page(struct page *page, int order)
+{
+}
+
+void arch_free_page(struct page *page, int order)
+{
+}
-- 
2.9.4
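
The gfp.h change relies on a common kernel pattern: the header provides no-op
inline stubs unless HAVE_ARCH_ALLOC_PAGE/HAVE_ARCH_FREE_PAGE are defined, in
which case the real hooks are linked in and invoked from the page allocator's
alloc/free paths. A minimal userspace sketch of that pattern (illustrative
only, not kernel code):

	#include <stdio.h>

	/* Mirrors include/linux/gfp.h: defining HAVE_ARCH_FREE_PAGE replaces
	 * the inline no-op stub with a real, externally linked hook. */
	#define HAVE_ARCH_FREE_PAGE

	#ifdef HAVE_ARCH_FREE_PAGE
	void arch_free_page(void *page, int order);
	#else
	static inline void arch_free_page(void *page, int order) { }
	#endif

	/* The "arch" (here, the hinting code) supplies the implementation. */
	void arch_free_page(void *page, int order)
	{
		printf("hint: page %p freed, order %d\n", page, order);
	}

	int main(void)
	{
		char page[4096];

		arch_free_page(page, 0);  /* the allocator's free path calls this */
		return 0;
	}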

* [Patch v4 2/6] KVM: Guest page hinting functionality
  2017-11-03 20:30 [Patch v4 0/6] KVM: Guest page hinting nilal
  2017-11-03 20:30 ` [Patch v4 1/6] KVM: Support for guest " nilal
@ 2017-11-03 20:30 ` nilal
  2017-11-13 18:03   ` Michael S. Tsirkin
  2017-11-03 20:30 ` [Patch v4 3/6] KVM: Adding tracepoints for guest page hinting nilal
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: nilal @ 2017-11-03 20:30 UTC (permalink / raw)
  To: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk

From: Nitesh Narayan Lal <niteshnarayanlalleo@gmail.com>

This patch adds the guest implementation that maintains the list of pages
which are freed by the guest and are not reused. To avoid reallocation while
the list is being processed, a seqlock is taken once the list is completely
filled. It does not carry the hypercall-related changes.

Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
---
 virt/kvm/page_hinting.c | 245 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 243 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
index 39d2b1d..658856d 100644
--- a/virt/kvm/page_hinting.c
+++ b/virt/kvm/page_hinting.c
@@ -3,8 +3,7 @@
 #include <linux/page_ref.h>
 #include <linux/kvm_host.h>
 #include <linux/sort.h>
-
-#include <trace/events/kvm.h>
+#include <linux/kernel.h>
 
 #define MAX_FGPT_ENTRIES	1000
 #define HYPERLIST_THRESHOLD	500
@@ -33,14 +32,256 @@ struct hypervisor_pages {
 	unsigned int pages;
 };
 
+static __cacheline_aligned_in_smp DEFINE_SEQLOCK(guest_page_lock);
 DEFINE_PER_CPU(struct kvm_free_pages [MAX_FGPT_ENTRIES], kvm_pt);
 DEFINE_PER_CPU(int, kvm_pt_idx);
 struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
 
+static void empty_hyperlist(void)
+{
+	int i = 0;
+
+	while (i < MAX_FGPT_ENTRIES) {
+		hypervisor_pagelist[i].pfn = 0;
+		hypervisor_pagelist[i].pages = 0;
+		i++;
+	}
+}
+
+static void make_hypercall(void)
+{
+	/*
+	 * Dummy function: To be filled in later.
+	 */
+	empty_hyperlist();
+}
+
+static int sort_pfn(const void *a1, const void *b1)
+{
+	const struct hypervisor_pages *a = a1;
+	const struct hypervisor_pages *b = b1;
+
+	if (a->pfn > b->pfn)
+		return 1;
+
+	if (a->pfn < b->pfn)
+		return -1;
+
+	return 0;
+}
+
+static int pack_hyperlist(void)
+{
+	int i = 0, j = 0;
+
+	while (i < MAX_FGPT_ENTRIES) {
+		if (hypervisor_pagelist[i].pfn != 0) {
+			if (i != j) {
+				hypervisor_pagelist[j].pfn =
+						hypervisor_pagelist[i].pfn;
+				hypervisor_pagelist[j].pages =
+						hypervisor_pagelist[i].pages;
+			}
+			j++;
+		}
+		i++;
+	}
+	i = j;
+	while (j < MAX_FGPT_ENTRIES) {
+		hypervisor_pagelist[j].pfn = 0;
+		hypervisor_pagelist[j].pages = 0;
+		j++;
+	}
+	return i;
+}
+
+int compress_hyperlist(void)
+{
+	int i = 0, j = 1, merge_counter = 0, ret = 0;
+
+	sort(hypervisor_pagelist, MAX_FGPT_ENTRIES,
+	     sizeof(struct hypervisor_pages), sort_pfn, NULL);
+	while (i < MAX_FGPT_ENTRIES && j < MAX_FGPT_ENTRIES) {
+		unsigned long pfni = hypervisor_pagelist[i].pfn;
+		unsigned int pagesi = hypervisor_pagelist[i].pages;
+		unsigned long pfnj = hypervisor_pagelist[j].pfn;
+		unsigned int pagesj = hypervisor_pagelist[j].pages;
+
+		if (pfnj <= pfni) {
+			if (((pfnj + pagesj - 1) <= (pfni + pagesi - 1)) &&
+			    ((pfnj + pagesj - 1) >= (pfni - 1))) {
+				hypervisor_pagelist[i].pfn = pfnj;
+				hypervisor_pagelist[i].pages += pfni - pfnj;
+				hypervisor_pagelist[j].pfn = 0;
+				hypervisor_pagelist[j].pages = 0;
+				j++;
+				merge_counter++;
+				continue;
+			} else if ((pfnj + pagesj - 1) > (pfni + pagesi - 1)) {
+				hypervisor_pagelist[i].pfn = pfnj;
+				hypervisor_pagelist[i].pages = pagesj;
+				hypervisor_pagelist[j].pfn = 0;
+				hypervisor_pagelist[j].pages = 0;
+				j++;
+				merge_counter++;
+				continue;
+			}
+		} else if (pfnj > pfni) {
+			if ((pfnj + pagesj - 1) > (pfni + pagesi - 1) &&
+			    (pfnj <= pfni + pagesi)) {
+				hypervisor_pagelist[i].pages +=
+						(pfnj + pagesj - 1) -
+						(pfni + pagesi - 1);
+				hypervisor_pagelist[j].pfn = 0;
+				hypervisor_pagelist[j].pages = 0;
+				j++;
+				merge_counter++;
+				continue;
+			} else if ((pfnj + pagesj - 1) <= (pfni + pagesi - 1)) {
+				hypervisor_pagelist[j].pfn = 0;
+				hypervisor_pagelist[j].pages = 0;
+				j++;
+				merge_counter++;
+				continue;
+			}
+		}
+		i = j;
+		j++;
+	}
+	if (merge_counter != 0)
+		ret = pack_hyperlist() - 1;
+	else
+		ret = MAX_FGPT_ENTRIES - 1;
+	return ret;
+}
+
+void copy_hyperlist(int hyper_idx)
+{
+	int *idx = &get_cpu_var(kvm_pt_idx);
+	struct kvm_free_pages *free_page_obj;
+	int i = 0;
+
+	free_page_obj = &get_cpu_var(kvm_pt)[0];
+	while (i < hyper_idx) {
+		free_page_obj[*idx].pfn = hypervisor_pagelist[i].pfn;
+		free_page_obj[*idx].pages = hypervisor_pagelist[i].pages;
+		*idx += 1;
+		i++;
+	}
+	empty_hyperlist();
+	put_cpu_var(kvm_pt);
+	put_cpu_var(kvm_pt_idx);
+}
+
+/*
+ * arch_free_page_slowpath() - This function adds the guest free page entries
+ * to hypervisor_pages list and also ensures defragmentation prior to addition
+ * if it is present with any entry of the kvm_free_pages list.
+ */
+void arch_free_page_slowpath(void)
+{
+	int idx = 0;
+	int hyper_idx = -1;
+	int *kvm_idx = &get_cpu_var(kvm_pt_idx);
+	struct kvm_free_pages *free_page_obj = &get_cpu_var(kvm_pt)[0];
+
+	write_seqlock(&guest_page_lock);
+	while (idx < MAX_FGPT_ENTRIES) {
+		unsigned long pfn = free_page_obj[idx].pfn;
+		unsigned long pfn_end = free_page_obj[idx].pfn +
+					free_page_obj[idx].pages - 1;
+		bool prev_free = false;
+
+		while (pfn <= pfn_end) {
+			struct page *p = pfn_to_page(pfn);
+
+			if (PageCompound(p)) {
+				struct page *head_page = compound_head(p);
+				unsigned long head_pfn = page_to_pfn(head_page);
+				unsigned int alloc_pages =
+					1 << compound_order(head_page);
+
+				pfn = head_pfn + alloc_pages;
+				prev_free = false;
+				continue;
+			}
+			if (page_ref_count(p)) {
+				pfn++;
+				prev_free = false;
+				continue;
+			}
+			/*
+			 * The page is free so add it to the list and free the
+			 * hypervisor_pagelist if required.
+			 */
+			if (!prev_free) {
+				hyper_idx++;
+				hypervisor_pagelist[hyper_idx].pfn = pfn;
+				hypervisor_pagelist[hyper_idx].pages = 1;
+				if (hyper_idx == MAX_FGPT_ENTRIES - 1) {
+					hyper_idx =  compress_hyperlist();
+					if (hyper_idx >=
+					    HYPERLIST_THRESHOLD) {
+						make_hypercall();
+						hyper_idx = 0;
+					}
+				}
+				/*
+				 * If the next contiguous page is free, it can
+				 * be added to this same entry.
+				 */
+				prev_free = true;
+			} else {
+				/*
+				 * Multiple adjacent free pages
+				 */
+				hypervisor_pagelist[hyper_idx].pages++;
+			}
+			pfn++;
+		}
+		free_page_obj[idx].pfn = 0;
+		free_page_obj[idx].pages = 0;
+		idx++;
+	}
+	*kvm_idx = 0;
+	put_cpu_var(kvm_pt);
+	put_cpu_var(kvm_pt_idx);
+	write_sequnlock(&guest_page_lock);
+}
+
 void arch_alloc_page(struct page *page, int order)
 {
+	unsigned int seq;
+
+	/*
+	 * arch_free_page will acquire the lock once the list carrying guest
+	 * free pages is full and a hypercall will be made. Until complete free
+	 * page list is traversed, no further allocation will be allowed.
+	 */
+	do {
+		seq = read_seqbegin(&guest_page_lock);
+	} while (read_seqretry(&guest_page_lock, seq));
 }
 
 void arch_free_page(struct page *page, int order)
 {
+	int *free_page_idx = &get_cpu_var(kvm_pt_idx);
+	struct kvm_free_pages *free_page_obj;
+	unsigned long flags;
+
+	/*
+	 * use of global variables may trigger a race condition between irq and
+	 * process context causing unwanted overwrites. This will be replaced
+	 * with a better solution to prevent such race conditions.
+	 */
+	local_irq_save(flags);
+	free_page_obj = &get_cpu_var(kvm_pt)[0];
+	free_page_obj[*free_page_idx].pfn = page_to_pfn(page);
+	free_page_obj[*free_page_idx].pages = 1 << order;
+	*free_page_idx += 1;
+	if (*free_page_idx == MAX_FGPT_ENTRIES)
+		arch_free_page_slowpath();
+	put_cpu_var(kvm_pt);
+	put_cpu_var(kvm_pt_idx);
+	local_irq_restore(flags);
 }
-- 
2.9.4
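
To see what compress_hyperlist() is doing, consider entries recorded out of
order with adjacent ranges, e.g. the PFN ranges [12,15), [40,41) and [10,12):
after sorting by PFN and merging overlapping or abutting neighbours, they
collapse to [10,15) and [40,41). A compact userspace rendering of that merge
step (simplified; the in-kernel version packs entries in place and treats
pfn == 0 as an empty slot):

	#include <stdio.h>
	#include <stdlib.h>

	struct range { unsigned long pfn; unsigned int pages; };

	static int cmp_pfn(const void *a, const void *b)
	{
		const struct range *x = a, *y = b;

		return (x->pfn > y->pfn) - (x->pfn < y->pfn);
	}

	int main(void)
	{
		/* Entries as arch_free_page() might record them: out of
		 * order, with [12,15) abutting [10,12). */
		struct range r[] = { {12, 3}, {40, 1}, {10, 2} };
		int n = sizeof(r) / sizeof(r[0]), i, j = 0;

		qsort(r, n, sizeof(r[0]), cmp_pfn);
		for (i = 1; i < n; i++) {
			if (r[i].pfn <= r[j].pfn + r[j].pages) {
				/* Overlapping or abutting: extend entry j. */
				unsigned long end = r[i].pfn + r[i].pages;

				if (end > r[j].pfn + r[j].pages)
					r[j].pages = end - r[j].pfn;
			} else {
				r[++j] = r[i];
			}
		}
		for (i = 0; i <= j; i++)
			printf("pfn=%lu pages=%u\n", r[i].pfn, r[i].pages);
		/* Output: pfn=10 pages=5, then pfn=40 pages=1. */
		return 0;
	}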

* [Patch v4 3/6] KVM: Adding tracepoints for guest page hinting
  2017-11-03 20:30 [Patch v4 0/6] KVM: Guest page hinting nilal
  2017-11-03 20:30 ` [Patch v4 1/6] KVM: Support for guest " nilal
  2017-11-03 20:30 ` [Patch v4 2/6] KVM: Guest page hinting functionality nilal
@ 2017-11-03 20:30 ` nilal
  2017-11-03 20:30 ` [Patch v4 4/6] virtio: Exposes added descriptor to the other side synchronously nilal
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: nilal @ 2017-11-03 20:30 UTC (permalink / raw)
  To: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk

From: Nitesh Narayan Lal <nilal@redhat.com>

Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
---
 include/trace/events/kmem.h | 101 ++++++++++++++++++++++++++++++++++++++++++++
 virt/kvm/page_hinting.c     |  20 ++++++++-
 2 files changed, 119 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 6b2e154..ec97791 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -317,6 +317,107 @@ TRACE_EVENT(mm_page_alloc_extfrag,
 		__entry->change_ownership)
 );
 
+TRACE_EVENT(guest_free_page,
+	    TP_PROTO(struct page *page, unsigned int order),
+
+	TP_ARGS(page, order),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(unsigned int, order)
+	),
+
+	TP_fast_assign(
+		__entry->pfn            = page_to_pfn(page);
+		__entry->order          = order;
+	),
+
+	TP_printk("page=%p pfn=%lu number of pages=%d",
+		  pfn_to_page(__entry->pfn),
+		  __entry->pfn,
+		  (1 << __entry->order))
+);
+
+TRACE_EVENT(guest_alloc_page,
+	    TP_PROTO(struct page *page, unsigned int order),
+
+	TP_ARGS(page, order),
+
+	TP_STRUCT__entry(
+		__field(unsigned long,  pfn)
+		__field(unsigned int,   order)
+	),
+
+	TP_fast_assign(
+		__entry->pfn            = page_to_pfn(page);
+		__entry->order          = order;
+		),
+
+	TP_printk("page=%p pfn=%lu number of pages=%d",
+		  pfn_to_page(__entry->pfn),
+		  __entry->pfn,
+		  (1 << __entry->order))
+);
+
+TRACE_EVENT(guest_free_page_slowpath,
+	    TP_PROTO(unsigned long pfn, unsigned int pages),
+
+	TP_ARGS(pfn, pages),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(unsigned int, pages)
+	),
+
+	TP_fast_assign(
+		__entry->pfn            = pfn;
+		__entry->pages          = pages;
+	),
+
+	TP_printk("pfn=%lu number of pages=%u",
+		  __entry->pfn,
+		  __entry->pages)
+);
+
+TRACE_EVENT(guest_pfn_dump,
+	    TP_PROTO(char *type, unsigned long pfn, unsigned int pages),
+
+	TP_ARGS(type, pfn, pages),
+
+	TP_STRUCT__entry(
+		__field(char *, type)
+		__field(unsigned long, pfn)
+		__field(unsigned int, pages)
+	),
+
+	TP_fast_assign(
+		__entry->type		= type;
+		__entry->pfn            = pfn;
+		__entry->pages          = pages;
+		),
+
+	TP_printk("Type=%s pfn=%lu number of pages=%d",
+		  __entry->type,
+		  __entry->pfn,
+		  __entry->pages)
+);
+
+TRACE_EVENT(guest_str_dump,
+	    TP_PROTO(char *str),
+
+	TP_ARGS(str),
+
+	TP_STRUCT__entry(
+		__field(char *, str)
+	),
+
+	TP_fast_assign(
+		__entry->str		= str;
+		),
+
+	TP_printk("Debug=%s",
+		  __entry->str)
+);
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
index 658856d..54fe6bc 100644
--- a/virt/kvm/page_hinting.c
+++ b/virt/kvm/page_hinting.c
@@ -4,6 +4,7 @@
 #include <linux/kvm_host.h>
 #include <linux/sort.h>
 #include <linux/kernel.h>
+#include <trace/events/kmem.h>
 
 #define MAX_FGPT_ENTRIES	1000
 #define HYPERLIST_THRESHOLD	500
@@ -48,12 +49,13 @@ static void empty_hyperlist(void)
 	}
 }
 
-static void make_hypercall(void)
+void make_hypercall(void)
 {
 	/*
 	 * Dummy function: To be filled in later.
 	 */
 	empty_hyperlist();
+	trace_guest_str_dump("Hypercall to host...:");
 }
 
 static int sort_pfn(const void *a1, const void *b1)
@@ -70,13 +72,16 @@ static int sort_pfn(const void *a1, const void *b1)
 	return 0;
 }
 
-static int pack_hyperlist(void)
+int pack_hyperlist(void)
 {
 	int i = 0, j = 0;
 
 	while (i < MAX_FGPT_ENTRIES) {
 		if (hypervisor_pagelist[i].pfn != 0) {
 			if (i != j) {
+				trace_guest_pfn_dump("Packing Hyperlist",
+						     hypervisor_pagelist[i].pfn,
+						hypervisor_pagelist[i].pages);
 				hypervisor_pagelist[j].pfn =
 						hypervisor_pagelist[i].pfn;
 				hypervisor_pagelist[j].pages =
@@ -163,6 +168,9 @@ void copy_hyperlist(int hyper_idx)
 
 	free_page_obj = &get_cpu_var(kvm_pt)[0];
 	while (i < hyper_idx) {
+		trace_guest_pfn_dump("HyperList entry copied",
+				     hypervisor_pagelist[i].pfn,
+				     hypervisor_pagelist[i].pages);
 		free_page_obj[*idx].pfn = hypervisor_pagelist[i].pfn;
 		free_page_obj[*idx].pages = hypervisor_pagelist[i].pages;
 		*idx += 1;
@@ -203,11 +211,14 @@ void arch_free_page_slowpath(void)
 
 				pfn = head_pfn + alloc_pages;
 				prev_free = false;
+				trace_guest_pfn_dump("Compound",
+						     head_pfn, alloc_pages);
 				continue;
 			}
 			if (page_ref_count(p)) {
 				pfn++;
 				prev_free = false;
+				trace_guest_pfn_dump("Single", pfn, 1);
 				continue;
 			}
 			/*
@@ -218,6 +229,9 @@ void arch_free_page_slowpath(void)
 				hyper_idx++;
 				hypervisor_pagelist[hyper_idx].pfn = pfn;
 				hypervisor_pagelist[hyper_idx].pages = 1;
+				trace_guest_free_page_slowpath(
+				hypervisor_pagelist[hyper_idx].pfn,
+				hypervisor_pagelist[hyper_idx].pages);
 				if (hyper_idx == MAX_FGPT_ENTRIES - 1) {
 					hyper_idx =  compress_hyperlist();
 					if (hyper_idx >=
@@ -261,6 +275,7 @@ void arch_alloc_page(struct page *page, int order)
 	do {
 		seq = read_seqbegin(&guest_page_lock);
 	} while (read_seqretry(&guest_page_lock, seq));
+	trace_guest_alloc_page(page, order);
 }
 
 void arch_free_page(struct page *page, int order)
@@ -276,6 +291,7 @@ void arch_free_page(struct page *page, int order)
 	 */
 	local_irq_save(flags);
 	free_page_obj = &get_cpu_var(kvm_pt)[0];
+	trace_guest_free_page(page, order);
 	free_page_obj[*free_page_idx].pfn = page_to_pfn(page);
 	free_page_obj[*free_page_idx].pages = 1 << order;
 	*free_page_idx += 1;
-- 
2.9.4
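
Once this is built in, the new events can be enabled at run time through
tracefs like any other kmem event, for example:

	# echo 1 > /sys/kernel/debug/tracing/events/kmem/guest_free_page/enable
	# echo 1 > /sys/kernel/debug/tracing/events/kmem/guest_alloc_page/enable
	# cat /sys/kernel/debug/tracing/trace_pipe

(assuming debugfs is mounted at /sys/kernel/debug; the events land in the
kmem group because they are defined in include/trace/events/kmem.h).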

* [Patch v4 4/6] virtio: Exposes added descriptor to the other side synchronously
  2017-11-03 20:30 [Patch v4 0/6] KVM: Guest page hinting nilal
                   ` (2 preceding siblings ...)
  2017-11-03 20:30 ` [Patch v4 3/6] KVM: Adding tracepoints for guest page hinting nilal
@ 2017-11-03 20:30 ` nilal
  2017-11-03 20:30 ` [Patch v4 5/6] KVM: Sending hyperlist to the host via hinting_vq nilal
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: nilal @ 2017-11-03 20:30 UTC (permalink / raw)
  To: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk

From: Nitesh Narayan Lal <nilal@redhat.com>

This patch enables the driver to expose a chain of buffers to the other end
using vring descriptors, followed by a kick. It then busy-waits until the
other side has consumed the update.

Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
---
 drivers/virtio/virtio_ring.c | 157 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/virtio.h       |  19 ++++++
 2 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index eb30f3e..651ce8f 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -438,6 +438,136 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 }
 
 /**
+ * virtqueue_add_chain - expose a chain of buffers to the other end
+ * @_vq: the struct virtqueue we're talking about.
+ * @head: desc id of the chain head.
+ * @indirect: set if the chain of descs consists of indirect descs.
+ * @indir_desc: the first indirect desc.
+ * @data: the token identifying the chain.
+ * @ctx: extra context for the token.
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ *
+ * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
+ */
+int virtqueue_add_chain(struct virtqueue *_vq,
+			unsigned int head,
+			bool indirect,
+			struct vring_desc *indir_desc,
+			void *data,
+			void *ctx)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	/* The desc chain is empty. */
+	if (head == VIRTQUEUE_DESC_ID_INIT)
+		return 0;
+
+	START_USE(vq);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return -EIO;
+	}
+
+	/* This is the data for callback, in our case may not be required. */
+	vq->desc_state[head].data = data;
+	if (indirect)
+		vq->desc_state[head].indir_desc = indir_desc;
+	if (ctx)
+		vq->desc_state[head].indir_desc = ctx;
+
+	vq->avail_idx_shadow = 1;
+	vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
+	vq->num_added = 1;
+	END_USE(vq);
+	virtqueue_kick_sync(_vq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtqueue_add_chain);
+
+/**
+ * virtqueue_add_chain_desc - add a buffer to a chain using a vring desc
+ * @vq: the struct virtqueue we're talking about.
+ * @addr: address of the buffer to add.
+ * @len: length of the buffer.
+ * @head_id: desc id of the chain head.
+ * @prev_id: desc id of the previous buffer.
+ * @in: set if the buffer is for the device to write.
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ *
+ * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
+ */
+int virtqueue_add_chain_desc(struct virtqueue *_vq,
+			     u64 addr,
+			     u32 len,
+			     unsigned int *head_id,
+			     unsigned int *prev_id,
+			     bool in)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_desc *desc = vq->vring.desc;
+	u16 flags = in ? VRING_DESC_F_WRITE : 0;
+	unsigned int i;
+
+	/* Sanity check */
+	if (!_vq || !head_id || !prev_id)
+		return -EINVAL;
+retry:
+	START_USE(vq);
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return -EIO;
+	}
+
+	if (vq->vq.num_free < 1) {
+		/*
+		 * If there is no desc available in the vq, kick what is
+		 * already added, and restart building a new chain for
+		 * the passed sg.
+		 */
+		if (likely(*head_id != VIRTQUEUE_DESC_ID_INIT)) {
+			END_USE(vq);
+			virtqueue_add_chain(_vq, *head_id, 0, NULL, vq, NULL);
+			virtqueue_kick_sync(_vq);
+			*head_id = VIRTQUEUE_DESC_ID_INIT;
+			*prev_id = VIRTQUEUE_DESC_ID_INIT;
+			goto retry;
+		} else {
+			END_USE(vq);
+			return -ENOSPC;
+		}
+	}
+
+	i = vq->free_head;
+	flags &= ~VRING_DESC_F_NEXT;
+	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
+	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
+	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
+
+	/* Add the desc to the end of the chain */
+	if (*prev_id != VIRTQUEUE_DESC_ID_INIT) {
+		desc[*prev_id].next = cpu_to_virtio16(_vq->vdev, i);
+		desc[*prev_id].flags |= cpu_to_virtio16(_vq->vdev,
+							VRING_DESC_F_NEXT);
+	}
+	*prev_id = i;
+	if (*head_id == VIRTQUEUE_DESC_ID_INIT)
+		*head_id = *prev_id;
+
+	vq->vq.num_free--;
+	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
+	END_USE(vq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtqueue_add_chain_desc);
+
+/**
  * virtqueue_add_sgs - expose buffers to other end
  * @vq: the struct virtqueue we're talking about.
  * @sgs: array of terminated scatterlists.
@@ -559,7 +689,6 @@ bool virtqueue_kick_prepare(struct virtqueue *_vq)
 	START_USE(vq);
 	/* We need to expose available array entries before checking avail
 	 * event. */
-	virtio_mb(vq->weak_barriers);
 
 	old = vq->avail_idx_shadow - vq->num_added;
 	new = vq->avail_idx_shadow;
@@ -609,6 +738,32 @@ bool virtqueue_notify(struct virtqueue *_vq)
 EXPORT_SYMBOL_GPL(virtqueue_notify);
 
 /**
+ * virtqueue_kick_sync - update after add_buf and busy wait till update is done
+ * @vq: the struct virtqueue
+ *
+ * After one or more virtqueue_add_* calls, invoke this to kick
+ * the other side. Busy wait till the other side is done with the update.
+ *
+ * Caller must ensure we don't call this with other virtqueue
+ * operations at the same time (except where noted).
+ *
+ * Returns false if kick failed, otherwise true.
+ */
+bool virtqueue_kick_sync(struct virtqueue *vq)
+{
+	u32 len;
+
+	if (likely(virtqueue_kick(vq))) {
+		while (!virtqueue_get_buf(vq, &len) &&
+		       !virtqueue_is_broken(vq))
+			cpu_relax();
+		return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
+
+/**
  * virtqueue_kick - update after add_buf
  * @vq: the struct virtqueue
  *
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 28b0e96..bc95e84 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -56,6 +56,25 @@ int virtqueue_add_sgs(struct virtqueue *vq,
 		      unsigned int in_sgs,
 		      void *data,
 		      gfp_t gfp);
+/* A desc with this init id is treated as an invalid desc */
+#define VIRTQUEUE_DESC_ID_INIT UINT_MAX
+int virtqueue_add_chain_desc(struct virtqueue *_vq,
+			     u64 addr,
+			     u32 len,
+			     unsigned int *head_id,
+			     unsigned int *prev_id,
+			     bool in);
+
+int virtqueue_add_chain(struct virtqueue *_vq,
+			unsigned int head,
+			bool indirect,
+			struct vring_desc *indirect_desc,
+			void *data,
+			void *ctx);
+
+bool virtqueue_kick_sync(struct virtqueue *vq);
+
+bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq);
 
 bool virtqueue_kick(struct virtqueue *vq);
 
-- 
2.9.4
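
For callers, the intended usage is to grow the chain one descriptor at a time
with virtqueue_add_chain_desc() and then expose it with virtqueue_add_chain(),
which kicks synchronously. A minimal sketch of a caller (modelled on
tell_host_one_page() in the next patch; the function name here is illustrative):

	static int post_buf_sync(struct virtqueue *vq, u64 addr, u32 len,
				 void *token)
	{
		unsigned int head = VIRTQUEUE_DESC_ID_INIT;
		unsigned int prev = VIRTQUEUE_DESC_ID_INIT;
		int err;

		/* Add one out-buffer; the helper kicks and retries
		 * internally if the ring runs out of free descriptors. */
		err = virtqueue_add_chain_desc(vq, addr, len, &head, &prev,
					       false);
		if (err)
			return err;

		/* Expose the chain; this busy-waits via
		 * virtqueue_kick_sync() until the device pushes the buffer
		 * back. */
		return virtqueue_add_chain(vq, head, false, NULL, token, NULL);
	}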

* [Patch v4 5/6] KVM: Sending hyperlist to the host via hinting_vq
  2017-11-03 20:30 [Patch v4 0/6] KVM: Guest page hinting nilal
                   ` (3 preceding siblings ...)
  2017-11-03 20:30 ` [Patch v4 4/6] virtio: Exposes added descriptor to the other side synchronously nilal
@ 2017-11-03 20:30 ` nilal
  2017-11-03 20:30 ` [Patch v4 6/6] KVM: Enabling guest page hinting via static key nilal
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: nilal @ 2017-11-03 20:30 UTC (permalink / raw)
  To: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk

From: Nitesh Narayan Lal <nilal@redhat.com>

This patch creates a new vq (hinting_vq) to be used for page hinting
and adds support to the existing virtio-balloon infrastructure so
that the hyperlist carrying the pages to be freed can be sent to the
host (QEMU) for processing via hinting_vq.

Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 45 ++++++++++++++++++++++++++++++++++++-----
 include/linux/page_hinting.h    | 16 +++++++++++++++
 virt/kvm/page_hinting.c         | 36 +++++++++++++--------------------
 3 files changed, 70 insertions(+), 27 deletions(-)
 create mode 100644 include/linux/page_hinting.h

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..e678196 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,7 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/page_hinting.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -53,7 +54,7 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *hinting_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -95,6 +96,26 @@ static struct virtio_device_id id_table[] = {
 	{ 0 },
 };
 
+#ifdef CONFIG_KVM_FREE_PAGE_HINTING
+static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq,
+			       u64 gvaddr, int len)
+{
+	unsigned int id = VIRTQUEUE_DESC_ID_INIT;
+	u64 gpaddr = virt_to_phys((void *)gvaddr);
+
+	virtqueue_add_chain_desc(vq, gpaddr, len, &id, &id, 0);
+	virtqueue_add_chain(vq, id, 0, NULL, (void *)gpaddr, NULL);
+}
+
+void virtballoon_page_hinting(struct virtio_balloon *vb, int hyper_entries)
+{
+	u64 gvaddr = (u64)hypervisor_pagelist;
+
+	vb->num_pfns = hyper_entries;
+	tell_host_one_page(vb, vb->hinting_vq, gvaddr, hyper_entries);
+}
+#endif
+
 static u32 page_to_balloon_pfn(struct page *page)
 {
 	unsigned long pfn = page_to_pfn(page);
@@ -111,6 +132,12 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
+static void hinting_ack(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	wake_up(&vb->acked);
+}
 static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
 	struct scatterlist sg;
@@ -404,22 +431,25 @@ static void update_balloon_size_func(struct work_struct *work)
 
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
+	struct virtqueue *vqs[4];
+	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, hinting_ack,
+				       stats_request };
+	static const char * const names[] = { "inflate", "deflate", "hinting",
+					      "stats" };
 	int err, nvqs;
 
 	/*
 	 * We expect two virtqueues: inflate and deflate, and
 	 * optionally stat.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
+	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 4 : 3;
 	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
 	if (err)
 		return err;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	vb->hinting_vq = vqs[3];
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		struct scatterlist sg;
 		unsigned int num_stats;
@@ -581,6 +611,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
 
 	virtio_device_ready(vdev);
 
+#ifdef CONFIG_KVM_FREE_PAGE_HINTING
+	request_hypercall = (void *)&virtballoon_page_hinting;
+	balloon_ptr = vb;
+#endif
+
 	if (towards_target(vb))
 		virtballoon_changed(vdev);
 	return 0;
diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
new file mode 100644
index 0000000..0bfb646
--- /dev/null
+++ b/include/linux/page_hinting.h
@@ -0,0 +1,16 @@
+#define MAX_FGPT_ENTRIES	1000
+/*
+ * hypervisor_pages - It is a dummy structure passed with the hypercall.
+ * @pfn - page frame number for the page which is to be freed.
+ * @pages - number of pages which are supposed to be freed.
+ * A global array object is used to hold the list of pfn and pages and is
+ * passed as part of the hypercall.
+ */
+struct hypervisor_pages {
+	unsigned long pfn;
+	unsigned int pages;
+};
+
+extern struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
+extern void (*request_hypercall)(void *, int);
+extern void *balloon_ptr;
diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
index 54fe6bc..22c892b 100644
--- a/virt/kvm/page_hinting.c
+++ b/virt/kvm/page_hinting.c
@@ -5,8 +5,8 @@
 #include <linux/sort.h>
 #include <linux/kernel.h>
 #include <trace/events/kmem.h>
+#include <linux/page_hinting.h>
 
-#define MAX_FGPT_ENTRIES	1000
 #define HYPERLIST_THRESHOLD	500
 /*
  * struct kvm_free_pages - Tracks the pages which are freed by the guest.
@@ -21,22 +21,15 @@ struct kvm_free_pages {
 	unsigned int pages;
 };
 
-/*
- * hypervisor_pages - It is a dummy structure passed with the hypercall.
- * @pfn - page frame number for the page which is to be freed.
- * @pages - number of pages which are supposed to be freed.
- * A global array object is used to hold the list of pfn and pages and is
- * passed as part of the hypercall.
- */
-struct hypervisor_pages {
-	unsigned long pfn;
-	unsigned int pages;
-};
-
 static __cacheline_aligned_in_smp DEFINE_SEQLOCK(guest_page_lock);
 DEFINE_PER_CPU(struct kvm_free_pages [MAX_FGPT_ENTRIES], kvm_pt);
 DEFINE_PER_CPU(int, kvm_pt_idx);
 struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
+EXPORT_SYMBOL(hypervisor_pagelist);
+void (*request_hypercall)(void *, int);
+EXPORT_SYMBOL(request_hypercall);
+void *balloon_ptr;
+EXPORT_SYMBOL(balloon_ptr);
 
 static void empty_hyperlist(void)
 {
@@ -49,13 +42,11 @@ static void empty_hyperlist(void)
 	}
 }
 
-void make_hypercall(void)
+void hyperlist_ready(int entries)
 {
-	/*
-	 * Dummy function: To be filled in later.
-	 */
-	empty_hyperlist();
 	trace_guest_str_dump("Hypercall to host...:");
+	request_hypercall(balloon_ptr, entries);
+	empty_hyperlist();
 }
 
 static int sort_pfn(const void *a1, const void *b1)
@@ -156,7 +147,7 @@ int compress_hyperlist(void)
 	if (merge_counter != 0)
 		ret = pack_hyperlist() - 1;
 	else
-		ret = MAX_FGPT_ENTRIES - 1;
+		ret = MAX_FGPT_ENTRIES;
 	return ret;
 }
 
@@ -227,16 +218,16 @@ void arch_free_page_slowpath(void)
 			 */
 			if (!prev_free) {
 				hyper_idx++;
-				hypervisor_pagelist[hyper_idx].pfn = pfn;
-				hypervisor_pagelist[hyper_idx].pages = 1;
 				trace_guest_free_page_slowpath(
 				hypervisor_pagelist[hyper_idx].pfn,
 				hypervisor_pagelist[hyper_idx].pages);
+				hypervisor_pagelist[hyper_idx].pfn = pfn;
+				hypervisor_pagelist[hyper_idx].pages = 1;
 				if (hyper_idx == MAX_FGPT_ENTRIES - 1) {
 					hyper_idx =  compress_hyperlist();
 					if (hyper_idx >=
 					    HYPERLIST_THRESHOLD) {
-						make_hypercall();
+						hyperlist_ready(hyper_idx);
 						hyper_idx = 0;
 					}
 				}
@@ -272,6 +263,7 @@ void arch_alloc_page(struct page *page, int order)
 	 * free pages is full and a hypercall will be made. Until complete free
 	 * page list is traversed, no further allocation will be allowed.
 	 */
+
 	do {
 		seq = read_seqbegin(&guest_page_lock);
 	} while (read_seqretry(&guest_page_lock, seq));
-- 
2.9.4
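
With this patch the full guest-to-host path is wired up: once
arch_free_page_slowpath() has filled and compressed hypervisor_pagelist,
hyperlist_ready() calls through the request_hypercall function pointer into
virtballoon_page_hinting(), which hands the guest physical address of the
hyperlist to the host over hinting_vq and busy-waits (virtqueue_kick_sync())
until the host has processed the entries and returned the buffer.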

* [Patch v4 6/6] KVM: Enabling guest page hinting via static key
  2017-11-03 20:30 [Patch v4 0/6] KVM: Guest page hinting nilal
                   ` (4 preceding siblings ...)
  2017-11-03 20:30 ` [Patch v4 5/6] KVM: Sending hyperlist to the host via hinting_vq nilal
@ 2017-11-03 20:30 ` nilal
  2017-11-03 20:37 ` [QEMU PATCH] kvm: Support for guest page hinting nilal
  2017-11-12 21:23 ` [Patch v4 0/6] KVM: Guest " Rik van Riel
  7 siblings, 0 replies; 17+ messages in thread
From: nilal @ 2017-11-03 20:30 UTC (permalink / raw)
  To: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk

From: Nitesh Narayan Lal <nilal@redhat.com>

This patch adds support for enabling or disabling
guest page hinting via a static key, which can
be set through sysctl.

Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
---
 include/linux/page_hinting.h |  3 +++
 kernel/sysctl.c              | 10 ++++++++++
 virt/kvm/page_hinting.c      | 26 ++++++++++++++++++++++++++
 3 files changed, 39 insertions(+)

diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
index 0bfb646..6f6068a 100644
--- a/include/linux/page_hinting.h
+++ b/include/linux/page_hinting.h
@@ -14,3 +14,6 @@ struct hypervisor_pages {
 extern struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
 extern void (*request_hypercall)(void *, int);
 extern void *balloon_ptr;
+int guest_page_hinting_sysctl(struct ctl_table *table, int write,
+			      void __user *buffer, size_t *lenp, loff_t *ppos);
+extern int guest_page_hinting_flag;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d9c31bc..f4c4ddc 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -67,6 +67,7 @@
 #include <linux/kexec.h>
 #include <linux/bpf.h>
 #include <linux/mount.h>
+#include <linux/page_hinting.h>
 
 #include <linux/uaccess.h>
 #include <asm/processor.h>
@@ -1849,6 +1850,15 @@ static struct ctl_table fs_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &one,
 	},
+#ifdef CONFIG_KVM_FREE_PAGE_HINTING
+	{
+		.procname	= "guest-page-hinting",
+		.data		= &guest_page_hinting_flag,
+		.maxlen		= sizeof(guest_page_hinting_flag),
+		.mode		= 0644,
+		.proc_handler   = guest_page_hinting_sysctl,
+	},
+#endif
 	{ }
 };
 
diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
index 22c892b..d4bfbce 100644
--- a/virt/kvm/page_hinting.c
+++ b/virt/kvm/page_hinting.c
@@ -30,6 +30,28 @@ void (*request_hypercall)(void *, int);
 EXPORT_SYMBOL(request_hypercall);
 void *balloon_ptr;
 EXPORT_SYMBOL(balloon_ptr);
+DEFINE_STATIC_KEY_FALSE(guest_page_hinting_key);
+EXPORT_SYMBOL(guest_page_hinting_key);
+static DEFINE_MUTEX(hinting_mutex);
+int guest_page_hinting_flag;
+
+int guest_page_hinting_sysctl(struct ctl_table *table, int write,
+			      void __user *buffer, size_t *lenp,
+			      loff_t *ppos)
+{
+	int ret;
+
+	mutex_lock(&hinting_mutex);
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (guest_page_hinting_flag)
+		static_key_enable(&guest_page_hinting_key.key);
+	else
+		static_key_disable(&guest_page_hinting_key.key);
+	mutex_unlock(&hinting_mutex);
+	return ret;
+}
 
 static void empty_hyperlist(void)
 {
@@ -258,6 +280,8 @@ void arch_alloc_page(struct page *page, int order)
 {
 	unsigned int seq;
 
+	if (!static_branch_unlikely(&guest_page_hinting_key))
+		return;
 	/*
 	 * arch_free_page will acquire the lock once the list carrying guest
 	 * free pages is full and a hypercall will be made. Until complete free
@@ -276,6 +300,8 @@ void arch_free_page(struct page *page, int order)
 	struct kvm_free_pages *free_page_obj;
 	unsigned long flags;
 
+	if (!static_branch_unlikely(&guest_page_hinting_key))
+		return;
 	/*
 	 * use of global variables may trigger a race condition between irq and
 	 * process context causing unwanted overwrites. This will be replaced
-- 
2.9.4
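
Given the ctl_table the new entry is registered in (fs_table), the knob
surfaces under the fs namespace and can be toggled at run time with, for
example:

	# echo 1 > /proc/sys/fs/guest-page-hinting

or equivalently with sysctl fs.guest-page-hinting=1.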

* [QEMU PATCH] kvm: Support for guest page hinting
  2017-11-03 20:30 [Patch v4 0/6] KVM: Guest page hinting nilal
                   ` (5 preceding siblings ...)
  2017-11-03 20:30 ` [Patch v4 6/6] KVM: Enabling guest page hinting via static key nilal
@ 2017-11-03 20:37 ` nilal
  2017-11-03 20:37   ` nilal
  2017-11-12 21:23 ` [Patch v4 0/6] KVM: Guest " Rik van Riel
  7 siblings, 1 reply; 17+ messages in thread
From: nilal @ 2017-11-03 20:37 UTC (permalink / raw)
  To: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk

This patch adds a new vq (hvq) to existing virtio-balloon code in QEMU.
This hvq is used to receive the address where the list of guest PFNs
which are supposed to be freed is stored.

* [QEMU PATCH] kvm: Support for guest page hinting
  2017-11-03 20:37 ` [QEMU PATCH] kvm: Support for guest page hinting nilal
@ 2017-11-03 20:37   ` nilal
  2017-11-06 11:21     ` Pankaj Gupta
  0 siblings, 1 reply; 17+ messages in thread
From: nilal @ 2017-11-03 20:37 UTC (permalink / raw)
  To: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk

From: Nitesh Narayan Lal <nilal@redhat.com>

This patch enables QEMU to handle page hinting requests
from the guest. Once the guest kicks QEMU to free a page,
QEMU retrieves the guest physical address and converts it to
a host virtual address and then madvises that memory.
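
Note on semantics: since the host side uses QEMU_MADV_FREE (see the osdep.h
hunk), the host kernel reclaims the hinted pages only lazily, under memory
pressure; until then their contents remain intact, and a later guest write
simply cancels the lazy free. That is what makes MADV_FREE preferable here
to the destructive MADV_DONTNEED.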

Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
---
 hw/virtio/virtio-balloon.c         | 81 ++++++++++++++++++++++++++++++++++++++
 hw/virtio/virtio.c                 | 24 +++++++++++
 include/hw/virtio/virtio-access.h  |  1 +
 include/hw/virtio/virtio-balloon.h |  2 +-
 include/qemu/osdep.h               |  7 ++++
 5 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 37cde38..de10350 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -33,6 +33,8 @@
 
 #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
 
+void page_hinting_request(uint64_t addr, uint32_t len);
+
 static void balloon_page(void *addr, int deflate)
 {
     if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
@@ -205,6 +207,84 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
     balloon_stats_change_timer(s, 0);
 }
 
+static void *gpa2hva(MemoryRegion **p_mr, hwaddr addr, Error **errp)
+{
+    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
+                                                 addr, 1);
+
+    if (!mrs.mr) {
+        error_setg(errp, "No memory is mapped at address 0x%" HWADDR_PRIx, addr);
+        return NULL;
+    }
+
+    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
+        error_setg(errp, "Memory at address 0x%" HWADDR_PRIx "is not RAM", addr);
+        memory_region_unref(mrs.mr);
+        return NULL;
+    }
+
+    *p_mr = mrs.mr;
+    return qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
+}
+
+struct guest_pages {
+	unsigned long pfn;
+	unsigned int pages;
+};
+
+
+void page_hinting_request(uint64_t addr, uint32_t len)
+{
+    Error *local_err = NULL;
+    MemoryRegion *mr = NULL;
+    void *hvaddr;
+    int ret = 0;
+    struct guest_pages *guest_obj;
+    int i = 0;
+    void *hvaddr_to_free;
+    unsigned long pfn, pfn_end;
+    uint64_t gpaddr_to_free;
+
+    /*ptr is the host physical address*/
+    hvaddr = gpa2hva(&mr, addr, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return;
+    }
+    guest_obj = hvaddr;
+
+    while (i < len) {
+        pfn = guest_obj[i].pfn;
+	pfn_end = guest_obj[i].pfn + guest_obj[i].pages - 1;
+	while (pfn <= pfn_end) {
+	        gpaddr_to_free = pfn << VIRTIO_BALLOON_PFN_SHIFT;
+	        hvaddr_to_free = gpa2hva(&mr, gpaddr_to_free, &local_err);
+	        if (local_err) {
+			error_report_err(local_err);
+		        return;
+		}
+		ret = qemu_madvise((void *)hvaddr_to_free, 4096, QEMU_MADV_FREE);
+		if (ret == -1)
+		    printf("\n%d:%s Error: Madvise failed with error:%d\n", __LINE__, __func__, ret);
+		pfn++;
+	}
+	i++;
+    }
+}
+
+
+static void virtio_balloon_page_hinting(VirtIODevice *vdev, VirtQueue *vq)
+{
+    uint64_t addr;
+    uint32_t len;
+    VirtQueueElement elem = {};
+
+    pop_hinting_addr(vq, &addr, &len);
+    page_hinting_request(addr, len);
+    virtqueue_push(vq, &elem, 0);
+    virtio_notify(vdev, vq);
+}
+
 static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
@@ -443,6 +523,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
+    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_page_hinting);
 
     reset_stats(s);
 }
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 311929e..24e26ec 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -825,6 +825,30 @@ static void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_nu
     return elem;
 }
 
+void pop_hinting_addr(VirtQueue *vq, uint64_t *addr, uint32_t *len)
+{
+   VRingMemoryRegionCaches *caches;
+   VRingDesc desc;
+   MemoryRegionCache *desc_cache;
+   VirtIODevice *vdev = vq->vdev;
+   unsigned int head, i, max;
+
+   max = vq->vring.num;
+   if (!virtqueue_get_head(vq, vq->last_avail_idx++, &head)) {
+	printf("\n%d:%sError: Unable to read head\n", __LINE__, __func__);
+   }
+   i = head;
+
+   caches = vring_get_region_caches(vq);
+   if (caches->desc.len < max * sizeof(VRingDesc)) {
+       virtio_error(vdev, "Cannot map descriptor ring");
+   }
+   desc_cache = &caches->desc;
+   vring_desc_read(vdev, &desc, desc_cache, i);
+   *addr = desc.addr;
+   *len = desc.len;
+}
+
 void *virtqueue_pop(VirtQueue *vq, size_t sz)
 {
     unsigned int i, head, max;
diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
index 2e92074..568d71f 100644
--- a/include/hw/virtio/virtio-access.h
+++ b/include/hw/virtio/virtio-access.h
@@ -24,6 +24,7 @@
 #define LEGACY_VIRTIO_IS_BIENDIAN 1
 #endif
 
+void pop_hinting_addr(VirtQueue *vq, uint64_t *addr, uint32_t *len);
 static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
 {
 #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 1ea13bd..dfb5782 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -33,7 +33,7 @@ typedef struct virtio_balloon_stat_modern {
 
 typedef struct VirtIOBalloon {
     VirtIODevice parent_obj;
-    VirtQueue *ivq, *dvq, *svq;
+    VirtQueue *ivq, *dvq, *svq, *hvq;
     uint32_t num_pages;
     uint32_t actual;
     uint64_t stats[VIRTIO_BALLOON_S_NR];
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 9dd318a..033d64c 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -278,6 +278,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #else
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
 #endif
+#ifdef MADV_FREE
+#define QEMU_MADV_FREE MADV_FREE
+#else
+#define QEMU_MADV_FREE QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -291,6 +296,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_FREE QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -304,6 +310,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_FREE QEMU_MADV_INVALID
 
 #endif
 
-- 
2.9.4

* Re: [QEMU PATCH] kvm: Support for guest page hinting
  2017-11-03 20:37   ` nilal
@ 2017-11-06 11:21     ` Pankaj Gupta
  2017-11-06 14:21       ` Nitesh Narayan Lal
  0 siblings, 1 reply; 17+ messages in thread
From: Pankaj Gupta @ 2017-11-06 11:21 UTC (permalink / raw)
  To: nilal
  Cc: kvm, pbonzini, wei w wang, yang zhang wz, riel, david, mst,
	konrad wilk, dodgen


Hi Nitesh,

> 
> This patch enables QEMU to handle page hinting requests
> from the guest. Once the guest kicks QEMU to free a page,
> QEMU retrieves the guest physical address and converts it to
> a host virtual address and then madvises that memory.
> 
> Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
> ---
>  hw/virtio/virtio-balloon.c         | 81
>  ++++++++++++++++++++++++++++++++++++++
>  hw/virtio/virtio.c                 | 24 +++++++++++
>  include/hw/virtio/virtio-access.h  |  1 +
>  include/hw/virtio/virtio-balloon.h |  2 +-
>  include/qemu/osdep.h               |  7 ++++
>  5 files changed, 114 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 37cde38..de10350 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -33,6 +33,8 @@
>  
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>  
> +void page_hinting_request(uint64_t addr, uint32_t len);
> +
>  static void balloon_page(void *addr, int deflate)
>  {
>      if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> @@ -205,6 +207,84 @@ static void balloon_stats_set_poll_interval(Object *obj,
> Visitor *v,
>      balloon_stats_change_timer(s, 0);
>  }
>  
> +static void *gpa2hva(MemoryRegion **p_mr, hwaddr addr, Error **errp)
> +{
> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
> +                                                 addr, 1);
> +
> +    if (!mrs.mr) {
> +        error_setg(errp, "No memory is mapped at address 0x%" HWADDR_PRIx,
> addr);
> +        return NULL;
> +    }
> +
> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
> +        error_setg(errp, "Memory at address 0x%" HWADDR_PRIx "is not RAM",
> addr);
> +        memory_region_unref(mrs.mr);
> +        return NULL;
> +    }
> +
> +    *p_mr = mrs.mr;
> +    return qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
> +}
> +
> +struct guest_pages {
> +	unsigned long pfn;
> +	unsigned int pages;
> +};
> +
> +
> +void page_hinting_request(uint64_t addr, uint32_t len)
> +{
> +    Error *local_err = NULL;
> +    MemoryRegion *mr = NULL;
> +    void *hvaddr;
> +    int ret = 0;
> +    struct guest_pages *guest_obj;
> +    int i = 0;
> +    void *hvaddr_to_free;
> +    unsigned long pfn, pfn_end;
> +    uint64_t gpaddr_to_free;
> +
> +    /*ptr is the host physical address*/

Could not understand the comment?

> +    hvaddr = gpa2hva(&mr, addr, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        return;
> +    }
> +    guest_obj = hvaddr;
> +
> +    while (i < len) {
> +        pfn = guest_obj[i].pfn;
> +	pfn_end = guest_obj[i].pfn + guest_obj[i].pages - 1;
> +	while (pfn <= pfn_end) {
> +	        gpaddr_to_free = pfn << VIRTIO_BALLOON_PFN_SHIFT;
> +	        hvaddr_to_free = gpa2hva(&mr, gpaddr_to_free, &local_err);
> +	        if (local_err) {
> +			error_report_err(local_err);
> +		        return;
> +		}
> +		ret = qemu_madvise((void *)hvaddr_to_free, 4096, QEMU_MADV_FREE);
> +		if (ret == -1)
> +		    printf("\n%d:%s Error: Madvise failed with error:%d\n", __LINE__,
> __func__, ret);
> +		pfn++;
> +	}
> +	i++;
> +    }
> +}
> +
> +
> +static void virtio_balloon_page_hinting(VirtIODevice *vdev, VirtQueue *vq)
> +{
> +    uint64_t addr;
> +    uint32_t len;
> +    VirtQueueElement elem = {};
> +
> +    pop_hinting_addr(vq, &addr, &len);
> +    page_hinting_request(addr, len);
> +    virtqueue_push(vq, &elem, 0);
> +    virtio_notify(vdev, vq);
> +}
> +
>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
> @@ -443,6 +523,7 @@ static void virtio_balloon_device_realize(DeviceState
> *dev, Error **errp)
>      s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
> +    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_page_hinting);
>  
>      reset_stats(s);
>  }
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 311929e..24e26ec 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -825,6 +825,30 @@ static void *virtqueue_alloc_element(size_t sz, unsigned
> out_num, unsigned in_nu
>      return elem;
>  }
>  
> +void pop_hinting_addr(VirtQueue *vq, uint64_t *addr, uint32_t *len)
> +{
> +   VRingMemoryRegionCaches *caches;
> +   VRingDesc desc;
> +   MemoryRegionCache *desc_cache;
> +   VirtIODevice *vdev = vq->vdev;
> +   unsigned int head, i, max;
> +
> +   max = vq->vring.num;
> +   if (!virtqueue_get_head(vq, vq->last_avail_idx++, &head)) {
> +	printf("\n%d:%sError: Unable to read head\n", __LINE__, __func__);
> +   }
> +   i = head;

we can avoid 'i' here and directly pass head to the
'vring_desc_read' function?

> +
> +   caches = vring_get_region_caches(vq);
> +   if (caches->desc.len < max * sizeof(VRingDesc)) {
> +       virtio_error(vdev, "Cannot map descriptor ring");
> +   }
> +   desc_cache = &caches->desc;
> +   vring_desc_read(vdev, &desc, desc_cache, i);
> +   *addr = desc.addr;
> +   *len = desc.len;
> +}
> +
>  void *virtqueue_pop(VirtQueue *vq, size_t sz)
>  {
>      unsigned int i, head, max;
> diff --git a/include/hw/virtio/virtio-access.h
> b/include/hw/virtio/virtio-access.h
> index 2e92074..568d71f 100644
> --- a/include/hw/virtio/virtio-access.h
> +++ b/include/hw/virtio/virtio-access.h
> @@ -24,6 +24,7 @@
>  #define LEGACY_VIRTIO_IS_BIENDIAN 1
>  #endif
>  
> +void pop_hinting_addr(VirtQueue *vq, uint64_t *addr, uint32_t *len);
>  static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
>  {
>  #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
> diff --git a/include/hw/virtio/virtio-balloon.h
> b/include/hw/virtio/virtio-balloon.h
> index 1ea13bd..dfb5782 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -33,7 +33,7 @@ typedef struct virtio_balloon_stat_modern {
>  
>  typedef struct VirtIOBalloon {
>      VirtIODevice parent_obj;
> -    VirtQueue *ivq, *dvq, *svq;
> +    VirtQueue *ivq, *dvq, *svq, *hvq;
>      uint32_t num_pages;
>      uint32_t actual;
>      uint64_t stats[VIRTIO_BALLOON_S_NR];
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 9dd318a..033d64c 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -278,6 +278,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #else
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>  #endif
> +#ifdef MADV_FREE
> +#define QEMU_MADV_FREE MADV_FREE
> +#else
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
> +#endif
>  
>  #elif defined(CONFIG_POSIX_MADVISE)
>  
> @@ -291,6 +296,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>  
>  #else /* no-op */
>  
> @@ -304,6 +310,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>  
>  #endif
>  
> --
> 2.9.4
> 
> 

* Re: [QEMU PATCH] kvm: Support for guest page hinting
  2017-11-06 11:21     ` Pankaj Gupta
@ 2017-11-06 14:21       ` Nitesh Narayan Lal
  0 siblings, 0 replies; 17+ messages in thread
From: Nitesh Narayan Lal @ 2017-11-06 14:21 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: kvm, pbonzini, wei w wang, yang zhang wz, riel, david, mst,
	konrad wilk, dodgen


Hi Pankaj,


On 11/06/2017 06:21 AM, Pankaj Gupta wrote:
> Hi Nitesh,
>
>> This patch enables QEMU to handle page hinting requests
>> from the guest. Once the guest kicks QEMU to free a page,
>> QEMU retrieves the guest physical address and converts it to
>> a host virtual address and then madvises that memory.
>>
>> Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
>> ---
>>  hw/virtio/virtio-balloon.c         | 81
>>  ++++++++++++++++++++++++++++++++++++++
>>  hw/virtio/virtio.c                 | 24 +++++++++++
>>  include/hw/virtio/virtio-access.h  |  1 +
>>  include/hw/virtio/virtio-balloon.h |  2 +-
>>  include/qemu/osdep.h               |  7 ++++
>>  5 files changed, 114 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index 37cde38..de10350 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -33,6 +33,8 @@
>>  
>>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>>  
>> +void page_hinting_request(uint64_t addr, uint32_t len);
>> +
>>  static void balloon_page(void *addr, int deflate)
>>  {
>>      if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
>> @@ -205,6 +207,84 @@ static void balloon_stats_set_poll_interval(Object *obj,
>> Visitor *v,
>>      balloon_stats_change_timer(s, 0);
>>  }
>>  
>> +static void *gpa2hva(MemoryRegion **p_mr, hwaddr addr, Error **errp)
>> +{
>> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
>> +                                                 addr, 1);
>> +
>> +    if (!mrs.mr) {
>> +        error_setg(errp, "No memory is mapped at address 0x%" HWADDR_PRIx,
>> addr);
>> +        return NULL;
>> +    }
>> +
>> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
>> +        error_setg(errp, "Memory at address 0x%" HWADDR_PRIx "is not RAM",
>> addr);
>> +        memory_region_unref(mrs.mr);
>> +        return NULL;
>> +    }
>> +
>> +    *p_mr = mrs.mr;
>> +    return qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
>> +}
>> +
>> +struct guest_pages {
>> +	unsigned long pfn;
>> +	unsigned int pages;
>> +};
>> +
>> +
>> +void page_hinting_request(uint64_t addr, uint32_t len)
>> +{
>> +    Error *local_err = NULL;
>> +    MemoryRegion *mr = NULL;
>> +    void *hvaddr;
>> +    int ret = 0;
>> +    struct guest_pages *guest_obj;
>> +    int i = 0;
>> +    void *hvaddr_to_free;
>> +    unsigned long pfn, pfn_end;
>> +    uint64_t gpaddr_to_free;
>> +
>> +    /*ptr is the host physical address*/
> Could not understand the comment?
It's a typo; it should have been something like "hvaddr is the host
virtual address obtained by converting the guest physical address".
I will correct this in the next patch-set.
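For reference, the corrected comment would read roughly as follows
(final wording may still change):

/*
 * hvaddr is the host virtual address obtained by translating the
 * guest physical address received from the guest's hinting request.
 */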
>
>> +    hvaddr = gpa2hva(&mr, addr, &local_err);
>> +    if (local_err) {
>> +        error_report_err(local_err);
>> +        return;
>> +    }
>> +    guest_obj = hvaddr;
>> +
>> +    while (i < len) {
>> +        pfn = guest_obj[i].pfn;
>> +	pfn_end = guest_obj[i].pfn + guest_obj[i].pages - 1;
>> +	while (pfn <= pfn_end) {
>> +	        gpaddr_to_free = pfn << VIRTIO_BALLOON_PFN_SHIFT;
>> +	        hvaddr_to_free = gpa2hva(&mr, gpaddr_to_free, &local_err);
>> +	        if (local_err) {
>> +			error_report_err(local_err);
>> +		        return;
>> +		}
>> +		ret = qemu_madvise((void *)hvaddr_to_free, 4096, QEMU_MADV_FREE);
>> +		if (ret == -1)
>> +		    printf("\n%d:%s Error: Madvise failed with error:%d\n", __LINE__,
>> __func__, ret);
>> +		pfn++;
>> +	}
>> +	i++;
>> +    }
>> +}
>> +
>> +
>> +static void virtio_balloon_page_hinting(VirtIODevice *vdev, VirtQueue *vq)
>> +{
>> +    uint64_t addr;
>> +    uint32_t len;
>> +    VirtQueueElement elem = {};
>> +
>> +    pop_hinting_addr(vq, &addr, &len);
>> +    page_hinting_request(addr, len);
>> +    virtqueue_push(vq, &elem, 0);
>> +    virtio_notify(vdev, vq);
>> +}
>> +
>>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>  {
>>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
>> @@ -443,6 +523,7 @@ static void virtio_balloon_device_realize(DeviceState
>> *dev, Error **errp)
>>      s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>> +    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_page_hinting);
>>  
>>      reset_stats(s);
>>  }
>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
>> index 311929e..24e26ec 100644
>> --- a/hw/virtio/virtio.c
>> +++ b/hw/virtio/virtio.c
>> @@ -825,6 +825,30 @@ static void *virtqueue_alloc_element(size_t sz, unsigned
>> out_num, unsigned in_nu
>>      return elem;
>>  }
>>  
>> +void pop_hinting_addr(VirtQueue *vq, uint64_t *addr, uint32_t *len)
>> +{
>> +   VRingMemoryRegionCaches *caches;
>> +   VRingDesc desc;
>> +   MemoryRegionCache *desc_cache;
>> +   VirtIODevice *vdev = vq->vdev;
>> +   unsigned int head, i, max;
>> +
>> +   max = vq->vring.num;
>> +   if (!virtqueue_get_head(vq, vq->last_avail_idx++, &head)) {
>> +	printf("\n%d:%sError: Unable to read head\n", __LINE__, __func__);
>> +   }
>> +   i = head;
> we can avoid 'i' here and directly pass head
> to the 'vring_desc_read' function?
Yes, that's right. Thank you for pointing that out.
I will correct this in the next version.
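For illustration, the simplified helper could look roughly like this (a
sketch only, reusing the helpers from the patch above; the error paths
are placeholders until proper handling is decided):

void pop_hinting_addr(VirtQueue *vq, uint64_t *addr, uint32_t *len)
{
    VRingMemoryRegionCaches *caches;
    VRingDesc desc;
    VirtIODevice *vdev = vq->vdev;
    unsigned int head;
    unsigned int max = vq->vring.num;

    if (!virtqueue_get_head(vq, vq->last_avail_idx++, &head)) {
        virtio_error(vdev, "Unable to read head");
        return;
    }

    caches = vring_get_region_caches(vq);
    if (caches->desc.len < max * sizeof(VRingDesc)) {
        virtio_error(vdev, "Cannot map descriptor ring");
        return;
    }

    /* 'head' is passed straight through; the intermediate 'i' is gone. */
    vring_desc_read(vdev, &desc, &caches->desc, head);
    *addr = desc.addr;
    *len = desc.len;
}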
>
>> +
>> +   caches = vring_get_region_caches(vq);
>> +   if (caches->desc.len < max * sizeof(VRingDesc)) {
>> +       virtio_error(vdev, "Cannot map descriptor ring");
>> +   }
>> +   desc_cache = &caches->desc;
>> +   vring_desc_read(vdev, &desc, desc_cache, i);
>> +   *addr = desc.addr;
>> +   *len = desc.len;
>> +}
>> +
>>  void *virtqueue_pop(VirtQueue *vq, size_t sz)
>>  {
>>      unsigned int i, head, max;
>> diff --git a/include/hw/virtio/virtio-access.h
>> b/include/hw/virtio/virtio-access.h
>> index 2e92074..568d71f 100644
>> --- a/include/hw/virtio/virtio-access.h
>> +++ b/include/hw/virtio/virtio-access.h
>> @@ -24,6 +24,7 @@
>>  #define LEGACY_VIRTIO_IS_BIENDIAN 1
>>  #endif
>>  
>> +void pop_hinting_addr(VirtQueue *vq, uint64_t *addr, uint32_t *len);
>>  static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
>>  {
>>  #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
>> diff --git a/include/hw/virtio/virtio-balloon.h
>> b/include/hw/virtio/virtio-balloon.h
>> index 1ea13bd..dfb5782 100644
>> --- a/include/hw/virtio/virtio-balloon.h
>> +++ b/include/hw/virtio/virtio-balloon.h
>> @@ -33,7 +33,7 @@ typedef struct virtio_balloon_stat_modern {
>>  
>>  typedef struct VirtIOBalloon {
>>      VirtIODevice parent_obj;
>> -    VirtQueue *ivq, *dvq, *svq;
>> +    VirtQueue *ivq, *dvq, *svq, *hvq;
>>      uint32_t num_pages;
>>      uint32_t actual;
>>      uint64_t stats[VIRTIO_BALLOON_S_NR];
>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> index 9dd318a..033d64c 100644
>> --- a/include/qemu/osdep.h
>> +++ b/include/qemu/osdep.h
>> @@ -278,6 +278,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #else
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>>  #endif
>> +#ifdef MADV_FREE
>> +#define QEMU_MADV_FREE MADV_FREE
>> +#else
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>> +#endif
>>  
>>  #elif defined(CONFIG_POSIX_MADVISE)
>>  
>> @@ -291,6 +296,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>>  
>>  #else /* no-op */
>>  
>> @@ -304,6 +310,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>>  
>>  #endif
>>  
>> --
>> 2.9.4
>>
>>

-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v4 0/6] KVM: Guest page hinting
  2017-11-03 20:30 [Patch v4 0/6] KVM: Guest page hinting nilal
                   ` (6 preceding siblings ...)
  2017-11-03 20:37 ` [QEMU PATCH] kvm: Support for guest page hinting nilal
@ 2017-11-12 21:23 ` Rik van Riel
  2017-11-13 15:14   ` Nitesh Narayan Lal
  7 siblings, 1 reply; 17+ messages in thread
From: Rik van Riel @ 2017-11-12 21:23 UTC (permalink / raw)
  To: nilal, kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, david,
	mst, dodgen, konrad.wilk

On Fri, 2017-11-03 at 16:30 -0400, nilal@redhat.com wrote:
> The following patch-set proposes an efficient mechanism for handing
> freed memory between the guest and the host. It enables the guests
> with DAX (no page cache) to rapidly free and reclaim memory to and
> from the host respectively. 


> Performance:
> 	Test criteria: Kernel Build
> 	Command: make clean;make defconfig;time make
> 	With Hinting:
> 		real: 21m24.680s
> 		user: 16m3.362s
> 		sys : 2m19.027s
> 	Without Hinting:
> 		real: 21m18.062s
> 		user: 16m13.969s
> 		sys : 1m17.884s
> 
> 	Test criteria: Stress Test
> 	Command: time stress --io 2 --cpu 2 --vm 2 --vm-bytes 1024M --
> timeout 100s -v
> 	With Hinting:
> 		real: 1m40.726s
> 		user: 1m23.449s
> 		sys : 0m5.576s
> 	Without Hinting:
> 		real: 1m40.378s
> 		user: 1m21.292s
> 		sys : 0m4.972s

These numbers look really good, but these workloads are mostly in
user space.

Could you also try with more kernel heavy workloads, like netperf
(sender and receiver on the same CPU, vs sender and receiver on
different CPUs) and hackbench?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v4 0/6] KVM: Guest page hinting
  2017-11-12 21:23 ` [Patch v4 0/6] KVM: Guest " Rik van Riel
@ 2017-11-13 15:14   ` Nitesh Narayan Lal
  0 siblings, 0 replies; 17+ messages in thread
From: Nitesh Narayan Lal @ 2017-11-13 15:14 UTC (permalink / raw)
  To: Rik van Riel, kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz,
	david, mst, dodgen, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 1276 bytes --]

Hi Rik,


On 11/12/2017 04:23 PM, Rik van Riel wrote:
> On Fri, 2017-11-03 at 16:30 -0400, nilal@redhat.com wrote:
>> The following patch-set proposes an efficient mechanism for handing
>> freed memory between the guest and the host. It enables the guests
>> with DAX (no page cache) to rapidly free and reclaim memory to and
>> from the host respectively. 
>
>> Performance:
>> 	Test criteria: Kernel Build
>> 	Command: make clean;make defconfig;time make
>> 	With Hinting:
>> 		real: 21m24.680s
>> 		user: 16m3.362s
>> 		sys : 2m19.027s
>> 	Without Hinting:
>> 		real: 21m18.062s
>> 		user: 16m13.969s
>> 		sys : 1m17.884s
>>
>> 	Test criteria: Stress Test
>> 	Command: time stress --io 2 --cpu 2 --vm 2 --vm-bytes 1024M --
>> timeout 100s -v
>> 	With Hinting:
>> 		real: 1m40.726s
>> 		user: 1m23.449s
>> 		sys : 0m5.576s
>> 	Without Hinting:
>> 		real: 1m40.378s
>> 		user: 1m21.292s
>> 		sys : 0m4.972s
> These numbers look really good, but these workloads are mostly in
> user space.
>
> Could you also try with more kernel heavy workloads, like netperf
> (sender and receiver on the same CPU, vs sender and receiver on
> different CPUs) and hackbench?
>
Yep, I can do that. Thanks for the suggestion.
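Tentatively, I am planning runs along these lines (the exact options
are my assumption for now and may change while setting up the tests):

# netperf: sender and receiver pinned to the same CPU
netserver
netperf -H 127.0.0.1 -T 0,0 -t TCP_STREAM

# netperf: sender and receiver on different CPUs
netperf -H 127.0.0.1 -T 0,1 -t TCP_STREAM

# hackbench: 10 groups, process mode
hackbench -g 10 -l 1000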

-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v4 1/6] KVM: Support for guest page hinting
  2017-11-03 20:30 ` [Patch v4 1/6] KVM: Support for guest " nilal
@ 2017-11-13 17:59   ` Michael S. Tsirkin
  2017-11-15 19:38     ` Nitesh Narayan Lal
  0 siblings, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2017-11-13 17:59 UTC (permalink / raw)
  To: nilal
  Cc: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	dodgen, konrad.wilk

On Fri, Nov 03, 2017 at 04:30:08PM -0400, nilal@redhat.com wrote:
> From: Nitesh Narayan Lal <nilal@redhat.com>
> 
> This patch includes the following:
> 1. Basic skeleton for the support
> 2. Enablement of x86 platform to use the same
> 
> Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
> ---
>  arch/x86/Kbuild         |  2 +-
>  arch/x86/kvm/Makefile   |  2 ++
>  include/linux/gfp.h     |  7 +++++++
>  virt/kvm/Kconfig        |  4 ++++
>  virt/kvm/page_hinting.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 60 insertions(+), 1 deletion(-)
>  create mode 100644 virt/kvm/page_hinting.c
> 
> diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
> index 0038a2d..7d39d7d 100644
> --- a/arch/x86/Kbuild
> +++ b/arch/x86/Kbuild
> @@ -2,7 +2,7 @@ obj-y += entry/
>  
>  obj-$(CONFIG_PERF_EVENTS) += events/
>  
> -obj-$(CONFIG_KVM) += kvm/
> +obj-$(subst m,y,$(CONFIG_KVM)) += kvm/
>  
>  # Xen paravirtualization support
>  obj-$(CONFIG_XEN) += xen/
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 09d4b17..d8a3800 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -15,6 +15,8 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
>  			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
>  			   hyperv.o page_track.o debugfs.o
>  
> +obj-$(CONFIG_KVM_FREE_PAGE_HINTING)	+= $(KVM)/page_hinting.o
> +
>  kvm-intel-y		+= vmx.o pmu_intel.o
>  kvm-amd-y		+= svm.o pmu_amd.o
>  
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index f780718..a74371f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -452,6 +452,13 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>  	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>  }
>  
> +#ifdef	CONFIG_KVM_FREE_PAGE_HINTING
> +#define HAVE_ARCH_ALLOC_PAGE
> +#define HAVE_ARCH_FREE_PAGE
> +void arch_free_page(struct page *page, int order);
> +void arch_alloc_page(struct page *page, int order);
> +#endif
> +
>  #ifndef HAVE_ARCH_FREE_PAGE
>  static inline void arch_free_page(struct page *page, int order) { }
>  #endif


So the idea is to send pages to host in arch_free_page?

However, I notice:

        arch_free_page(page, order);
        kernel_poison_pages(page, 1 << order, 0);

Thus pages are immediately written to after they are freed.
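For context, those two calls sit together in the mm free path; here is a
simplified sketch of free_pages_prepare() in mm/page_alloc.c (checks and
the exact signature elided):

static bool free_pages_prepare(struct page *page, unsigned int order)
{
	/* ... sanity checks and debug hooks elided ... */
	arch_free_page(page, order);		  /* hint would go out here */
	kernel_poison_pages(page, 1 << order, 0); /* page written right after */
	/* ... */
	return true;
}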





> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index b0cc1a3..936c71d 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -50,3 +50,7 @@ config KVM_COMPAT
>  
>  config HAVE_KVM_IRQ_BYPASS
>         bool
> +
> +config KVM_FREE_PAGE_HINTING
> +       def_bool y
> +       depends on KVM
> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
> new file mode 100644
> index 0000000..39d2b1d
> --- /dev/null
> +++ b/virt/kvm/page_hinting.c
> @@ -0,0 +1,46 @@
> +#include <linux/gfp.h>
> +#include <linux/mm.h>
> +#include <linux/page_ref.h>
> +#include <linux/kvm_host.h>
> +#include <linux/sort.h>
> +
> +#include <trace/events/kvm.h>
> +
> +#define MAX_FGPT_ENTRIES	1000
> +#define HYPERLIST_THRESHOLD	500
> +/*
> + * struct kvm_free_pages - Tracks the pages which are freed by the guest.
> + * @pfn	- page frame number for the page which is to be freed
> + * @pages - number of pages which are supposed to be freed.
> + * A global array object is used to hold the list of pfn and number of pages
> + * which are freed by the guest. This list may also have fragmented pages so
> + * defragmentation is a must prior to the hypercall.
> + */
> +struct kvm_free_pages {
> +	unsigned long pfn;
> +	unsigned int pages;
> +};
> +
> +/*
> + * hypervisor_pages - It is a dummy structure passed with the hypercall.
> + * @pfn - page frame number for the page which is to be freed.
> + * @pages - number of pages which are supposed to be freed.
> + * A global array object is used to hold the list of pfn and pages and is
> + * passed as part of the hypercall.
> + */
> +struct hypervisor_pages {
> +	unsigned long pfn;
> +	unsigned int pages;
> +};
> +
> +DEFINE_PER_CPU(struct kvm_free_pages [MAX_FGPT_ENTRIES], kvm_pt);
> +DEFINE_PER_CPU(int, kvm_pt_idx);
> +struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
> +
> +void arch_alloc_page(struct page *page, int order)
> +{
> +}
> +
> +void arch_free_page(struct page *page, int order)
> +{
> +}
> -- 
> 2.9.4

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v4 2/6] KVM: Guest page hinting functionality
  2017-11-03 20:30 ` [Patch v4 2/6] KVM: Guest page hinting functionality nilal
@ 2017-11-13 18:03   ` Michael S. Tsirkin
  0 siblings, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2017-11-13 18:03 UTC (permalink / raw)
  To: nilal
  Cc: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	dodgen, konrad.wilk

On Fri, Nov 03, 2017 at 04:30:09PM -0400, nilal@redhat.com wrote:
> From: Nitesh Narayan Lal <niteshnarayanlalleo@gmail.com>
> 
> This patch adds the guest implementation in order to maintain the list of
> pages which are freed by the guest and are not reused. To avoid any
> reallocation it takes a seqlock once the list is completely filled.
> It does not carry the hypercall-related changes, though.
> 
> Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
> ---
>  virt/kvm/page_hinting.c | 245 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 243 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
> index 39d2b1d..658856d 100644
> --- a/virt/kvm/page_hinting.c
> +++ b/virt/kvm/page_hinting.c
> @@ -3,8 +3,7 @@
>  #include <linux/page_ref.h>
>  #include <linux/kvm_host.h>
>  #include <linux/sort.h>
> -
> -#include <trace/events/kvm.h>
> +#include <linux/kernel.h>
>  
>  #define MAX_FGPT_ENTRIES	1000
>  #define HYPERLIST_THRESHOLD	500
> @@ -33,14 +32,256 @@ struct hypervisor_pages {
>  	unsigned int pages;
>  };
>  
> +static __cacheline_aligned_in_smp DEFINE_SEQLOCK(guest_page_lock);
>  DEFINE_PER_CPU(struct kvm_free_pages [MAX_FGPT_ENTRIES], kvm_pt);
>  DEFINE_PER_CPU(int, kvm_pt_idx);
>  struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
>  
> +static void empty_hyperlist(void)
> +{
> +	int i = 0;
> +
> +	while (i < MAX_FGPT_ENTRIES) {
> +		hypervisor_pagelist[i].pfn = 0;
> +		hypervisor_pagelist[i].pages = 0;
> +		i++;
> +	}
> +}
> +
> +static void make_hypercall(void)
> +{
> +	/*
> +	 * Dummy function: to be filled later.
> +	 */
> +	empty_hyperlist();
> +}
> +
> +static int sort_pfn(const void *a1, const void *b1)
> +{
> +	const struct hypervisor_pages *a = a1;
> +	const struct hypervisor_pages *b = b1;
> +
> +	if (a->pfn > b->pfn)
> +		return 1;
> +
> +	if (a->pfn < b->pfn)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static int pack_hyperlist(void)
> +{
> +	int i = 0, j = 0;
> +
> +	while (i < MAX_FGPT_ENTRIES) {
> +		if (hypervisor_pagelist[i].pfn != 0) {
> +			if (i != j) {
> +				hypervisor_pagelist[j].pfn =
> +						hypervisor_pagelist[i].pfn;
> +				hypervisor_pagelist[j].pages =
> +						hypervisor_pagelist[i].pages;
> +			}
> +			j++;
> +		}
> +		i++;
> +	}
> +	i = j;
> +	while (j < MAX_FGPT_ENTRIES) {
> +		hypervisor_pagelist[j].pfn = 0;
> +		hypervisor_pagelist[j].pages = 0;
> +		j++;
> +	}
> +	return i;
> +}
> +
> +int compress_hyperlist(void)
> +{
> +	int i = 0, j = 1, merge_counter = 0, ret = 0;
> +
> +	sort(hypervisor_pagelist, MAX_FGPT_ENTRIES,
> +	     sizeof(struct hypervisor_pages), sort_pfn, NULL);
> +	while (i < MAX_FGPT_ENTRIES && j < MAX_FGPT_ENTRIES) {
> +		unsigned long pfni = hypervisor_pagelist[i].pfn;
> +		unsigned int pagesi = hypervisor_pagelist[i].pages;
> +		unsigned long pfnj = hypervisor_pagelist[j].pfn;
> +		unsigned int pagesj = hypervisor_pagelist[j].pages;
> +
> +		if (pfnj <= pfni) {
> +			if (((pfnj + pagesj - 1) <= (pfni + pagesi - 1)) &&
> +			    ((pfnj + pagesj - 1) >= (pfni - 1))) {
> +				hypervisor_pagelist[i].pfn = pfnj;
> +				hypervisor_pagelist[i].pages += pfni - pfnj;
> +				hypervisor_pagelist[j].pfn = 0;
> +				hypervisor_pagelist[j].pages = 0;
> +				j++;
> +				merge_counter++;
> +				continue;
> +			} else if ((pfnj + pagesj - 1) > (pfni + pagesi - 1)) {
> +				hypervisor_pagelist[i].pfn = pfnj;
> +				hypervisor_pagelist[i].pages = pagesj;
> +				hypervisor_pagelist[j].pfn = 0;
> +				hypervisor_pagelist[j].pages = 0;
> +				j++;
> +				merge_counter++;
> +				continue;
> +			}
> +		} else if (pfnj > pfni) {
> +			if ((pfnj + pagesj - 1) > (pfni + pagesi - 1) &&
> +			    (pfnj <= pfni + pagesi)) {
> +				hypervisor_pagelist[i].pages +=
> +						(pfnj + pagesj - 1) -
> +						(pfni + pagesi - 1);
> +				hypervisor_pagelist[j].pfn = 0;
> +				hypervisor_pagelist[j].pages = 0;
> +				j++;
> +				merge_counter++;
> +				continue;
> +			} else if ((pfnj + pagesj - 1) <= (pfni + pagesi - 1)) {
> +				hypervisor_pagelist[j].pfn = 0;
> +				hypervisor_pagelist[j].pages = 0;
> +				j++;
> +				merge_counter++;
> +				continue;
> +			}
> +		}
> +		i = j;
> +		j++;
> +	}
> +	if (merge_counter != 0)
> +		ret = pack_hyperlist() - 1;
> +	else
> +		ret = MAX_FGPT_ENTRIES - 1;
> +	return ret;
> +}
> +
> +void copy_hyperlist(int hyper_idx)
> +{
> +	int *idx = &get_cpu_var(kvm_pt_idx);
> +	struct kvm_free_pages *free_page_obj;
> +	int i = 0;
> +
> +	free_page_obj = &get_cpu_var(kvm_pt)[0];
> +	while (i < hyper_idx) {
> +		free_page_obj[*idx].pfn = hypervisor_pagelist[i].pfn;
> +		free_page_obj[*idx].pages = hypervisor_pagelist[i].pages;
> +		*idx += 1;
> +		i++;
> +	}
> +	empty_hyperlist();
> +	put_cpu_var(kvm_pt);
> +	put_cpu_var(kvm_pt_idx);
> +}
> +
> +/*
> + * arch_free_page_slowpath() - This function adds the guest free page entries
> + * to hypervisor_pages list and also ensures defragmentation prior to addition
> + * if it is present with any entry of the kvm_free_pages list.
> + */
> +void arch_free_page_slowpath(void)
> +{
> +	int idx = 0;
> +	int hyper_idx = -1;
> +	int *kvm_idx = &get_cpu_var(kvm_pt_idx);
> +	struct kvm_free_pages *free_page_obj = &get_cpu_var(kvm_pt)[0];
> +
> +	write_seqlock(&guest_page_lock);
> +	while (idx < MAX_FGPT_ENTRIES) {
> +		unsigned long pfn = free_page_obj[idx].pfn;
> +		unsigned long pfn_end = free_page_obj[idx].pfn +
> +					free_page_obj[idx].pages - 1;
> +		bool prev_free = false;
> +
> +		while (pfn <= pfn_end) {
> +			struct page *p = pfn_to_page(pfn);
> +
> +			if (PageCompound(p)) {
> +				struct page *head_page = compound_head(p);
> +				unsigned long head_pfn = page_to_pfn(head_page);
> +				unsigned int alloc_pages =
> +					1 << compound_order(head_page);
> +
> +				pfn = head_pfn + alloc_pages;
> +				prev_free = false;
> +				continue;
> +			}
> +			if (page_ref_count(p)) {
> +				pfn++;
> +				prev_free = false;
> +				continue;
> +			}
> +			/*
> +			 * The page is free so add it to the list and free the
> +			 * hypervisor_pagelist if required.
> +			 */
> +			if (!prev_free) {
> +				hyper_idx++;
> +				hypervisor_pagelist[hyper_idx].pfn = pfn;
> +				hypervisor_pagelist[hyper_idx].pages = 1;
> +				if (hyper_idx == MAX_FGPT_ENTRIES - 1) {
> +					hyper_idx =  compress_hyperlist();
> +					if (hyper_idx >=
> +					    HYPERLIST_THRESHOLD) {
> +						make_hypercall();
> +						hyper_idx = 0;
> +					}
> +				}
> +				/*
> +				 * If the next contiguous page is free, it can
> +				 * be added to this same entry.
> +				 */
> +				prev_free = true;
> +			} else {
> +				/*
> +				 * Multiple adjacent free pages
> +				 */
> +				hypervisor_pagelist[hyper_idx].pages++;
> +			}
> +			pfn++;
> +		}
> +		free_page_obj[idx].pfn = 0;
> +		free_page_obj[idx].pages = 0;
> +		idx++;
> +	}
> +	*kvm_idx = 0;
> +	put_cpu_var(kvm_pt);
> +	put_cpu_var(kvm_pt_idx);
> +	write_sequnlock(&guest_page_lock);
> +}
> +
>  void arch_alloc_page(struct page *page, int order)
>  {
> +	unsigned int seq;
> +
> +	/*
> +	 * arch_free_page will acquire the lock once the list carrying guest
> +	 * free pages is full and a hypercall will be made. Until the complete
> +	 * free page list is traversed, no further allocation will be allowed.
> +	 */
> +	do {
> +		seq = read_seqbegin(&guest_page_lock);
> +	} while (read_seqretry(&guest_page_lock, seq));
>  }
>  
>  void arch_free_page(struct page *page, int order)
>  {
> +	int *free_page_idx = &get_cpu_var(kvm_pt_idx);
> +	struct kvm_free_pages *free_page_obj;
> +	unsigned long flags;
> +
> +	/*
> +	 * use of global variables may trigger a race condition between irq and
> +	 * process context causing unwanted overwrites. This will be replaced
> +	 * with a better solution to prevent such race conditions.

When do you plan to replace this? If you don't want this to
be merged yet pls include RFC in subject.

> +	 */
> +	local_irq_save(flags);
> +	free_page_obj = &get_cpu_var(kvm_pt)[0];
> +	free_page_obj[*free_page_idx].pfn = page_to_pfn(page);
> +	free_page_obj[*free_page_idx].pages = 1 << order;
> +	*free_page_idx += 1;
> +	if (*free_page_idx == MAX_FGPT_ENTRIES)
> +		arch_free_page_slowpath();
> +	put_cpu_var(kvm_pt);
> +	put_cpu_var(kvm_pt_idx);
> +	local_irq_restore(flags);
>  }
> -- 
> 2.9.4

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v4 1/6] KVM: Support for guest page hinting
  2017-11-13 17:59   ` Michael S. Tsirkin
@ 2017-11-15 19:38     ` Nitesh Narayan Lal
  2017-11-15 20:06       ` Michael S. Tsirkin
  0 siblings, 1 reply; 17+ messages in thread
From: Nitesh Narayan Lal @ 2017-11-15 19:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	dodgen, konrad.wilk


[-- Attachment #1.1: Type: text/plain, Size: 7710 bytes --]

Hi Michael,


On 11/13/2017 12:59 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2017 at 04:30:08PM -0400, nilal@redhat.com wrote:
>> From: Nitesh Narayan Lal <nilal@redhat.com>
>>
>> This patch includes the following:
>> 1. Basic skeleton for the support
>> 2. Enablement of x86 platform to use the same
>>
>> Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
>> ---
>>  arch/x86/Kbuild         |  2 +-
>>  arch/x86/kvm/Makefile   |  2 ++
>>  include/linux/gfp.h     |  7 +++++++
>>  virt/kvm/Kconfig        |  4 ++++
>>  virt/kvm/page_hinting.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>>  5 files changed, 60 insertions(+), 1 deletion(-)
>>  create mode 100644 virt/kvm/page_hinting.c
>>
>> diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
>> index 0038a2d..7d39d7d 100644
>> --- a/arch/x86/Kbuild
>> +++ b/arch/x86/Kbuild
>> @@ -2,7 +2,7 @@ obj-y += entry/
>>  
>>  obj-$(CONFIG_PERF_EVENTS) += events/
>>  
>> -obj-$(CONFIG_KVM) += kvm/
>> +obj-$(subst m,y,$(CONFIG_KVM)) += kvm/
>>  
>>  # Xen paravirtualization support
>>  obj-$(CONFIG_XEN) += xen/
>> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
>> index 09d4b17..d8a3800 100644
>> --- a/arch/x86/kvm/Makefile
>> +++ b/arch/x86/kvm/Makefile
>> @@ -15,6 +15,8 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
>>  			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
>>  			   hyperv.o page_track.o debugfs.o
>>  
>> +obj-$(CONFIG_KVM_FREE_PAGE_HINTING)	+= $(KVM)/page_hinting.o
>> +
>>  kvm-intel-y		+= vmx.o pmu_intel.o
>>  kvm-amd-y		+= svm.o pmu_amd.o
>>  
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index f780718..a74371f 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -452,6 +452,13 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>>  	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>>  }
>>  
>> +#ifdef	CONFIG_KVM_FREE_PAGE_HINTING
>> +#define HAVE_ARCH_ALLOC_PAGE
>> +#define HAVE_ARCH_FREE_PAGE
>> +void arch_free_page(struct page *page, int order);
>> +void arch_alloc_page(struct page *page, int order);
>> +#endif
>> +
>>  #ifndef HAVE_ARCH_FREE_PAGE
>>  static inline void arch_free_page(struct page *page, int order) { }
>>  #endif
>
> So the idea is to send pages to host in arch_free_page?
>
> However, I notice:
>
>         arch_free_page(page, order);
>         kernel_poison_pages(page, 1 << order, 0);
>
> Thus pages are immediately written to after they are freed.
>
>
>
>
>
>> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
>> index b0cc1a3..936c71d 100644
>> --- a/virt/kvm/Kconfig
>> +++ b/virt/kvm/Kconfig
>> @@ -50,3 +50,7 @@ config KVM_COMPAT
>>  
>>  config HAVE_KVM_IRQ_BYPASS
>>         bool
>> +
>> +config KVM_FREE_PAGE_HINTING
>> +       def_bool y
>> +       depends on KVM
>> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
>> new file mode 100644
>> index 0000000..39d2b1d
>> --- /dev/null
>> +++ b/virt/kvm/page_hinting.c
>> @@ -0,0 +1,46 @@
>> +#include <linux/gfp.h>
>> +#include <linux/mm.h>
>> +#include <linux/page_ref.h>
>> +#include <linux/kvm_host.h>
>> +#include <linux/sort.h>
>> +
>> +#include <trace/events/kvm.h>
>> +
>> +#define MAX_FGPT_ENTRIES	1000
>> +#define HYPERLIST_THRESHOLD	500
>> +/*
>> + * struct kvm_free_pages - Tracks the pages which are freed by the guest.
>> + * @pfn	- page frame number for the page which is to be freed
>> + * @pages - number of pages which are supposed to be freed.
>> + * A global array object is used to hold the list of pfn and number of pages
>> + * which are freed by the guest. This list may also have fragmented pages so
>> + * defragmentation is a must prior to the hypercall.
>> + */
>> +struct kvm_free_pages {
>> +	unsigned long pfn;
>> +	unsigned int pages;
>> +};
>> +
>> +/*
>> + * hypervisor_pages - It is a dummy structure passed with the hypercall.
>> + * @pfn - page frame number for the page which is to be freed.
>> + * @pages - number of pages which are supposed to be freed.
>> + * A global array object is used to hold the list of pfn and pages and is
>> + * passed as part of the hypercall.
>> + */
>> +struct hypervisor_pages {
>> +	unsigned long pfn;
>> +	unsigned int pages;
>> +};
>> +
>> +DEFINE_PER_CPU(struct kvm_free_pages [MAX_FGPT_ENTRIES], kvm_pt);
>> +DEFINE_PER_CPU(int, kvm_pt_idx);
>> +struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
>> +
>> +void arch_alloc_page(struct page *page, int order)
>> +{
>> +}
>> +
>> +void arch_free_page(struct page *page, int order)
>> +{
>> +}
>> -- 
>> 2.9.4
Thanks, Michael, for pointing out this bug.
The issue arises when both 'Guest Page Hinting' and 'Page Poisoning'
are enabled: the guest may report memory corruption errors when the
hypervisor maps different physical memory into the guest after freeing
the earlier mapped memory.

Page Poisoning is a feature in which the page is filled with a specific
pattern (0x00 or 0xaa) after arch_free_page and the same pattern is
verified before arch_alloc_page to prevent the following issues:

    *information leak from the freed data
    *use-after-free bugs
    *memory corruption

The pattern used depends on CONFIG_PAGE_POISONING_ZERO.
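To make the interaction concrete, the poison/verify cycle is roughly
the following (a minimal sketch, not the exact mm/page_poison.c code;
PAGE_POISON is the kernel's pattern macro, 0x00 with
CONFIG_PAGE_POISONING_ZERO and 0xaa otherwise):

/* On free: fill the page with the poison pattern. */
static void poison_page_sketch(struct page *page)
{
	void *addr = kmap_atomic(page);

	memset(addr, PAGE_POISON, PAGE_SIZE);
	kunmap_atomic(addr);
}

/* On alloc: verify the pattern survived; a mismatch is reported as
 * memory corruption. */
static bool page_poison_intact_sketch(struct page *page)
{
	unsigned char *addr = kmap_atomic(page);
	bool ok = true;
	size_t i;

	for (i = 0; i < PAGE_SIZE; i++) {
		if (addr[i] != PAGE_POISON) {
			ok = false;
			break;
		}
	}
	kunmap_atomic(addr);
	return ok;
}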

Guest page hinting is support for handing freed memory between the
guest and the host. It enables guests with DAX (no page cache) to
rapidly free memory to, and reclaim memory from, the host. To achieve
this it uses two lists: one CPU-local and one CPU-global.
Whenever a page is freed it is added to the respective CPU-local list
through arch_free_page() until that list is full. Once the list is full,
a seqlock is taken to prevent any further page allocations, the
CPU-local list is traversed, and only those pages which have not been
reallocated by the guest are added to the CPU-global list. This global
list is then sent to the hypervisor/host.

After the pages in the global list are freed, the following may happen:

    *The hypervisor reallocates the freed memory back to the guest if
required
    *The hypervisor frees the memory and maps different physical memory
in its place

To prevent any information leak, the hypervisor fills memory with zeroes
before allocating it to the guest.
The issue arises when the pattern used for Page Poisoning is 0xaa while
the newly allocated page the guest receives from the hypervisor is
filled with the pattern 0x00. This results in memory corruption errors.
In the current implementation, page poisoning is enabled by setting a
boot-time parameter (page_poison). Hence there are two ways we could
resolve the above-stated issue:

1. We can disable page poisoning if guest page hinting is enabled. The
only code change required in this case is to check whether
CONFIG_PAGE_POISONING is enabled and, if so, disable it so that the
poison pattern is not checked on newly allocated memory received from
the hypervisor (see the sketch after the two options).

2. We can have a flag in the page structure with which we can control
poisoning on a per-page basis. Using this we can disable poisoning for
only those pages which are sent to the hypervisor, so that the poison
pattern is not checked on the newly allocated memory the guest receives
from the hypervisor. To incorporate this, though, I may have to make
changes in page_poison.c, mm_types.h and other related files.
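For option 1, the gating could look roughly like this (a minimal
sketch; guest_page_hinting_active() is a hypothetical helper, the hook
point is still open, and the real page_poisoning_enabled() in
mm/page_poison.c also considers debug_pagealloc):

bool page_poisoning_enabled(void)
{
	/* Hypothetical: report poisoning as disabled whenever free page
	 * hinting is active, so pages returned by the hypervisor are
	 * never checked against the poison pattern.
	 * want_page_poisoning is the existing boot-time flag set by
	 * the page_poison parameter. */
	if (IS_ENABLED(CONFIG_KVM_FREE_PAGE_HINTING) &&
	    guest_page_hinting_active())
		return false;
	return want_page_poisoning;
}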

I am looking forward to the comments and suggestions.

[1] http://www.spinics.net/lists/kvm/msg153666.html
[2] https://www.spinics.net/lists/kvm/msg158054.html

-- 
Regards
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v4 1/6] KVM: Support for guest page hinting
  2017-11-15 19:38     ` Nitesh Narayan Lal
@ 2017-11-15 20:06       ` Michael S. Tsirkin
  0 siblings, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2017-11-15 20:06 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, pbonzini, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	dodgen, konrad.wilk

On Wed, Nov 15, 2017 at 02:38:20PM -0500, Nitesh Narayan Lal wrote:
> Hi Michael,
> 
> 
> On 11/13/2017 12:59 PM, Michael S. Tsirkin wrote:
> > On Fri, Nov 03, 2017 at 04:30:08PM -0400, nilal@redhat.com wrote:
> >> From: Nitesh Narayan Lal <nilal@redhat.com>
> >>
> >> This patch includes the following:
> >> 1. Basic skeleton for the support
> >> 2. Enablement of x86 platform to use the same
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nilal@redhat.com>
> >> ---
> >>  arch/x86/Kbuild         |  2 +-
> >>  arch/x86/kvm/Makefile   |  2 ++
> >>  include/linux/gfp.h     |  7 +++++++
> >>  virt/kvm/Kconfig        |  4 ++++
> >>  virt/kvm/page_hinting.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  5 files changed, 60 insertions(+), 1 deletion(-)
> >>  create mode 100644 virt/kvm/page_hinting.c
> >>
> >> diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
> >> index 0038a2d..7d39d7d 100644
> >> --- a/arch/x86/Kbuild
> >> +++ b/arch/x86/Kbuild
> >> @@ -2,7 +2,7 @@ obj-y += entry/
> >>  
> >>  obj-$(CONFIG_PERF_EVENTS) += events/
> >>  
> >> -obj-$(CONFIG_KVM) += kvm/
> >> +obj-$(subst m,y,$(CONFIG_KVM)) += kvm/
> >>  
> >>  # Xen paravirtualization support
> >>  obj-$(CONFIG_XEN) += xen/
> >> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> >> index 09d4b17..d8a3800 100644
> >> --- a/arch/x86/kvm/Makefile
> >> +++ b/arch/x86/kvm/Makefile
> >> @@ -15,6 +15,8 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
> >>  			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
> >>  			   hyperv.o page_track.o debugfs.o
> >>  
> >> +obj-$(CONFIG_KVM_FREE_PAGE_HINTING)	+= $(KVM)/page_hinting.o
> >> +
> >>  kvm-intel-y		+= vmx.o pmu_intel.o
> >>  kvm-amd-y		+= svm.o pmu_amd.o
> >>  
> >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >> index f780718..a74371f 100644
> >> --- a/include/linux/gfp.h
> >> +++ b/include/linux/gfp.h
> >> @@ -452,6 +452,13 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> >>  	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> >>  }
> >>  
> >> +#ifdef	CONFIG_KVM_FREE_PAGE_HINTING
> >> +#define HAVE_ARCH_ALLOC_PAGE
> >> +#define HAVE_ARCH_FREE_PAGE
> >> +void arch_free_page(struct page *page, int order);
> >> +void arch_alloc_page(struct page *page, int order);
> >> +#endif
> >> +
> >>  #ifndef HAVE_ARCH_FREE_PAGE
> >>  static inline void arch_free_page(struct page *page, int order) { }
> >>  #endif
> >
> > So the idea is to send pages to host in arch_free_page?
> >
> > However, I notice:
> >
> >         arch_free_page(page, order);
> >         kernel_poison_pages(page, 1 << order, 0);
> >
> > Thus pages are immediately written to after they are freed.
> >
> >
> >
> >
> >
> >> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> >> index b0cc1a3..936c71d 100644
> >> --- a/virt/kvm/Kconfig
> >> +++ b/virt/kvm/Kconfig
> >> @@ -50,3 +50,7 @@ config KVM_COMPAT
> >>  
> >>  config HAVE_KVM_IRQ_BYPASS
> >>         bool
> >> +
> >> +config KVM_FREE_PAGE_HINTING
> >> +       def_bool y
> >> +       depends on KVM
> >> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
> >> new file mode 100644
> >> index 0000000..39d2b1d
> >> --- /dev/null
> >> +++ b/virt/kvm/page_hinting.c
> >> @@ -0,0 +1,46 @@
> >> +#include <linux/gfp.h>
> >> +#include <linux/mm.h>
> >> +#include <linux/page_ref.h>
> >> +#include <linux/kvm_host.h>
> >> +#include <linux/sort.h>
> >> +
> >> +#include <trace/events/kvm.h>
> >> +
> >> +#define MAX_FGPT_ENTRIES	1000
> >> +#define HYPERLIST_THRESHOLD	500
> >> +/*
> >> + * struct kvm_free_pages - Tracks the pages which are freed by the guest.
> >> + * @pfn	- page frame number for the page which is to be freed
> >> + * @pages - number of pages which are supposed to be freed.
> >> + * A global array object is used to hold the list of pfn and number of pages
> >> + * which are freed by the guest. This list may also have fragmented pages so
> >> + * defragmentation is a must prior to the hypercall.
> >> + */
> >> +struct kvm_free_pages {
> >> +	unsigned long pfn;
> >> +	unsigned int pages;
> >> +};
> >> +
> >> +/*
> >> + * hypervisor_pages - It is a dummy structure passed with the hypercall.
> >> + * @pfn - page frame number for the page which is to be freed.
> >> + * @pages - number of pages which are supposed to be freed.
> >> + * A global array object is used to hold the list of pfn and pages and is
> >> + * passed as part of the hypercall.
> >> + */
> >> +struct hypervisor_pages {
> >> +	unsigned long pfn;
> >> +	unsigned int pages;
> >> +};
> >> +
> >> +DEFINE_PER_CPU(struct kvm_free_pages [MAX_FGPT_ENTRIES], kvm_pt);
> >> +DEFINE_PER_CPU(int, kvm_pt_idx);
> >> +struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
> >> +
> >> +void arch_alloc_page(struct page *page, int order)
> >> +{
> >> +}
> >> +
> >> +void arch_free_page(struct page *page, int order)
> >> +{
> >> +}
> >> -- 
> >> 2.9.4
> Thanks, Michael, for pointing out this bug.
> The issue arises when both 'Guest Page Hinting' and 'Page Poisoning'
> are enabled: the guest may report memory corruption errors when the
> hypervisor maps different physical memory into the guest after freeing
> the earlier mapped memory.
> 
> Page Poisoning is a feature in which the page is filled with a specific
> pattern (0x00 or 0xaa) after arch_free_page and the same pattern is
> verified before arch_alloc_page to prevent the following issues:
> 
>     *information leak from the freed data
>     *use-after-free bugs
>     *memory corruption
> 
> The pattern used depends on CONFIG_PAGE_POISONING_ZERO.
> 
> Guest page hinting is support for handing freed memory between the
> guest and the host. It enables guests with DAX (no page cache) to
> rapidly free memory to, and reclaim memory from, the host. To achieve
> this it uses two lists: one CPU-local and one CPU-global.
> Whenever a page is freed it is added to the respective CPU-local list
> through arch_free_page() until that list is full. Once the list is full,
> a seqlock is taken to prevent any further page allocations, the
> CPU-local list is traversed, and only those pages which have not been
> reallocated by the guest are added to the CPU-global list. This global
> list is then sent to the hypervisor/host.
> 
> After the pages in the global list are freed, the following may happen:
> 
>     *The hypervisor reallocates the freed memory back to the guest if
> required
>     *The hypervisor frees the memory and maps different physical memory
> in its place
> 
> To prevent any information leak, the hypervisor fills memory with zeroes
> before allocating it to the guest.
> The issue arises when the pattern used for Page Poisoning is 0xaa while
> the newly allocated page the guest receives from the hypervisor is
> filled with the pattern 0x00. This results in memory corruption errors.
> In the current implementation, page poisoning is enabled by setting a
> boot-time parameter (page_poison). Hence there are two ways we could
> resolve the above-stated issue:
> 
> 1. We can disable page poisoning if guest page hinting is enabled. The
> only code change required in this case is to check whether
> CONFIG_PAGE_POISONING is enabled and, if so, disable it so that the
> poison pattern is not checked on newly allocated memory received from
> the hypervisor.
> 
> 2. We can have a flag in the page structure with which we can control
> poisoning on a per-page basis. Using this we can disable poisoning for
> only those pages which are sent to the hypervisor, so that the poison
> pattern is not checked on the newly allocated memory the guest receives
> from the hypervisor. To incorporate this, though, I may have to make
> changes in page_poison.c, mm_types.h and other related files.
> 
> I am looking forward to the comments and suggestions.
> 
> [1] http://www.spinics.net/lists/kvm/msg153666.html
> [2] https://www.spinics.net/lists/kvm/msg158054.html

Let's not discuss this off-list. Please repost with Cc a
mailing list so the discussion is archived.

Thanks!

> -- 
> Regards
> Nitesh
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-11-15 20:06 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-03 20:30 [Patch v4 0/6] KVM: Guest page hinting nilal
2017-11-03 20:30 ` [Patch v4 1/6] KVM: Support for guest " nilal
2017-11-13 17:59   ` Michael S. Tsirkin
2017-11-15 19:38     ` Nitesh Narayan Lal
2017-11-15 20:06       ` Michael S. Tsirkin
2017-11-03 20:30 ` [Patch v4 2/6] KVM: Guest page hinting functionality nilal
2017-11-13 18:03   ` Michael S. Tsirkin
2017-11-03 20:30 ` [Patch v4 3/6] KVM: Adding tracepoints for guest page hinting nilal
2017-11-03 20:30 ` [Patch v4 4/6] virtio: Exposes added descriptor to the other side synchronously nilal
2017-11-03 20:30 ` [Patch v4 5/6] KVM: Sending hyperlist to the host via hinting_vq nilal
2017-11-03 20:30 ` [Patch v4 6/6] KVM: Enabling guest page hinting via static key nilal
2017-11-03 20:37 ` [QEMU PATCH] kvm: Support for guest page hinting nilal
2017-11-03 20:37   ` nilal
2017-11-06 11:21     ` Pankaj Gupta
2017-11-06 14:21       ` Nitesh Narayan Lal
2017-11-12 21:23 ` [Patch v4 0/6] KVM: Guest " Rik van Riel
2017-11-13 15:14   ` Nitesh Narayan Lal
