linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC][PATCH v12 0/2] mm: Support for page reporting
@ 2019-08-12 13:12 Nitesh Narayan Lal
  2019-08-12 13:12 ` [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure Nitesh Narayan Lal
                   ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-12 13:12 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, virtio-dev, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	dave.hansen, mhocko, cohuck

This patch series proposes an efficient mechanism for reporting free memory
from a guest to its hypervisor. It especially enables guests with no page cache
(e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
rapidly hand back free memory to the hypervisor.
This approach has a minimal impact on the existing core-mm infrastructure.

This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
A new hook after buddy merging is used to set the bits in the bitmap for a freed 
page. Each set bit is cleared after they are processed/checked for
re-allocation.
Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
threshold is met, trying to isolate and report pages that are still free.
The isolated pages are stored in a scatterlist and are reported via
virtio-balloon, which is responsible for sending batched pages to the
hypervisor. Once the hypervisor processed the reporting request, the isolated
pages are returned back to the buddy.
The thershold which defines the number of pages which will be isolated and
reported to the hypervisor at a time is currently hardcoded to 16 in the guest.

Benefit analysis:
Number of 5 GB guests (each touching 4 to 5 GB memory) that can be launched on a
15 GB single NUMA system without using swap space in the host.

	    Guest kernel-->	Unmodified		with v12 page reporting
	Number of guests-->	    2				7

Conclusion: In a page-reporting enabled kernel, the guest is able to report
most of its unused memory back to the host. Due to this on the same host, I was
able to launch 7 guests without touching any swap compared to 2 which were
launched with an unmodified kernel.

Performance Analysis:
In order to measure the performance impact of this patch-series over an
unmodified kernel, I am using will-it-scale/page_fault1 on a 30 GB, 24 vcpus
single NUMA guest which is affined to a single node in the host. Over several
runs, I observed that with this patch-series there is a degradation of around
1-3% for certain cases. This degradation could be a result of page-zeroing
overhead which comes with every page-fault in the guest.
I also tried this test on a 2 NUMA node host running page reporting
enabled 60GB guest also having 2 NUMA nodes and 24 vcpus. I observed a similar
degradation of around 1-3% in most of the cases.
For certain cases, the variability even with an unmodified kernel was around
4-6% with every fresh boot. I will continue to investigate this further to find
the reason behind it.

Ongoing work-items:
* I have a working prototype for supporting memory hotplug/hotremove with page
  reporting. However, it still requires more testing and fixes specifically on
  the hotremove side.
  Right now, for any memory hotplug or hotremove request bitmap or its
  respective fields are not changed. Hence, memory added via hotplug is not
  tracked in the bitmap. Similarly, removed memory is not reported to the
  hypervisor by using an online memory check. 
* I will also have to look into the details about how to handle page poisoning
  scenarios and test with directly assigned devices.


Changes from v11:
https://lkml.org/lkml/2019/7/10/742
* Moved the fields required to manage bitmap of free pages to 'struct zone'.
* Replaced the list which was used to hold and report the free pages with
  scatterlist.
* Tried to fix the anti-kernel patterns and improve overall code quality.
* Fixed a few bugs in the code which were reported in the last posting.
* Moved to use MADV_DONTNEED from MADV_FREE.
* Replaced page hinting in favor of page reporting.
* Addressed other comments which I received in the last posting.	


Changes from v10:
https://lkml.org/lkml/2019/6/3/943
* Added logic to take care of multiple NUMA nodes scenarios.
* Simplified the logic for reporting isolated pages to the host. (Eg. replaced
  dynamically allocated arrays with static ones, introduced wait event instead
  of the loop in order to wait for a response from the host)
* Added a mutex to prevent race condition when page reporting is enabled by
  multiple drivers.
* Simplified the logic responsible for decrementing free page counter for each
  zone.
* Simplified code structuring/naming.
 
--

Nitesh Narayan Lal (2):
  mm: page_reporting: core infrastructure
  virtio-balloon: interface to support free page reporting

 drivers/virtio/Kconfig              |   1 +
 drivers/virtio/virtio_balloon.c     |  64 +++++-
 include/linux/mmzone.h              |  11 +
 include/linux/page_reporting.h      |  63 ++++++
 include/uapi/linux/virtio_balloon.h |   1 +
 mm/Kconfig                          |   6 +
 mm/Makefile                         |   1 +
 mm/page_alloc.c                     |  42 +++-
 mm/page_reporting.c                 | 332 ++++++++++++++++++++++++++++
 9 files changed, 513 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/page_reporting.h
 create mode 100644 mm/page_reporting.c

-- 




^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-12 13:12 [RFC][PATCH v12 0/2] mm: Support for page reporting Nitesh Narayan Lal
@ 2019-08-12 13:12 ` Nitesh Narayan Lal
  2019-08-12 18:47   ` Alexander Duyck
  2019-10-10 20:36   ` Alexander Duyck
  2019-08-12 13:12 ` [RFC][Patch v12 2/2] virtio-balloon: interface to support free page reporting Nitesh Narayan Lal
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-12 13:12 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, virtio-dev, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	dave.hansen, mhocko, cohuck

This patch introduces the core infrastructure for free page reporting in
virtual environments. It enables the kernel to track the free pages which
can be reported to its hypervisor so that the hypervisor could
free and reuse that memory as per its requirement.

While the pages are getting processed in the hypervisor (e.g.,
via MADV_DONTNEED), the guest must not use them, otherwise, data loss
would be possible. To avoid such a situation, these pages are
temporarily removed from the buddy. The amount of pages removed
temporarily from the buddy is governed by the backend(virtio-balloon
in our case).

To efficiently identify free pages that can to be reported to the
hypervisor, bitmaps in a coarse granularity are used. Only fairly big
chunks are reported to the hypervisor - especially, to not break up THP
in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
in the bitmap are an indication whether a page *might* be free, not a
guarantee. A new hook after buddy merging sets the bits.

Bitmaps are stored per zone, protected by the zone lock. A workqueue
asynchronously processes the bitmaps, trying to isolate and report pages
that are still free. The backend (virtio-balloon) is responsible for
reporting these batched pages to the host synchronously. Once reporting/
freeing is complete, isolated pages are returned back to the buddy.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 include/linux/mmzone.h         |  11 ++
 include/linux/page_reporting.h |  63 +++++++
 mm/Kconfig                     |   6 +
 mm/Makefile                    |   1 +
 mm/page_alloc.c                |  42 ++++-
 mm/page_reporting.c            | 332 +++++++++++++++++++++++++++++++++
 6 files changed, 448 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/page_reporting.h
 create mode 100644 mm/page_reporting.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d77d717c620c..ba5f5b508f25 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -559,6 +559,17 @@ struct zone {
 	/* Zone statistics */
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 	atomic_long_t		vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
+#ifdef CONFIG_PAGE_REPORTING
+	/* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
+	unsigned long *bitmap;
+	/* Preserve start and end PFN in case they change due to hotplug */
+	unsigned long base_pfn;
+	unsigned long end_pfn;
+	/* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
+	atomic_t free_pages;
+	/* Number of bits required in the bitmap */
+	unsigned long nbits;
+#endif
 } ____cacheline_internodealigned_in_smp;
 
 enum pgdat_flags {
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
new file mode 100644
index 000000000000..37a39589939d
--- /dev/null
+++ b/include/linux/page_reporting.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PAGE_REPORTING_H
+#define _LINUX_PAGE_REPORTING_H
+
+#define PAGE_REPORTING_MIN_ORDER		(MAX_ORDER - 2)
+#define PAGE_REPORTING_MAX_PAGES		16
+
+#ifdef CONFIG_PAGE_REPORTING
+struct page_reporting_config {
+	/* function to hint batch of isolated pages */
+	void (*report)(struct page_reporting_config *phconf,
+		       unsigned int num_pages);
+
+	/* scatterlist to hold the isolated pages to be hinted */
+	struct scatterlist *sg;
+
+	/*
+	 * Maxmimum pages that are going to be hinted to the hypervisor at a
+	 * time of granularity >= PAGE_REPORTING_MIN_ORDER.
+	 */
+	int max_pages;
+
+	/* work object to process page reporting rqeuests */
+	struct work_struct reporting_work;
+
+	/* tracks the number of reporting request processed at a time */
+	atomic_t refcnt;
+};
+
+void __page_reporting_enqueue(struct page *page);
+void __return_isolated_page(struct zone *zone, struct page *page);
+void set_pageblock_migratetype(struct page *page, int migratetype);
+
+/**
+ * page_reporting_enqueue - checks the eligibility of the freed page based on
+ * its order for further page reporting processing.
+ * @page: page which has been freed.
+ * @order: order for the the free page.
+ */
+static inline void page_reporting_enqueue(struct page *page, int order)
+{
+	if (order < PAGE_REPORTING_MIN_ORDER)
+		return;
+	__page_reporting_enqueue(page);
+}
+
+int page_reporting_enable(struct page_reporting_config *phconf);
+void page_reporting_disable(struct page_reporting_config *phconf);
+#else
+static inline void page_reporting_enqueue(struct page *page, int order)
+{
+}
+
+static inline int page_reporting_enable(struct page_reporting_config *phconf)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void page_reporting_disable(struct page_reporting_config *phconf)
+{
+}
+#endif /* CONFIG_PAGE_REPORTING */
+#endif /* _LINUX_PAGE_REPORTING_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 56cec636a1fc..6a35a9dad350 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,10 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+# PAGE_REPORTING will allow the guest to report the free pages to the
+# host in fixed chunks as soon as a fixed threshold is reached.
+config PAGE_REPORTING
+       bool
+       def_bool n
+       depends on X86_64
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 338e528ad436..6a32cdfa61c2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -104,3 +104,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..aa7c89d50c85 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
 #include <linux/psi.h>
+#include <linux/page_reporting.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -903,7 +904,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
 static inline void __free_one_page(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
-		int migratetype)
+		int migratetype, bool needs_reporting)
 {
 	unsigned long combined_pfn;
 	unsigned long uninitialized_var(buddy_pfn);
@@ -1006,7 +1007,8 @@ static inline void __free_one_page(struct page *page,
 				migratetype);
 	else
 		add_to_free_area(page, &zone->free_area[order], migratetype);
-
+	if (needs_reporting)
+		page_reporting_enqueue(page, order);
 }
 
 /*
@@ -1317,7 +1319,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		if (unlikely(isolated_pageblocks))
 			mt = get_pageblock_migratetype(page);
 
-		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
 		trace_mm_page_pcpu_drain(page, 0, mt);
 	}
 	spin_unlock(&zone->lock);
@@ -1326,14 +1328,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 static void free_one_page(struct zone *zone,
 				struct page *page, unsigned long pfn,
 				unsigned int order,
-				int migratetype)
+				int migratetype, bool needs_reporting)
 {
 	spin_lock(&zone->lock);
 	if (unlikely(has_isolate_pageblock(zone) ||
 		is_migrate_isolate(migratetype))) {
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
-	__free_one_page(page, pfn, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype, needs_reporting);
 	spin_unlock(&zone->lock);
 }
 
@@ -1423,7 +1425,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, pfn, order, migratetype);
+	free_one_page(page_zone(page), page, pfn, order, migratetype, true);
 	local_irq_restore(flags);
 }
 
@@ -2197,6 +2199,32 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	return NULL;
 }
 
+#ifdef CONFIG_PAGE_REPORTING
+/**
+ * return_isolated_page - returns a reported page back to the buddy.
+ * @zone: zone from where the page was isolated.
+ * @page: page which will be returned.
+ */
+void __return_isolated_page(struct zone *zone, struct page *page)
+{
+	unsigned int order, mt;
+	unsigned long pfn;
+
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
+
+	mt = get_pageblock_migratetype(page);
+	pfn = page_to_pfn(page);
+
+	if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt)))
+		mt = get_pfnblock_migratetype(page, pfn);
+
+	order = page_private(page);
+	set_page_private(page, 0);
+
+	__free_one_page(page, pfn, zone, order, mt, false);
+}
+#endif /* CONFIG_PAGE_REPORTING */
 
 /*
  * This array describes the order lists are fallen back to when
@@ -3041,7 +3069,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, pfn, 0, migratetype);
+			free_one_page(zone, page, pfn, 0, migratetype, true);
 			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
new file mode 100644
index 000000000000..4ee2c172cd9a
--- /dev/null
+++ b/mm/page_reporting.c
@@ -0,0 +1,332 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Page reporting core infrastructure to enable a VM to report free pages to its
+ * hypervisor.
+ *
+ * Copyright Red Hat, Inc. 2019
+ *
+ * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/page_reporting.h>
+#include <linux/scatterlist.h>
+#include "internal.h"
+
+static struct page_reporting_config __rcu *page_reporting_conf __read_mostly;
+static DEFINE_MUTEX(page_reporting_mutex);
+
+static inline unsigned long pfn_to_bit(struct page *page, struct zone *zone)
+{
+	unsigned long bitnr;
+
+	bitnr = (page_to_pfn(page) - zone->base_pfn) >>
+		PAGE_REPORTING_MIN_ORDER;
+
+	return bitnr;
+}
+
+static void return_isolated_page(struct zone *zone,
+				 struct page_reporting_config *phconf)
+{
+	struct scatterlist *sg = phconf->sg;
+
+	spin_lock(&zone->lock);
+	do {
+		__return_isolated_page(zone, sg_page(sg));
+	} while (!sg_is_last(sg++));
+	spin_unlock(&zone->lock);
+}
+
+static void bitmap_set_bit(struct page *page, struct zone *zone)
+{
+	unsigned long bitnr = 0;
+
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
+
+	bitnr = pfn_to_bit(page, zone);
+	/* set bit if it is not already set and is a valid bit */
+	if (zone->bitmap && bitnr < zone->nbits &&
+	    !test_and_set_bit(bitnr, zone->bitmap))
+		atomic_inc(&zone->free_pages);
+}
+
+static int process_free_page(struct page *page,
+			     struct page_reporting_config *phconf, int count)
+{
+	int mt, order, ret = 0;
+
+	mt = get_pageblock_migratetype(page);
+	order = page_private(page);
+	ret = __isolate_free_page(page, order);
+
+	if (ret) {
+		/*
+		 * Preserving order and migratetype for reuse while
+		 * releasing the pages back to the buddy.
+		 */
+		set_pageblock_migratetype(page, mt);
+		set_page_private(page, order);
+
+		sg_set_page(&phconf->sg[count++], page,
+			    PAGE_SIZE << order, 0);
+	}
+
+	return count;
+}
+
+/**
+ * scan_zone_bitmap - scans the bitmap for the requested zone.
+ * @phconf: page reporting configuration object initialized by the backend.
+ * @zone: zone for which page reporting is requested.
+ *
+ * For every page marked in the bitmap it checks if it is still free if so it
+ * isolates and adds them to a scatterlist. As soon as the number of isolated
+ * pages reach the threshold set by the backend, they are reported to the
+ * hypervisor by the backend. Once the hypervisor responds after processing
+ * they are returned back to the buddy for reuse.
+ */
+static void scan_zone_bitmap(struct page_reporting_config *phconf,
+			     struct zone *zone)
+{
+	unsigned long setbit;
+	struct page *page;
+	int count = 0;
+
+	sg_init_table(phconf->sg, phconf->max_pages);
+
+	for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
+		/* Process only if the page is still online */
+		page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
+					  zone->base_pfn);
+		if (!page)
+			continue;
+
+		spin_lock(&zone->lock);
+
+		/* Ensure page is still free and can be processed */
+		if (PageBuddy(page) && page_private(page) >=
+		    PAGE_REPORTING_MIN_ORDER)
+			count = process_free_page(page, phconf, count);
+
+		spin_unlock(&zone->lock);
+		/* Page has been processed, adjust its bit and zone counter */
+		clear_bit(setbit, zone->bitmap);
+		atomic_dec(&zone->free_pages);
+
+		if (count == phconf->max_pages) {
+			/* Report isolated pages to the hypervisor */
+			phconf->report(phconf, count);
+
+			/* Return processed pages back to the buddy */
+			return_isolated_page(zone, phconf);
+
+			/* Reset for next reporting */
+			sg_init_table(phconf->sg, phconf->max_pages);
+			count = 0;
+		}
+	}
+	/*
+	 * If the number of isolated pages does not meet the max_pages
+	 * threshold, we would still prefer to report them as we have already
+	 * isolated them.
+	 */
+	if (count) {
+		sg_mark_end(&phconf->sg[count - 1]);
+		phconf->report(phconf, count);
+
+		return_isolated_page(zone, phconf);
+	}
+}
+
+/**
+ * page_reporting_wq - checks the number of free_pages in all the zones and
+ * invokes a request to scan the respective bitmap if free_pages reaches or
+ * exceeds the threshold specified by the backend.
+ */
+static void page_reporting_wq(struct work_struct *work)
+{
+	struct page_reporting_config *phconf =
+		container_of(work, struct page_reporting_config,
+			     reporting_work);
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		if (atomic_read(&zone->free_pages) >= phconf->max_pages)
+			scan_zone_bitmap(phconf, zone);
+	}
+	/*
+	 * We have processed all the zones, we can process new page reporting
+	 * request now.
+	 */
+	atomic_set(&phconf->refcnt, 0);
+}
+
+/**
+ * __page_reporting_enqueue - tracks the freed page in the respective zone's
+ * bitmap and enqueues a new page reporting job to the workqueue if possible.
+ */
+void __page_reporting_enqueue(struct page *page)
+{
+	struct page_reporting_config *phconf;
+	struct zone *zone;
+
+	rcu_read_lock();
+	/*
+	 * We should not process this page if either page reporting is not
+	 * yet completely enabled or it has been disabled by the backend.
+	 */
+	phconf = rcu_dereference(page_reporting_conf);
+	if (!phconf)
+		return;
+
+	zone = page_zone(page);
+	bitmap_set_bit(page, zone);
+
+	/*
+	 * We should not enqueue a job if a previously enqueued reporting work
+	 * is in progress or we don't have enough free pages in the zone.
+	 */
+	if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
+	    !atomic_cmpxchg(&phconf->refcnt, 0, 1))
+		schedule_work(&phconf->reporting_work);
+
+	rcu_read_unlock();
+}
+
+/**
+ * zone_reporting_cleanup - resets the page reporting fields and free the
+ * bitmap for all the initialized zones.
+ */
+static void zone_reporting_cleanup(void)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		/*
+		 * Bitmap may not be allocated for all the zones if the
+		 * initialization fails before reaching to the last one.
+		 */
+		if (!zone->bitmap)
+			continue;
+		bitmap_free(zone->bitmap);
+		zone->bitmap = NULL;
+		atomic_set(&zone->free_pages, 0);
+	}
+}
+
+static int zone_bitmap_alloc(struct zone *zone)
+{
+	unsigned long bitmap_size, pages;
+
+	pages = zone->end_pfn - zone->base_pfn;
+	bitmap_size = (pages >> PAGE_REPORTING_MIN_ORDER) + 1;
+
+	if (!bitmap_size)
+		return 0;
+
+	zone->bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
+	if (!zone->bitmap)
+		return -ENOMEM;
+
+	zone->nbits = bitmap_size;
+
+	return 0;
+}
+
+/**
+ * zone_reporting_init - For each zone initializes the page reporting fields
+ * and allocates the respective bitmap.
+ *
+ * This function returns 0 on successful initialization, -ENOMEM otherwise.
+ */
+static int zone_reporting_init(void)
+{
+	struct zone *zone;
+	int ret;
+
+	for_each_populated_zone(zone) {
+#ifdef CONFIG_ZONE_DEVICE
+		/* we can not report pages which are not in the buddy */
+		if (zone_idx(zone) == ZONE_DEVICE)
+			continue;
+#endif
+		spin_lock(&zone->lock);
+		zone->base_pfn = zone->zone_start_pfn;
+		zone->end_pfn = zone_end_pfn(zone);
+		spin_unlock(&zone->lock);
+
+		ret = zone_bitmap_alloc(zone);
+		if (ret < 0) {
+			zone_reporting_cleanup();
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+void page_reporting_disable(struct page_reporting_config *phconf)
+{
+	mutex_lock(&page_reporting_mutex);
+
+	if (rcu_access_pointer(page_reporting_conf) != phconf)
+		return;
+
+	RCU_INIT_POINTER(page_reporting_conf, NULL);
+	synchronize_rcu();
+
+	/* Cancel any pending reporting request */
+	cancel_work_sync(&phconf->reporting_work);
+
+	/* Free the scatterlist used for isolated pages */
+	kfree(phconf->sg);
+	phconf->sg = NULL;
+
+	/* Cleanup the bitmaps and old tracking data */
+	zone_reporting_cleanup();
+
+	mutex_unlock(&page_reporting_mutex);
+}
+EXPORT_SYMBOL_GPL(page_reporting_disable);
+
+int page_reporting_enable(struct page_reporting_config *phconf)
+{
+	int ret = 0;
+
+	mutex_lock(&page_reporting_mutex);
+
+	/* check if someone is already using page reporting*/
+	if (rcu_access_pointer(page_reporting_conf)) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	/* allocate scatterlist to hold isolated pages */
+	phconf->sg = kcalloc(phconf->max_pages, sizeof(*phconf->sg),
+			     GFP_KERNEL);
+	if (!phconf->sg) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* initialize each zone's fields required for page reporting */
+	ret = zone_reporting_init();
+	if (ret < 0) {
+		kfree(phconf->sg);
+		goto out;
+	}
+
+	atomic_set(&phconf->refcnt, 0);
+	INIT_WORK(&phconf->reporting_work, page_reporting_wq);
+
+	/* assign the configuration object provided by the backend */
+	rcu_assign_pointer(page_reporting_conf, phconf);
+
+out:
+	mutex_unlock(&page_reporting_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(page_reporting_enable);
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC][Patch v12 2/2] virtio-balloon: interface to support free page reporting
  2019-08-12 13:12 [RFC][PATCH v12 0/2] mm: Support for page reporting Nitesh Narayan Lal
  2019-08-12 13:12 ` [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure Nitesh Narayan Lal
@ 2019-08-12 13:12 ` Nitesh Narayan Lal
  2019-08-14 10:29   ` Cornelia Huck
  2019-08-12 13:13 ` [QEMU Patch 1/2] virtio-balloon: adding bit for page reporting support Nitesh Narayan Lal
  2019-09-11 12:30 ` [RFC][PATCH v12 0/2] mm: Support for " David Hildenbrand
  3 siblings, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-12 13:12 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, virtio-dev, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	dave.hansen, mhocko, cohuck

Enables the kernel to negotiate VIRTIO_BALLOON_F_REPORTING feature with
the host. If it is available and page_reporting_flag is set to true,
page_reporting is enabled and its callback is configured along with
the max_pages count which indicates the maximum number of pages that
can be isolated and reported at a time. Currently, only free pages of
order >= (MAX_ORDER - 2) are reported. To prevent any false OOM
max_pages count is set to 16.

By default page_reporting feature is enabled and gets loaded as soon
as the virtio-balloon driver is loaded. However, it could be disabled
by writing the page_reporting_flag which is a virtio-balloon parameter.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/virtio/Kconfig              |  1 +
 drivers/virtio/virtio_balloon.c     | 64 ++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_balloon.h |  1 +
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 078615cf2afc..4b2dd8259ff5 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -58,6 +58,7 @@ config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
 	select MEMORY_BALLOON
+	select PAGE_REPORTING
 	---help---
 	 This driver supports increasing and decreasing the amount
 	 of memory within a KVM guest.
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 226fbb995fb0..defec00d4ee2 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -19,6 +19,7 @@
 #include <linux/mount.h>
 #include <linux/magic.h>
 #include <linux/pseudo_fs.h>
+#include <linux/page_reporting.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -46,6 +47,7 @@ enum virtio_balloon_vq {
 	VIRTIO_BALLOON_VQ_DEFLATE,
 	VIRTIO_BALLOON_VQ_STATS,
 	VIRTIO_BALLOON_VQ_FREE_PAGE,
+	VIRTIO_BALLOON_VQ_REPORTING,
 	VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -55,7 +57,8 @@ enum virtio_balloon_config_read {
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
+			 *reporting_vq;
 
 	/* Balloon's own wq for cpu-intensive work items */
 	struct workqueue_struct *balloon_wq;
@@ -113,6 +116,9 @@ struct virtio_balloon {
 
 	/* To register a shrinker to shrink memory upon memory pressure */
 	struct shrinker shrinker;
+
+	/* To configure page reporting to report isolated pages */
+	struct page_reporting_config page_reporting_conf;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -120,6 +126,10 @@ static struct virtio_device_id id_table[] = {
 	{ 0 },
 };
 
+bool page_reporting_flag = true;
+module_param(page_reporting_flag, bool, 0644);
+MODULE_PARM_DESC(page_reporting_flag, "Enable page reporting");
+
 static u32 page_to_balloon_pfn(struct page *page)
 {
 	unsigned long pfn = page_to_pfn(page);
@@ -152,6 +162,44 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 
 }
 
+void virtballoon_report_pages(struct page_reporting_config *page_reporting_conf,
+			      unsigned int num_pages)
+{
+	struct virtio_balloon *vb = container_of(page_reporting_conf,
+						 struct virtio_balloon,
+						 page_reporting_conf);
+	struct virtqueue *vq = vb->reporting_vq;
+	int err, unused;
+
+	/* We should always be able to add these buffers to an empty queue. */
+	err = virtqueue_add_inbuf(vq, page_reporting_conf->sg, num_pages, vb,
+				  GFP_NOWAIT);
+	/* We should not report if the guest is low on memory */
+	if (unlikely(err))
+		return;
+	virtqueue_kick(vq);
+
+	/* When host has read buffer, this completes via balloon_ack */
+	wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+}
+
+static void virtballoon_page_reporting_setup(struct virtio_balloon *vb)
+{
+	struct device *dev = &vb->vdev->dev;
+	int err;
+
+	vb->page_reporting_conf.report = virtballoon_report_pages;
+	vb->page_reporting_conf.max_pages = PAGE_REPORTING_MAX_PAGES;
+	err = page_reporting_enable(&vb->page_reporting_conf);
+	if (err < 0) {
+		dev_err(dev, "Failed to enable reporting, err = %d\n", err);
+		page_reporting_flag = false;
+		vb->page_reporting_conf.report = NULL;
+		vb->page_reporting_conf.max_pages = 0;
+		return;
+	}
+}
+
 static void set_page_pfns(struct virtio_balloon *vb,
 			  __virtio32 pfns[], struct page *page)
 {
@@ -476,6 +524,7 @@ static int init_vqs(struct virtio_balloon *vb)
 	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
 	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
 	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+	names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
 
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -487,11 +536,18 @@ static int init_vqs(struct virtio_balloon *vb)
 		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	}
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
+		names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
+		callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
+	}
 	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
 					 vqs, callbacks, names, NULL, NULL);
 	if (err)
 		return err;
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+		vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
+
 	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
 	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
@@ -924,6 +980,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		if (err)
 			goto out_del_balloon_wq;
 	}
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING) &&
+	    page_reporting_flag)
+		virtballoon_page_reporting_setup(vb);
 	virtio_device_ready(vdev);
 
 	if (towards_target(vb))
@@ -971,6 +1030,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 		destroy_workqueue(vb->balloon_wq);
 	}
 
+	if (page_reporting_flag)
+		page_reporting_disable(&vb->page_reporting_conf);
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
 	if (vb->vb_dev_info.inode)
@@ -1027,6 +1088,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 	VIRTIO_BALLOON_F_PAGE_POISON,
+	VIRTIO_BALLOON_F_REPORTING,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index a1966cd7b677..19974392d324 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_REPORTING	5 /* Page reporting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [QEMU Patch 1/2] virtio-balloon: adding bit for page reporting support
  2019-08-12 13:12 [RFC][PATCH v12 0/2] mm: Support for page reporting Nitesh Narayan Lal
  2019-08-12 13:12 ` [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure Nitesh Narayan Lal
  2019-08-12 13:12 ` [RFC][Patch v12 2/2] virtio-balloon: interface to support free page reporting Nitesh Narayan Lal
@ 2019-08-12 13:13 ` Nitesh Narayan Lal
  2019-08-12 13:13   ` [QEMU Patch 2/2] virtio-balloon: support for handling page reporting Nitesh Narayan Lal
  2019-09-11 12:30 ` [RFC][PATCH v12 0/2] mm: Support for " David Hildenbrand
  3 siblings, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-12 13:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, virtio-dev, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	dave.hansen, mhocko, cohuck

This patch will be replaced once the feature is merged into the
Linux kernel.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 include/standard-headers/linux/virtio_balloon.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
index 9375ca2a70..1c5f6d6f2d 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_REPORTING	5 /* Page reporting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [QEMU Patch 2/2] virtio-balloon: support for handling page reporting
  2019-08-12 13:13 ` [QEMU Patch 1/2] virtio-balloon: adding bit for page reporting support Nitesh Narayan Lal
@ 2019-08-12 13:13   ` Nitesh Narayan Lal
  2019-08-12 15:18     ` Alexander Duyck
  0 siblings, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-12 13:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, virtio-dev, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	dave.hansen, mhocko, cohuck

Page reporting is a feature which enables the virtual machine to report
chunk of free pages to the hypervisor.
This patch enables QEMU to process these reports from the VM and discard the
unused memory range.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 hw/virtio/virtio-balloon.c         | 41 ++++++++++++++++++++++++++++++
 include/hw/virtio/virtio-balloon.h |  2 +-
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 25de154307..1132e47ee0 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -320,6 +320,39 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
     balloon_stats_change_timer(s, 0);
 }
 
+static void virtio_balloon_handle_reporting(VirtIODevice *vdev, VirtQueue *vq)
+{
+    VirtQueueElement *elem;
+
+    while ((elem = virtqueue_pop(vq, sizeof(VirtQueueElement)))) {
+        unsigned int i;
+
+        for (i = 0; i < elem->in_num; i++) {
+            void *gaddr = elem->in_sg[i].iov_base;
+            size_t size = elem->in_sg[i].iov_len;
+            ram_addr_t ram_offset;
+            size_t rb_page_size;
+	    RAMBlock *rb;
+
+            if (qemu_balloon_is_inhibited())
+                continue;
+
+            rb = qemu_ram_block_from_host(gaddr, false, &ram_offset);
+            rb_page_size = qemu_ram_pagesize(rb);
+
+            /* For now we will simply ignore unaligned memory regions */
+            if ((ram_offset | size) & (rb_page_size - 1))
+                continue;
+
+            ram_block_discard_range(rb, ram_offset, size);
+        }
+
+        virtqueue_push(vq, elem, 0);
+        virtio_notify(vdev, vq);
+        g_free(elem);
+    }
+}
+
 static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
@@ -792,6 +825,12 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
 
+    if (virtio_has_feature(s->host_features,
+                           VIRTIO_BALLOON_F_REPORTING)) {
+        s->reporting_vq = virtio_add_queue(vdev, 16,
+					   virtio_balloon_handle_reporting);
+    }
+
     if (virtio_has_feature(s->host_features,
                            VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
         s->free_page_vq = virtio_add_queue(vdev, VIRTQUEUE_MAX_SIZE,
@@ -912,6 +951,8 @@ static Property virtio_balloon_properties[] = {
      * is disabled, resulting in QEMU 3.1 migration incompatibility.  This
      * property retains this quirk for QEMU 4.1 machine types.
      */
+    DEFINE_PROP_BIT("free-page-reporting", VirtIOBalloon, host_features,
+                    VIRTIO_BALLOON_F_REPORTING, true),
     DEFINE_PROP_BOOL("qemu-4-0-config-size", VirtIOBalloon,
                      qemu_4_0_config_size, false),
     DEFINE_PROP_LINK("iothread", VirtIOBalloon, iothread, TYPE_IOTHREAD,
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index d1c968d237..15a05e6435 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -42,7 +42,7 @@ enum virtio_balloon_free_page_report_status {
 
 typedef struct VirtIOBalloon {
     VirtIODevice parent_obj;
-    VirtQueue *ivq, *dvq, *svq, *free_page_vq;
+    VirtQueue *ivq, *dvq, *svq, *free_page_vq, *reporting_vq;
     uint32_t free_page_report_status;
     uint32_t num_pages;
     uint32_t actual;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [QEMU Patch 2/2] virtio-balloon: support for handling page reporting
  2019-08-12 13:13   ` [QEMU Patch 2/2] virtio-balloon: support for handling page reporting Nitesh Narayan Lal
@ 2019-08-12 15:18     ` Alexander Duyck
  2019-08-12 15:26       ` Nitesh Narayan Lal
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2019-08-12 15:18 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck

On Mon, Aug 12, 2019 at 6:14 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> Page reporting is a feature which enables the virtual machine to report
> chunk of free pages to the hypervisor.
> This patch enables QEMU to process these reports from the VM and discard the
> unused memory range.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  hw/virtio/virtio-balloon.c         | 41 ++++++++++++++++++++++++++++++
>  include/hw/virtio/virtio-balloon.h |  2 +-
>  2 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 25de154307..1132e47ee0 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -320,6 +320,39 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
>      balloon_stats_change_timer(s, 0);
>  }
>
> +static void virtio_balloon_handle_reporting(VirtIODevice *vdev, VirtQueue *vq)
> +{
> +    VirtQueueElement *elem;
> +
> +    while ((elem = virtqueue_pop(vq, sizeof(VirtQueueElement)))) {
> +        unsigned int i;
> +
> +        for (i = 0; i < elem->in_num; i++) {
> +            void *gaddr = elem->in_sg[i].iov_base;
> +            size_t size = elem->in_sg[i].iov_len;
> +            ram_addr_t ram_offset;
> +            size_t rb_page_size;
> +           RAMBlock *rb;
> +
> +            if (qemu_balloon_is_inhibited())
> +                continue;
> +
> +            rb = qemu_ram_block_from_host(gaddr, false, &ram_offset);
> +            rb_page_size = qemu_ram_pagesize(rb);
> +
> +            /* For now we will simply ignore unaligned memory regions */
> +            if ((ram_offset | size) & (rb_page_size - 1))
> +                continue;
> +
> +            ram_block_discard_range(rb, ram_offset, size);
> +        }
> +
> +        virtqueue_push(vq, elem, 0);
> +        virtio_notify(vdev, vq);
> +        g_free(elem);
> +    }
> +}
> +

No offense, but I am a bit annoyed. If you are going to copy my code
you should at least keep up with the fixes. You are missing all of the
stuff to handle the poison value. If you are going to just duplicate
my setup you might as well have just pulled the QEMU patches from the
last submission I did. Then this would have at least has the fix for
the page poisoning. Also it wouldn't hurt to mention that you are
basing it off of the patch set I submitted since it hasn't been
accepted yet.

>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
> @@ -792,6 +825,12 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>
> +    if (virtio_has_feature(s->host_features,
> +                           VIRTIO_BALLOON_F_REPORTING)) {
> +        s->reporting_vq = virtio_add_queue(vdev, 16,
> +                                          virtio_balloon_handle_reporting);
> +    }
> +
>      if (virtio_has_feature(s->host_features,
>                             VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>          s->free_page_vq = virtio_add_queue(vdev, VIRTQUEUE_MAX_SIZE,
> @@ -912,6 +951,8 @@ static Property virtio_balloon_properties[] = {
>       * is disabled, resulting in QEMU 3.1 migration incompatibility.  This
>       * property retains this quirk for QEMU 4.1 machine types.
>       */
> +    DEFINE_PROP_BIT("free-page-reporting", VirtIOBalloon, host_features,
> +                    VIRTIO_BALLOON_F_REPORTING, true),
>      DEFINE_PROP_BOOL("qemu-4-0-config-size", VirtIOBalloon,
>                       qemu_4_0_config_size, false),
>      DEFINE_PROP_LINK("iothread", VirtIOBalloon, iothread, TYPE_IOTHREAD,
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index d1c968d237..15a05e6435 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -42,7 +42,7 @@ enum virtio_balloon_free_page_report_status {
>
>  typedef struct VirtIOBalloon {
>      VirtIODevice parent_obj;
> -    VirtQueue *ivq, *dvq, *svq, *free_page_vq;
> +    VirtQueue *ivq, *dvq, *svq, *free_page_vq, *reporting_vq;
>      uint32_t free_page_report_status;
>      uint32_t num_pages;
>      uint32_t actual;
> --
> 2.21.0
>q


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QEMU Patch 2/2] virtio-balloon: support for handling page reporting
  2019-08-12 15:18     ` Alexander Duyck
@ 2019-08-12 15:26       ` Nitesh Narayan Lal
  0 siblings, 0 replies; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-12 15:26 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck


On 8/12/19 11:18 AM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:14 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> Page reporting is a feature which enables the virtual machine to report
>> chunk of free pages to the hypervisor.
>> This patch enables QEMU to process these reports from the VM and discard the
>> unused memory range.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  hw/virtio/virtio-balloon.c         | 41 ++++++++++++++++++++++++++++++
>>  include/hw/virtio/virtio-balloon.h |  2 +-
>>  2 files changed, 42 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index 25de154307..1132e47ee0 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -320,6 +320,39 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
>>      balloon_stats_change_timer(s, 0);
>>  }
>>
>> +static void virtio_balloon_handle_reporting(VirtIODevice *vdev, VirtQueue *vq)
>> +{
>> +    VirtQueueElement *elem;
>> +
>> +    while ((elem = virtqueue_pop(vq, sizeof(VirtQueueElement)))) {
>> +        unsigned int i;
>> +
>> +        for (i = 0; i < elem->in_num; i++) {
>> +            void *gaddr = elem->in_sg[i].iov_base;
>> +            size_t size = elem->in_sg[i].iov_len;
>> +            ram_addr_t ram_offset;
>> +            size_t rb_page_size;
>> +           RAMBlock *rb;
>> +
>> +            if (qemu_balloon_is_inhibited())
>> +                continue;
>> +
>> +            rb = qemu_ram_block_from_host(gaddr, false, &ram_offset);
>> +            rb_page_size = qemu_ram_pagesize(rb);
>> +
>> +            /* For now we will simply ignore unaligned memory regions */
>> +            if ((ram_offset | size) & (rb_page_size - 1))
>> +                continue;
>> +
>> +            ram_block_discard_range(rb, ram_offset, size);
>> +        }
>> +
>> +        virtqueue_push(vq, elem, 0);
>> +        virtio_notify(vdev, vq);
>> +        g_free(elem);
>> +    }
>> +}
>> +
> No offense, but I am a bit annoyed.

None taken at all.

>  If you are going to copy my code
> you should at least keep up with the fixes.


Yeah I did refer to your code and just because the quality of your code is
better than what I posted earlier and there is quite a lot for me to learn from it.


> stuff to handle the poison value. If you are going to just duplicate
> my setup you might as well have just pulled the QEMU patches from the
> last submission I did. Then this would have at least has the fix for
> the page poisoning.
>

The only reason I didn't include the poison change as I still need to understand
them.
I have this mentioned in my cover-email.


>  Also it wouldn't hurt to mention that you are
> basing it off of the patch set I submitted since it hasn't been
> accepted yet.


My bad!! This I will surely do from next time.

>
>>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>  {
>>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
>> @@ -792,6 +825,12 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>>
>> +    if (virtio_has_feature(s->host_features,
>> +                           VIRTIO_BALLOON_F_REPORTING)) {
>> +        s->reporting_vq = virtio_add_queue(vdev, 16,
>> +                                          virtio_balloon_handle_reporting);
>> +    }
>> +
>>      if (virtio_has_feature(s->host_features,
>>                             VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>>          s->free_page_vq = virtio_add_queue(vdev, VIRTQUEUE_MAX_SIZE,
>> @@ -912,6 +951,8 @@ static Property virtio_balloon_properties[] = {
>>       * is disabled, resulting in QEMU 3.1 migration incompatibility.  This
>>       * property retains this quirk for QEMU 4.1 machine types.
>>       */
>> +    DEFINE_PROP_BIT("free-page-reporting", VirtIOBalloon, host_features,
>> +                    VIRTIO_BALLOON_F_REPORTING, true),
>>      DEFINE_PROP_BOOL("qemu-4-0-config-size", VirtIOBalloon,
>>                       qemu_4_0_config_size, false),
>>      DEFINE_PROP_LINK("iothread", VirtIOBalloon, iothread, TYPE_IOTHREAD,
>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>> index d1c968d237..15a05e6435 100644
>> --- a/include/hw/virtio/virtio-balloon.h
>> +++ b/include/hw/virtio/virtio-balloon.h
>> @@ -42,7 +42,7 @@ enum virtio_balloon_free_page_report_status {
>>
>>  typedef struct VirtIOBalloon {
>>      VirtIODevice parent_obj;
>> -    VirtQueue *ivq, *dvq, *svq, *free_page_vq;
>> +    VirtQueue *ivq, *dvq, *svq, *free_page_vq, *reporting_vq;
>>      uint32_t free_page_report_status;
>>      uint32_t num_pages;
>>      uint32_t actual;
>> --
>> 2.21.0
>> q
-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-12 13:12 ` [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure Nitesh Narayan Lal
@ 2019-08-12 18:47   ` Alexander Duyck
  2019-08-12 20:04     ` Nitesh Narayan Lal
                       ` (3 more replies)
  2019-10-10 20:36   ` Alexander Duyck
  1 sibling, 4 replies; 35+ messages in thread
From: Alexander Duyck @ 2019-08-12 18:47 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck

On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> This patch introduces the core infrastructure for free page reporting in
> virtual environments. It enables the kernel to track the free pages which
> can be reported to its hypervisor so that the hypervisor could
> free and reuse that memory as per its requirement.
>
> While the pages are getting processed in the hypervisor (e.g.,
> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> would be possible. To avoid such a situation, these pages are
> temporarily removed from the buddy. The amount of pages removed
> temporarily from the buddy is governed by the backend(virtio-balloon
> in our case).
>
> To efficiently identify free pages that can to be reported to the
> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> chunks are reported to the hypervisor - especially, to not break up THP
> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> in the bitmap are an indication whether a page *might* be free, not a
> guarantee. A new hook after buddy merging sets the bits.
>
> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> asynchronously processes the bitmaps, trying to isolate and report pages
> that are still free. The backend (virtio-balloon) is responsible for
> reporting these batched pages to the host synchronously. Once reporting/
> freeing is complete, isolated pages are returned back to the buddy.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

So if I understand correctly the hotplug support for this is still not
included correct? I assume that is the case since I don't see any
logic for zone resizing.

Also I don't see how this dealt with the sparse issue that was pointed
out earlier. Specifically how would you deal with a zone that has a
wide range between the base and the end and a huge gap somewhere
in-between?

> ---
>  include/linux/mmzone.h         |  11 ++
>  include/linux/page_reporting.h |  63 +++++++
>  mm/Kconfig                     |   6 +
>  mm/Makefile                    |   1 +
>  mm/page_alloc.c                |  42 ++++-
>  mm/page_reporting.c            | 332 +++++++++++++++++++++++++++++++++
>  6 files changed, 448 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/page_reporting.h
>  create mode 100644 mm/page_reporting.c
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d77d717c620c..ba5f5b508f25 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -559,6 +559,17 @@ struct zone {
>         /* Zone statistics */
>         atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
>         atomic_long_t           vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
> +#ifdef CONFIG_PAGE_REPORTING
> +       /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
> +       unsigned long *bitmap;
> +       /* Preserve start and end PFN in case they change due to hotplug */
> +       unsigned long base_pfn;
> +       unsigned long end_pfn;
> +       /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
> +       atomic_t free_pages;
> +       /* Number of bits required in the bitmap */
> +       unsigned long nbits;
> +#endif
>  } ____cacheline_internodealigned_in_smp;

Okay, so the original thing this patch set had going for it was that
it was non-invasive. However, now you are adding a bunch of stuff to
the zone. That kind of loses the non-invasive argument for this patch
set compared to mine.

If we are going to continue further with this patch set then it might
be worth looking into dynamically allocating the space you need for
this block. At a minimum you could probably look at making the bitmap
an RCU based setup so you could define the base and end along with the
bitmap. It would probably help to resolve the hotplug issues you still
need to address.

>  enum pgdat_flags {
> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
> new file mode 100644
> index 000000000000..37a39589939d
> --- /dev/null
> +++ b/include/linux/page_reporting.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PAGE_REPORTING_H
> +#define _LINUX_PAGE_REPORTING_H
> +
> +#define PAGE_REPORTING_MIN_ORDER               (MAX_ORDER - 2)
> +#define PAGE_REPORTING_MAX_PAGES               16
> +
> +#ifdef CONFIG_PAGE_REPORTING
> +struct page_reporting_config {
> +       /* function to hint batch of isolated pages */
> +       void (*report)(struct page_reporting_config *phconf,
> +                      unsigned int num_pages);
> +
> +       /* scatterlist to hold the isolated pages to be hinted */
> +       struct scatterlist *sg;
> +
> +       /*
> +        * Maxmimum pages that are going to be hinted to the hypervisor at a
> +        * time of granularity >= PAGE_REPORTING_MIN_ORDER.
> +        */
> +       int max_pages;
> +
> +       /* work object to process page reporting rqeuests */
> +       struct work_struct reporting_work;
> +
> +       /* tracks the number of reporting request processed at a time */
> +       atomic_t refcnt;
> +};
> +
> +void __page_reporting_enqueue(struct page *page);
> +void __return_isolated_page(struct zone *zone, struct page *page);
> +void set_pageblock_migratetype(struct page *page, int migratetype);
> +
> +/**
> + * page_reporting_enqueue - checks the eligibility of the freed page based on
> + * its order for further page reporting processing.
> + * @page: page which has been freed.
> + * @order: order for the the free page.
> + */
> +static inline void page_reporting_enqueue(struct page *page, int order)
> +{
> +       if (order < PAGE_REPORTING_MIN_ORDER)
> +               return;
> +       __page_reporting_enqueue(page);
> +}
> +
> +int page_reporting_enable(struct page_reporting_config *phconf);
> +void page_reporting_disable(struct page_reporting_config *phconf);
> +#else
> +static inline void page_reporting_enqueue(struct page *page, int order)
> +{
> +}
> +
> +static inline int page_reporting_enable(struct page_reporting_config *phconf)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static inline void page_reporting_disable(struct page_reporting_config *phconf)
> +{
> +}
> +#endif /* CONFIG_PAGE_REPORTING */
> +#endif /* _LINUX_PAGE_REPORTING_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 56cec636a1fc..6a35a9dad350 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -736,4 +736,10 @@ config ARCH_HAS_PTE_SPECIAL
>  config ARCH_HAS_HUGEPD
>         bool
>
> +# PAGE_REPORTING will allow the guest to report the free pages to the
> +# host in fixed chunks as soon as a fixed threshold is reached.
> +config PAGE_REPORTING
> +       bool
> +       def_bool n
> +       depends on X86_64
>  endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index 338e528ad436..6a32cdfa61c2 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -104,3 +104,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
>  obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
>  obj-$(CONFIG_HMM_MIRROR) += hmm.o
>  obj-$(CONFIG_MEMFD_CREATE) += memfd.o
> +obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 272c6de1bf4e..aa7c89d50c85 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -68,6 +68,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/nmi.h>
>  #include <linux/psi.h>
> +#include <linux/page_reporting.h>
>
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -903,7 +904,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
>  static inline void __free_one_page(struct page *page,
>                 unsigned long pfn,
>                 struct zone *zone, unsigned int order,
> -               int migratetype)
> +               int migratetype, bool needs_reporting)
>  {
>         unsigned long combined_pfn;
>         unsigned long uninitialized_var(buddy_pfn);
> @@ -1006,7 +1007,8 @@ static inline void __free_one_page(struct page *page,
>                                 migratetype);
>         else
>                 add_to_free_area(page, &zone->free_area[order], migratetype);
> -
> +       if (needs_reporting)
> +               page_reporting_enqueue(page, order);
>  }
>
>  /*
> @@ -1317,7 +1319,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>                 if (unlikely(isolated_pageblocks))
>                         mt = get_pageblock_migratetype(page);
>
> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>                 trace_mm_page_pcpu_drain(page, 0, mt);
>         }
>         spin_unlock(&zone->lock);
> @@ -1326,14 +1328,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  static void free_one_page(struct zone *zone,
>                                 struct page *page, unsigned long pfn,
>                                 unsigned int order,
> -                               int migratetype)
> +                               int migratetype, bool needs_reporting)
>  {
>         spin_lock(&zone->lock);
>         if (unlikely(has_isolate_pageblock(zone) ||
>                 is_migrate_isolate(migratetype))) {
>                 migratetype = get_pfnblock_migratetype(page, pfn);
>         }
> -       __free_one_page(page, pfn, zone, order, migratetype);
> +       __free_one_page(page, pfn, zone, order, migratetype, needs_reporting);
>         spin_unlock(&zone->lock);
>  }
>
> @@ -1423,7 +1425,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>         migratetype = get_pfnblock_migratetype(page, pfn);
>         local_irq_save(flags);
>         __count_vm_events(PGFREE, 1 << order);
> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>         local_irq_restore(flags);
>  }
>
> @@ -2197,6 +2199,32 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>         return NULL;
>  }
>
> +#ifdef CONFIG_PAGE_REPORTING
> +/**
> + * return_isolated_page - returns a reported page back to the buddy.
> + * @zone: zone from where the page was isolated.
> + * @page: page which will be returned.
> + */
> +void __return_isolated_page(struct zone *zone, struct page *page)
> +{
> +       unsigned int order, mt;
> +       unsigned long pfn;
> +
> +       /* zone lock should be held when this function is called */
> +       lockdep_assert_held(&zone->lock);
> +
> +       mt = get_pageblock_migratetype(page);
> +       pfn = page_to_pfn(page);
> +
> +       if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt)))
> +               mt = get_pfnblock_migratetype(page, pfn);
> +
> +       order = page_private(page);
> +       set_page_private(page, 0);
> +
> +       __free_one_page(page, pfn, zone, order, mt, false);
> +}
> +#endif /* CONFIG_PAGE_REPORTING */
>
>  /*
>   * This array describes the order lists are fallen back to when
> @@ -3041,7 +3069,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>          */
>         if (migratetype >= MIGRATE_PCPTYPES) {
>                 if (unlikely(is_migrate_isolate(migratetype))) {
> -                       free_one_page(zone, page, pfn, 0, migratetype);
> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>                         return;
>                 }
>                 migratetype = MIGRATE_MOVABLE;
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> new file mode 100644
> index 000000000000..4ee2c172cd9a
> --- /dev/null
> +++ b/mm/page_reporting.c
> @@ -0,0 +1,332 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Page reporting core infrastructure to enable a VM to report free pages to its
> + * hypervisor.
> + *
> + * Copyright Red Hat, Inc. 2019
> + *
> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/page_reporting.h>
> +#include <linux/scatterlist.h>
> +#include "internal.h"
> +
> +static struct page_reporting_config __rcu *page_reporting_conf __read_mostly;
> +static DEFINE_MUTEX(page_reporting_mutex);
> +
> +static inline unsigned long pfn_to_bit(struct page *page, struct zone *zone)
> +{
> +       unsigned long bitnr;
> +
> +       bitnr = (page_to_pfn(page) - zone->base_pfn) >>
> +               PAGE_REPORTING_MIN_ORDER;
> +
> +       return bitnr;
> +}
> +
> +static void return_isolated_page(struct zone *zone,
> +                                struct page_reporting_config *phconf)
> +{
> +       struct scatterlist *sg = phconf->sg;
> +
> +       spin_lock(&zone->lock);
> +       do {
> +               __return_isolated_page(zone, sg_page(sg));
> +       } while (!sg_is_last(sg++));
> +       spin_unlock(&zone->lock);
> +}
> +
> +static void bitmap_set_bit(struct page *page, struct zone *zone)
> +{
> +       unsigned long bitnr = 0;
> +
> +       /* zone lock should be held when this function is called */
> +       lockdep_assert_held(&zone->lock);
> +

So why does the zone lock need to be held? What could you possibly
race against? If nothing else it might make more sense to look at
moving the bitmap, base, end, and length values into one RCU
allocation structure so that you could do without the requirement of
the zone lock to manipulate the bitmap.

> +       bitnr = pfn_to_bit(page, zone);
> +       /* set bit if it is not already set and is a valid bit */
> +       if (zone->bitmap && bitnr < zone->nbits &&
> +           !test_and_set_bit(bitnr, zone->bitmap))
> +               atomic_inc(&zone->free_pages);
> +}
> +
> +static int process_free_page(struct page *page,
> +                            struct page_reporting_config *phconf, int count)
> +{
> +       int mt, order, ret = 0;
> +
> +       mt = get_pageblock_migratetype(page);
> +       order = page_private(page);
> +       ret = __isolate_free_page(page, order);
> +
> +       if (ret) {
> +               /*
> +                * Preserving order and migratetype for reuse while
> +                * releasing the pages back to the buddy.
> +                */
> +               set_pageblock_migratetype(page, mt);
> +               set_page_private(page, order);
> +
> +               sg_set_page(&phconf->sg[count++], page,
> +                           PAGE_SIZE << order, 0);
> +       }
> +
> +       return count;
> +}
> +
> +/**
> + * scan_zone_bitmap - scans the bitmap for the requested zone.
> + * @phconf: page reporting configuration object initialized by the backend.
> + * @zone: zone for which page reporting is requested.
> + *
> + * For every page marked in the bitmap it checks if it is still free if so it
> + * isolates and adds them to a scatterlist. As soon as the number of isolated
> + * pages reach the threshold set by the backend, they are reported to the
> + * hypervisor by the backend. Once the hypervisor responds after processing
> + * they are returned back to the buddy for reuse.
> + */
> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
> +                            struct zone *zone)
> +{
> +       unsigned long setbit;
> +       struct page *page;
> +       int count = 0;
> +
> +       sg_init_table(phconf->sg, phconf->max_pages);
> +
> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
> +               /* Process only if the page is still online */
> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
> +                                         zone->base_pfn);
> +               if (!page)
> +                       continue;
> +

Shouldn't you be clearing the bit and dropping the reference to
free_pages before you move on to the next bit? Otherwise you are going
to be stuck with those aren't you?

> +               spin_lock(&zone->lock);
> +
> +               /* Ensure page is still free and can be processed */
> +               if (PageBuddy(page) && page_private(page) >=
> +                   PAGE_REPORTING_MIN_ORDER)
> +                       count = process_free_page(page, phconf, count);
> +
> +               spin_unlock(&zone->lock);

So I kind of wonder just how much overhead you are taking for bouncing
the zone lock once per page here. Especially since it can result in
you not actually making any progress since the page may have already
been reallocated.

> +               /* Page has been processed, adjust its bit and zone counter */
> +               clear_bit(setbit, zone->bitmap);
> +               atomic_dec(&zone->free_pages);

So earlier you were setting this bit and required that the zone->lock
be held while doing so. Here you are clearing it without any
zone->lock being held.

> +               if (count == phconf->max_pages) {
> +                       /* Report isolated pages to the hypervisor */
> +                       phconf->report(phconf, count);
> +
> +                       /* Return processed pages back to the buddy */
> +                       return_isolated_page(zone, phconf);
> +
> +                       /* Reset for next reporting */
> +                       sg_init_table(phconf->sg, phconf->max_pages);
> +                       count = 0;
> +               }
> +       }
> +       /*
> +        * If the number of isolated pages does not meet the max_pages
> +        * threshold, we would still prefer to report them as we have already
> +        * isolated them.
> +        */
> +       if (count) {
> +               sg_mark_end(&phconf->sg[count - 1]);
> +               phconf->report(phconf, count);
> +
> +               return_isolated_page(zone, phconf);
> +       }
> +}
> +
> +/**
> + * page_reporting_wq - checks the number of free_pages in all the zones and
> + * invokes a request to scan the respective bitmap if free_pages reaches or
> + * exceeds the threshold specified by the backend.
> + */
> +static void page_reporting_wq(struct work_struct *work)
> +{
> +       struct page_reporting_config *phconf =
> +               container_of(work, struct page_reporting_config,
> +                            reporting_work);
> +       struct zone *zone;
> +
> +       for_each_populated_zone(zone) {
> +               if (atomic_read(&zone->free_pages) >= phconf->max_pages)
> +                       scan_zone_bitmap(phconf, zone);
> +       }
> +       /*
> +        * We have processed all the zones, we can process new page reporting
> +        * request now.
> +        */
> +       atomic_set(&phconf->refcnt, 0);

It doesn't make sense to reset the refcnt once you have made a single pass.

> +}
> +
> +/**
> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
> + */
> +void __page_reporting_enqueue(struct page *page)
> +{
> +       struct page_reporting_config *phconf;
> +       struct zone *zone;
> +
> +       rcu_read_lock();
> +       /*
> +        * We should not process this page if either page reporting is not
> +        * yet completely enabled or it has been disabled by the backend.
> +        */
> +       phconf = rcu_dereference(page_reporting_conf);
> +       if (!phconf)
> +               return;
> +
> +       zone = page_zone(page);
> +       bitmap_set_bit(page, zone);
> +
> +       /*
> +        * We should not enqueue a job if a previously enqueued reporting work
> +        * is in progress or we don't have enough free pages in the zone.
> +        */
> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))

This doesn't make any sense to me. Why are you only incrementing the
refcount if it is zero? Combining this with the assignment above, this
isn't really a refcnt. It is just an oversized bitflag.

Also I am pretty sure this results in the opportunity to miss pages
because there is nothing to prevent you from possibly missing a ton of
pages you could hint on if a large number of pages are pushed out all
at once and then the system goes idle in terms of memory allocation
and freeing.

> +               schedule_work(&phconf->reporting_work);
> +
> +       rcu_read_unlock();
> +}
> +
> +/**
> + * zone_reporting_cleanup - resets the page reporting fields and free the
> + * bitmap for all the initialized zones.
> + */
> +static void zone_reporting_cleanup(void)
> +{
> +       struct zone *zone;
> +
> +       for_each_populated_zone(zone) {
> +               /*
> +                * Bitmap may not be allocated for all the zones if the
> +                * initialization fails before reaching to the last one.
> +                */
> +               if (!zone->bitmap)
> +                       continue;
> +               bitmap_free(zone->bitmap);
> +               zone->bitmap = NULL;
> +               atomic_set(&zone->free_pages, 0);
> +       }
> +}
> +
> +static int zone_bitmap_alloc(struct zone *zone)
> +{
> +       unsigned long bitmap_size, pages;
> +
> +       pages = zone->end_pfn - zone->base_pfn;
> +       bitmap_size = (pages >> PAGE_REPORTING_MIN_ORDER) + 1;
> +
> +       if (!bitmap_size)
> +               return 0;
> +
> +       zone->bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
> +       if (!zone->bitmap)
> +               return -ENOMEM;
> +
> +       zone->nbits = bitmap_size;
> +
> +       return 0;
> +}
> +

So as per your comments in the cover page, the two functions above
should also probably be plugged into the zone resizing logic somewhere
so if a zone is resized the bitmap is adjusted.

> +/**
> + * zone_reporting_init - For each zone initializes the page reporting fields
> + * and allocates the respective bitmap.
> + *
> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
> + */
> +static int zone_reporting_init(void)
> +{
> +       struct zone *zone;
> +       int ret;
> +
> +       for_each_populated_zone(zone) {
> +#ifdef CONFIG_ZONE_DEVICE
> +               /* we can not report pages which are not in the buddy */
> +               if (zone_idx(zone) == ZONE_DEVICE)
> +                       continue;
> +#endif

I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
zone will be considered "populated".


> +               spin_lock(&zone->lock);
> +               zone->base_pfn = zone->zone_start_pfn;
> +               zone->end_pfn = zone_end_pfn(zone);
> +               spin_unlock(&zone->lock);
> +
> +               ret = zone_bitmap_alloc(zone);
> +               if (ret < 0) {
> +                       zone_reporting_cleanup();
> +                       return ret;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +void page_reporting_disable(struct page_reporting_config *phconf)
> +{
> +       mutex_lock(&page_reporting_mutex);
> +
> +       if (rcu_access_pointer(page_reporting_conf) != phconf)
> +               return;
> +
> +       RCU_INIT_POINTER(page_reporting_conf, NULL);
> +       synchronize_rcu();
> +
> +       /* Cancel any pending reporting request */
> +       cancel_work_sync(&phconf->reporting_work);
> +
> +       /* Free the scatterlist used for isolated pages */
> +       kfree(phconf->sg);
> +       phconf->sg = NULL;
> +
> +       /* Cleanup the bitmaps and old tracking data */
> +       zone_reporting_cleanup();
> +
> +       mutex_unlock(&page_reporting_mutex);
> +}
> +EXPORT_SYMBOL_GPL(page_reporting_disable);
> +
> +int page_reporting_enable(struct page_reporting_config *phconf)
> +{
> +       int ret = 0;
> +
> +       mutex_lock(&page_reporting_mutex);
> +
> +       /* check if someone is already using page reporting*/
> +       if (rcu_access_pointer(page_reporting_conf)) {
> +               ret = -EBUSY;
> +               goto out;
> +       }
> +
> +       /* allocate scatterlist to hold isolated pages */
> +       phconf->sg = kcalloc(phconf->max_pages, sizeof(*phconf->sg),
> +                            GFP_KERNEL);
> +       if (!phconf->sg) {
> +               ret = -ENOMEM;
> +               goto out;
> +       }
> +
> +       /* initialize each zone's fields required for page reporting */
> +       ret = zone_reporting_init();
> +       if (ret < 0) {
> +               kfree(phconf->sg);
> +               goto out;
> +       }
> +
> +       atomic_set(&phconf->refcnt, 0);
> +       INIT_WORK(&phconf->reporting_work, page_reporting_wq);
> +
> +       /* assign the configuration object provided by the backend */
> +       rcu_assign_pointer(page_reporting_conf, phconf);
> +
> +out:
> +       mutex_unlock(&page_reporting_mutex);
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(page_reporting_enable);
> --
> 2.21.0
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-12 18:47   ` Alexander Duyck
@ 2019-08-12 20:04     ` Nitesh Narayan Lal
  2019-08-20 14:11       ` Nitesh Narayan Lal
  2019-08-12 20:05     ` David Hildenbrand
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-12 20:04 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck


On 8/12/19 2:47 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch introduces the core infrastructure for free page reporting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be reported to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> So if I understand correctly the hotplug support for this is still not
> included correct? 


That is correct, I have it as an ongoing-item in my cover-email.
In case, we decide to go with this approach do you think it is a blocker?


> I assume that is the case since I don't see any
> logic for zone resizing.
>
> Also I don't see how this dealt with the sparse issue that was pointed
> out earlier. Specifically how would you deal with a zone that has a
> wide range between the base and the end and a huge gap somewhere
> in-between?

It doesn't. However, considering we are tracking page on order of (MAX_ORDER -
2) I don't think the loss will be significant.
In any case, this is certainly a drawback of this approach and I should add this
in my cover.
The one thing which I did change in this version is that now I started
maintaining bitmap for each zone.

>
>> ---
>>  include/linux/mmzone.h         |  11 ++
>>  include/linux/page_reporting.h |  63 +++++++
>>  mm/Kconfig                     |   6 +
>>  mm/Makefile                    |   1 +
>>  mm/page_alloc.c                |  42 ++++-
>>  mm/page_reporting.c            | 332 +++++++++++++++++++++++++++++++++
>>  6 files changed, 448 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_reporting.h
>>  create mode 100644 mm/page_reporting.c
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index d77d717c620c..ba5f5b508f25 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -559,6 +559,17 @@ struct zone {
>>         /* Zone statistics */
>>         atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
>>         atomic_long_t           vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>> +#ifdef CONFIG_PAGE_REPORTING
>> +       /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
>> +       unsigned long *bitmap;
>> +       /* Preserve start and end PFN in case they change due to hotplug */
>> +       unsigned long base_pfn;
>> +       unsigned long end_pfn;
>> +       /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
>> +       atomic_t free_pages;
>> +       /* Number of bits required in the bitmap */
>> +       unsigned long nbits;
>> +#endif
>>  } ____cacheline_internodealigned_in_smp;
> Okay, so the original thing this patch set had going for it was that
> it was non-invasive. However, now you are adding a bunch of stuff to
> the zone. That kind of loses the non-invasive argument for this patch
> set compared to mine.

I think it is fair to add that it not as invasive as yours. :) (But that has its
own pros and cons)
In any case, I do understand your point.

>
>
> If we are going to continue further with this patch set then it might
> be worth looking into dynamically allocating the space you need for
> this block.

Not sure if I understood this part. Dynamic allocation for the structure which
you are suggesting below?


>  At a minimum you could probably look at making the bitmap
> an RCU based setup so you could define the base and end along with the
> bitmap. It would probably help to resolve the hotplug issues you still
> need to address.


Thanks for the suggestion. I will look into it.


>
>>  enum pgdat_flags {
>> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
>> new file mode 100644
>> index 000000000000..37a39589939d
>> --- /dev/null
>> +++ b/include/linux/page_reporting.h
>> @@ -0,0 +1,63 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_PAGE_REPORTING_H
>> +#define _LINUX_PAGE_REPORTING_H
>> +
>> +#define PAGE_REPORTING_MIN_ORDER               (MAX_ORDER - 2)
>> +#define PAGE_REPORTING_MAX_PAGES               16
>> +
>> +#ifdef CONFIG_PAGE_REPORTING
>> +struct page_reporting_config {
>> +       /* function to hint batch of isolated pages */
>> +       void (*report)(struct page_reporting_config *phconf,
>> +                      unsigned int num_pages);
>> +
>> +       /* scatterlist to hold the isolated pages to be hinted */
>> +       struct scatterlist *sg;
>> +
>> +       /*
>> +        * Maxmimum pages that are going to be hinted to the hypervisor at a
>> +        * time of granularity >= PAGE_REPORTING_MIN_ORDER.
>> +        */
>> +       int max_pages;
>> +
>> +       /* work object to process page reporting rqeuests */
>> +       struct work_struct reporting_work;
>> +
>> +       /* tracks the number of reporting request processed at a time */
>> +       atomic_t refcnt;
>> +};
>> +
>> +void __page_reporting_enqueue(struct page *page);
>> +void __return_isolated_page(struct zone *zone, struct page *page);
>> +void set_pageblock_migratetype(struct page *page, int migratetype);
>> +
>> +/**
>> + * page_reporting_enqueue - checks the eligibility of the freed page based on
>> + * its order for further page reporting processing.
>> + * @page: page which has been freed.
>> + * @order: order for the the free page.
>> + */
>> +static inline void page_reporting_enqueue(struct page *page, int order)
>> +{
>> +       if (order < PAGE_REPORTING_MIN_ORDER)
>> +               return;
>> +       __page_reporting_enqueue(page);
>> +}
>> +
>> +int page_reporting_enable(struct page_reporting_config *phconf);
>> +void page_reporting_disable(struct page_reporting_config *phconf);
>> +#else
>> +static inline void page_reporting_enqueue(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline int page_reporting_enable(struct page_reporting_config *phconf)
>> +{
>> +       return -EOPNOTSUPP;
>> +}
>> +
>> +static inline void page_reporting_disable(struct page_reporting_config *phconf)
>> +{
>> +}
>> +#endif /* CONFIG_PAGE_REPORTING */
>> +#endif /* _LINUX_PAGE_REPORTING_H */
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 56cec636a1fc..6a35a9dad350 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -736,4 +736,10 @@ config ARCH_HAS_PTE_SPECIAL
>>  config ARCH_HAS_HUGEPD
>>         bool
>>
>> +# PAGE_REPORTING will allow the guest to report the free pages to the
>> +# host in fixed chunks as soon as a fixed threshold is reached.
>> +config PAGE_REPORTING
>> +       bool
>> +       def_bool n
>> +       depends on X86_64
>>  endmenu
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 338e528ad436..6a32cdfa61c2 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -104,3 +104,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
>>  obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
>>  obj-$(CONFIG_HMM_MIRROR) += hmm.o
>>  obj-$(CONFIG_MEMFD_CREATE) += memfd.o
>> +obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 272c6de1bf4e..aa7c89d50c85 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -68,6 +68,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/nmi.h>
>>  #include <linux/psi.h>
>> +#include <linux/page_reporting.h>
>>
>>  #include <asm/sections.h>
>>  #include <asm/tlbflush.h>
>> @@ -903,7 +904,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
>>  static inline void __free_one_page(struct page *page,
>>                 unsigned long pfn,
>>                 struct zone *zone, unsigned int order,
>> -               int migratetype)
>> +               int migratetype, bool needs_reporting)
>>  {
>>         unsigned long combined_pfn;
>>         unsigned long uninitialized_var(buddy_pfn);
>> @@ -1006,7 +1007,8 @@ static inline void __free_one_page(struct page *page,
>>                                 migratetype);
>>         else
>>                 add_to_free_area(page, &zone->free_area[order], migratetype);
>> -
>> +       if (needs_reporting)
>> +               page_reporting_enqueue(page, order);
>>  }
>>
>>  /*
>> @@ -1317,7 +1319,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>                 if (unlikely(isolated_pageblocks))
>>                         mt = get_pageblock_migratetype(page);
>>
>> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>>                 trace_mm_page_pcpu_drain(page, 0, mt);
>>         }
>>         spin_unlock(&zone->lock);
>> @@ -1326,14 +1328,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  static void free_one_page(struct zone *zone,
>>                                 struct page *page, unsigned long pfn,
>>                                 unsigned int order,
>> -                               int migratetype)
>> +                               int migratetype, bool needs_reporting)
>>  {
>>         spin_lock(&zone->lock);
>>         if (unlikely(has_isolate_pageblock(zone) ||
>>                 is_migrate_isolate(migratetype))) {
>>                 migratetype = get_pfnblock_migratetype(page, pfn);
>>         }
>> -       __free_one_page(page, pfn, zone, order, migratetype);
>> +       __free_one_page(page, pfn, zone, order, migratetype, needs_reporting);
>>         spin_unlock(&zone->lock);
>>  }
>>
>> @@ -1423,7 +1425,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>>         migratetype = get_pfnblock_migratetype(page, pfn);
>>         local_irq_save(flags);
>>         __count_vm_events(PGFREE, 1 << order);
>> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
>> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>>         local_irq_restore(flags);
>>  }
>>
>> @@ -2197,6 +2199,32 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>>         return NULL;
>>  }
>>
>> +#ifdef CONFIG_PAGE_REPORTING
>> +/**
>> + * return_isolated_page - returns a reported page back to the buddy.
>> + * @zone: zone from where the page was isolated.
>> + * @page: page which will be returned.
>> + */
>> +void __return_isolated_page(struct zone *zone, struct page *page)
>> +{
>> +       unsigned int order, mt;
>> +       unsigned long pfn;
>> +
>> +       /* zone lock should be held when this function is called */
>> +       lockdep_assert_held(&zone->lock);
>> +
>> +       mt = get_pageblock_migratetype(page);
>> +       pfn = page_to_pfn(page);
>> +
>> +       if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt)))
>> +               mt = get_pfnblock_migratetype(page, pfn);
>> +
>> +       order = page_private(page);
>> +       set_page_private(page, 0);
>> +
>> +       __free_one_page(page, pfn, zone, order, mt, false);
>> +}
>> +#endif /* CONFIG_PAGE_REPORTING */
>>
>>  /*
>>   * This array describes the order lists are fallen back to when
>> @@ -3041,7 +3069,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>>          */
>>         if (migratetype >= MIGRATE_PCPTYPES) {
>>                 if (unlikely(is_migrate_isolate(migratetype))) {
>> -                       free_one_page(zone, page, pfn, 0, migratetype);
>> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>>                         return;
>>                 }
>>                 migratetype = MIGRATE_MOVABLE;
>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
>> new file mode 100644
>> index 000000000000..4ee2c172cd9a
>> --- /dev/null
>> +++ b/mm/page_reporting.c
>> @@ -0,0 +1,332 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Page reporting core infrastructure to enable a VM to report free pages to its
>> + * hypervisor.
>> + *
>> + * Copyright Red Hat, Inc. 2019
>> + *
>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <linux/page_reporting.h>
>> +#include <linux/scatterlist.h>
>> +#include "internal.h"
>> +
>> +static struct page_reporting_config __rcu *page_reporting_conf __read_mostly;
>> +static DEFINE_MUTEX(page_reporting_mutex);
>> +
>> +static inline unsigned long pfn_to_bit(struct page *page, struct zone *zone)
>> +{
>> +       unsigned long bitnr;
>> +
>> +       bitnr = (page_to_pfn(page) - zone->base_pfn) >>
>> +               PAGE_REPORTING_MIN_ORDER;
>> +
>> +       return bitnr;
>> +}
>> +
>> +static void return_isolated_page(struct zone *zone,
>> +                                struct page_reporting_config *phconf)
>> +{
>> +       struct scatterlist *sg = phconf->sg;
>> +
>> +       spin_lock(&zone->lock);
>> +       do {
>> +               __return_isolated_page(zone, sg_page(sg));
>> +       } while (!sg_is_last(sg++));
>> +       spin_unlock(&zone->lock);
>> +}
>> +
>> +static void bitmap_set_bit(struct page *page, struct zone *zone)
>> +{
>> +       unsigned long bitnr = 0;
>> +
>> +       /* zone lock should be held when this function is called */
>> +       lockdep_assert_held(&zone->lock);
>> +
> So why does the zone lock need to be held? What could you possibly
> race against? If nothing else it might make more sense to look at
> moving the bitmap, base, end, and length values into one RCU
> allocation structure so that you could do without the requirement of
> the zone lock to manipulate the bitmap.


Makes sense to me.


>> +       bitnr = pfn_to_bit(page, zone);
>> +       /* set bit if it is not already set and is a valid bit */
>> +       if (zone->bitmap && bitnr < zone->nbits &&
>> +           !test_and_set_bit(bitnr, zone->bitmap))
>> +               atomic_inc(&zone->free_pages);
>> +}
>> +
>> +static int process_free_page(struct page *page,
>> +                            struct page_reporting_config *phconf, int count)
>> +{
>> +       int mt, order, ret = 0;
>> +
>> +       mt = get_pageblock_migratetype(page);
>> +       order = page_private(page);
>> +       ret = __isolate_free_page(page, order);
>> +
>> +       if (ret) {
>> +               /*
>> +                * Preserving order and migratetype for reuse while
>> +                * releasing the pages back to the buddy.
>> +                */
>> +               set_pageblock_migratetype(page, mt);
>> +               set_page_private(page, order);
>> +
>> +               sg_set_page(&phconf->sg[count++], page,
>> +                           PAGE_SIZE << order, 0);
>> +       }
>> +
>> +       return count;
>> +}
>> +
>> +/**
>> + * scan_zone_bitmap - scans the bitmap for the requested zone.
>> + * @phconf: page reporting configuration object initialized by the backend.
>> + * @zone: zone for which page reporting is requested.
>> + *
>> + * For every page marked in the bitmap it checks if it is still free if so it
>> + * isolates and adds them to a scatterlist. As soon as the number of isolated
>> + * pages reach the threshold set by the backend, they are reported to the
>> + * hypervisor by the backend. Once the hypervisor responds after processing
>> + * they are returned back to the buddy for reuse.
>> + */
>> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
>> +                            struct zone *zone)
>> +{
>> +       unsigned long setbit;
>> +       struct page *page;
>> +       int count = 0;
>> +
>> +       sg_init_table(phconf->sg, phconf->max_pages);
>> +
>> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
>> +               /* Process only if the page is still online */
>> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
>> +                                         zone->base_pfn);
>> +               if (!page)
>> +                       continue;
>> +
> Shouldn't you be clearing the bit and dropping the reference to
> free_pages before you move on to the next bit? Otherwise you are going
> to be stuck with those aren't you?


+1. Thanks for pointing this out.


>> +               spin_lock(&zone->lock);
>> +
>> +               /* Ensure page is still free and can be processed */
>> +               if (PageBuddy(page) && page_private(page) >=
>> +                   PAGE_REPORTING_MIN_ORDER)
>> +                       count = process_free_page(page, phconf, count);
>> +
>> +               spin_unlock(&zone->lock);
> So I kind of wonder just how much overhead you are taking for bouncing
> the zone lock once per page here. Especially since it can result in
> you not actually making any progress since the page may have already
> been reallocated.


Any suggestion about how can I see the overhead involved here?


>> +               /* Page has been processed, adjust its bit and zone counter */
>> +               clear_bit(setbit, zone->bitmap);
>> +               atomic_dec(&zone->free_pages);
> So earlier you were setting this bit and required that the zone->lock
> be held while doing so. Here you are clearing it without any
> zone->lock being held.


I should have held the zone->lock here while clearing the bit.


>
>> +               if (count == phconf->max_pages) {
>> +                       /* Report isolated pages to the hypervisor */
>> +                       phconf->report(phconf, count);
>> +
>> +                       /* Return processed pages back to the buddy */
>> +                       return_isolated_page(zone, phconf);
>> +
>> +                       /* Reset for next reporting */
>> +                       sg_init_table(phconf->sg, phconf->max_pages);
>> +                       count = 0;
>> +               }
>> +       }
>> +       /*
>> +        * If the number of isolated pages does not meet the max_pages
>> +        * threshold, we would still prefer to report them as we have already
>> +        * isolated them.
>> +        */
>> +       if (count) {
>> +               sg_mark_end(&phconf->sg[count - 1]);
>> +               phconf->report(phconf, count);
>> +
>> +               return_isolated_page(zone, phconf);
>> +       }
>> +}
>> +
>> +/**
>> + * page_reporting_wq - checks the number of free_pages in all the zones and
>> + * invokes a request to scan the respective bitmap if free_pages reaches or
>> + * exceeds the threshold specified by the backend.
>> + */
>> +static void page_reporting_wq(struct work_struct *work)
>> +{
>> +       struct page_reporting_config *phconf =
>> +               container_of(work, struct page_reporting_config,
>> +                            reporting_work);
>> +       struct zone *zone;
>> +
>> +       for_each_populated_zone(zone) {
>> +               if (atomic_read(&zone->free_pages) >= phconf->max_pages)
>> +                       scan_zone_bitmap(phconf, zone);
>> +       }
>> +       /*
>> +        * We have processed all the zones, we can process new page reporting
>> +        * request now.
>> +        */
>> +       atomic_set(&phconf->refcnt, 0);
> It doesn't make sense to reset the refcnt once you have made a single pass.
>
>> +}
>> +
>> +/**
>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
>> + */
>> +void __page_reporting_enqueue(struct page *page)
>> +{
>> +       struct page_reporting_config *phconf;
>> +       struct zone *zone;
>> +
>> +       rcu_read_lock();
>> +       /*
>> +        * We should not process this page if either page reporting is not
>> +        * yet completely enabled or it has been disabled by the backend.
>> +        */
>> +       phconf = rcu_dereference(page_reporting_conf);
>> +       if (!phconf)
>> +               return;
>> +
>> +       zone = page_zone(page);
>> +       bitmap_set_bit(page, zone);
>> +
>> +       /*
>> +        * We should not enqueue a job if a previously enqueued reporting work
>> +        * is in progress or we don't have enough free pages in the zone.
>> +        */
>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
> This doesn't make any sense to me. Why are you only incrementing the
> refcount if it is zero? Combining this with the assignment above, this
> isn't really a refcnt. It is just an oversized bitflag.
>

The reason why I have this here is that at a time I only want a single request
to be en-queued.
Now, every time free_page threshold for any zone is met I am checking each zone
for possible reporting.

I think there are two change required here:
1. rename this flag.
2. use refcnt to actually track the usage of page_hinting_config object separately.

> Also I am pretty sure this results in the opportunity to miss pages
> because there is nothing to prevent you from possibly missing a ton of
> pages you could hint on if a large number of pages are pushed out all
> at once and then the system goes idle in terms of memory allocation
> and freeing.

I have failed to reproduce this kind of situation.
I have a test app which simply allocates large memory chunk, touches it and then
exits. In this case, I get most of the memory back.


>
>> +               schedule_work(&phconf->reporting_work);
>> +
>> +       rcu_read_unlock();
>> +}
>> +
>> +/**
>> + * zone_reporting_cleanup - resets the page reporting fields and free the
>> + * bitmap for all the initialized zones.
>> + */
>> +static void zone_reporting_cleanup(void)
>> +{
>> +       struct zone *zone;
>> +
>> +       for_each_populated_zone(zone) {
>> +               /*
>> +                * Bitmap may not be allocated for all the zones if the
>> +                * initialization fails before reaching to the last one.
>> +                */
>> +               if (!zone->bitmap)
>> +                       continue;
>> +               bitmap_free(zone->bitmap);
>> +               zone->bitmap = NULL;
>> +               atomic_set(&zone->free_pages, 0);
>> +       }
>> +}
>> +
>> +static int zone_bitmap_alloc(struct zone *zone)
>> +{
>> +       unsigned long bitmap_size, pages;
>> +
>> +       pages = zone->end_pfn - zone->base_pfn;
>> +       bitmap_size = (pages >> PAGE_REPORTING_MIN_ORDER) + 1;
>> +
>> +       if (!bitmap_size)
>> +               return 0;
>> +
>> +       zone->bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
>> +       if (!zone->bitmap)
>> +               return -ENOMEM;
>> +
>> +       zone->nbits = bitmap_size;
>> +
>> +       return 0;
>> +}
>> +
> So as per your comments in the cover page, the two functions above
> should also probably be plugged into the zone resizing logic somewhere
> so if a zone is resized the bitmap is adjusted.


I think the right way will be to have a common interface which could be called
here and in memory hotplug/unplug case.
Right now, in my prototype, I have a different functions which adjusts the size
of the bitmap based on the memory notifier action.


>
>> +/**
>> + * zone_reporting_init - For each zone initializes the page reporting fields
>> + * and allocates the respective bitmap.
>> + *
>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>> + */
>> +static int zone_reporting_init(void)
>> +{
>> +       struct zone *zone;
>> +       int ret;
>> +
>> +       for_each_populated_zone(zone) {
>> +#ifdef CONFIG_ZONE_DEVICE
>> +               /* we can not report pages which are not in the buddy */
>> +               if (zone_idx(zone) == ZONE_DEVICE)
>> +                       continue;
>> +#endif
> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
> zone will be considered "populated".


hmm, I was not aware of it. I will dig in more about it.


>
>> +               spin_lock(&zone->lock);
>> +               zone->base_pfn = zone->zone_start_pfn;
>> +               zone->end_pfn = zone_end_pfn(zone);
>> +               spin_unlock(&zone->lock);
>> +
>> +               ret = zone_bitmap_alloc(zone);
>> +               if (ret < 0) {
>> +                       zone_reporting_cleanup();
>> +                       return ret;
>> +               }
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +void page_reporting_disable(struct page_reporting_config *phconf)
>> +{
>> +       mutex_lock(&page_reporting_mutex);
>> +
>> +       if (rcu_access_pointer(page_reporting_conf) != phconf)
>> +               return;
>> +
>> +       RCU_INIT_POINTER(page_reporting_conf, NULL);
>> +       synchronize_rcu();
>> +
>> +       /* Cancel any pending reporting request */
>> +       cancel_work_sync(&phconf->reporting_work);
>> +
>> +       /* Free the scatterlist used for isolated pages */
>> +       kfree(phconf->sg);
>> +       phconf->sg = NULL;
>> +
>> +       /* Cleanup the bitmaps and old tracking data */
>> +       zone_reporting_cleanup();
>> +
>> +       mutex_unlock(&page_reporting_mutex);
>> +}
>> +EXPORT_SYMBOL_GPL(page_reporting_disable);
>> +
>> +int page_reporting_enable(struct page_reporting_config *phconf)
>> +{
>> +       int ret = 0;
>> +
>> +       mutex_lock(&page_reporting_mutex);
>> +
>> +       /* check if someone is already using page reporting*/
>> +       if (rcu_access_pointer(page_reporting_conf)) {
>> +               ret = -EBUSY;
>> +               goto out;
>> +       }
>> +
>> +       /* allocate scatterlist to hold isolated pages */
>> +       phconf->sg = kcalloc(phconf->max_pages, sizeof(*phconf->sg),
>> +                            GFP_KERNEL);
>> +       if (!phconf->sg) {
>> +               ret = -ENOMEM;
>> +               goto out;
>> +       }
>> +
>> +       /* initialize each zone's fields required for page reporting */
>> +       ret = zone_reporting_init();
>> +       if (ret < 0) {
>> +               kfree(phconf->sg);
>> +               goto out;
>> +       }
>> +
>> +       atomic_set(&phconf->refcnt, 0);
>> +       INIT_WORK(&phconf->reporting_work, page_reporting_wq);
>> +
>> +       /* assign the configuration object provided by the backend */
>> +       rcu_assign_pointer(page_reporting_conf, phconf);
>> +
>> +out:
>> +       mutex_unlock(&page_reporting_mutex);
>> +       return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(page_reporting_enable);
>> --
>> 2.21.0
>>
-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-12 18:47   ` Alexander Duyck
  2019-08-12 20:04     ` Nitesh Narayan Lal
@ 2019-08-12 20:05     ` David Hildenbrand
  2019-08-13 10:30       ` Nitesh Narayan Lal
  2019-08-14 15:49     ` Nitesh Narayan Lal
  2019-08-30 15:15     ` Nitesh Narayan Lal
  3 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2019-08-12 20:05 UTC (permalink / raw)
  To: Alexander Duyck, Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck

>> ---
>>  include/linux/mmzone.h         |  11 ++
>>  include/linux/page_reporting.h |  63 +++++++
>>  mm/Kconfig                     |   6 +
>>  mm/Makefile                    |   1 +
>>  mm/page_alloc.c                |  42 ++++-
>>  mm/page_reporting.c            | 332 +++++++++++++++++++++++++++++++++
>>  6 files changed, 448 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_reporting.h
>>  create mode 100644 mm/page_reporting.c
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index d77d717c620c..ba5f5b508f25 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -559,6 +559,17 @@ struct zone {
>>         /* Zone statistics */
>>         atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
>>         atomic_long_t           vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>> +#ifdef CONFIG_PAGE_REPORTING
>> +       /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
>> +       unsigned long *bitmap;
>> +       /* Preserve start and end PFN in case they change due to hotplug */
>> +       unsigned long base_pfn;
>> +       unsigned long end_pfn;
>> +       /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
>> +       atomic_t free_pages;
>> +       /* Number of bits required in the bitmap */
>> +       unsigned long nbits;
>> +#endif
>>  } ____cacheline_internodealigned_in_smp;
> 
> Okay, so the original thing this patch set had going for it was that
> it was non-invasive. However, now you are adding a bunch of stuff to
> the zone. That kind of loses the non-invasive argument for this patch
> set compared to mine.
> 

Adding something to "struct zone" is certainly less invasive than core
buddy modifications, just saying (I agree that this is suboptimal. I
would have guessed that all that's needed is a pointer to some private
structure here). However, the migratetype thingy below looks fishy to me.

> If we are going to continue further with this patch set then it might
> be worth looking into dynamically allocating the space you need for
> this block. At a minimum you could probably look at making the bitmap
> an RCU based setup so you could define the base and end along with the
> bitmap. It would probably help to resolve the hotplug issues you still
> need to address.

Yeah, I guess that makes sense.

[...]
>> +
>> +static int process_free_page(struct page *page,
>> +                            struct page_reporting_config *phconf, int count)
>> +{
>> +       int mt, order, ret = 0;
>> +
>> +       mt = get_pageblock_migratetype(page);
>> +       order = page_private(page);
>> +       ret = __isolate_free_page(page, order);
>> +

I just started looking into the wonderful world of
isolation/compaction/migration.

I don't think saving/restoring the migratetype is correct here. AFAIK,
MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
movable pages and up in UNMOVABLE or ordinary kernel allocations on
MOVABLE. So that shouldn't be an issue - I guess.

1. You should never allocate something that is no
MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
CMA here. There should at least be a !is_migrate_isolate_page() check
somewhere

2. set_migratetype_isolate() takes the zone lock, so to avoid racing
with isolation code, you have to hold the zone lock. Your code seems to
do that, so at least you cannot race against isolation.

3. You could end up temporarily allocating something in the
ZONE_MOVABLE. The pages you allocate are, however, not movable. There
would have to be a way to make alloc_contig_range()/offlining code
properly wait until the pages have been processed. Not sure about the
real implications, though - too many details in the code (I wonder if
Alex' series has a way of dealing with that)

When you restore the migratetype, you could suddenly overwrite e.g.,
ISOLATE, which feels wrong.

[...]
> So as per your comments in the cover page, the two functions above
> should also probably be plugged into the zone resizing logic somewhere
> so if a zone is resized the bitmap is adjusted.
> 
>> +/**
>> + * zone_reporting_init - For each zone initializes the page reporting fields
>> + * and allocates the respective bitmap.
>> + *
>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>> + */
>> +static int zone_reporting_init(void)
>> +{
>> +       struct zone *zone;
>> +       int ret;
>> +
>> +       for_each_populated_zone(zone) {
>> +#ifdef CONFIG_ZONE_DEVICE
>> +               /* we can not report pages which are not in the buddy */
>> +               if (zone_idx(zone) == ZONE_DEVICE)
>> +                       continue;
>> +#endif
> 
> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
> zone will be considered "populated".
> 
I think you are right (although it's confusing, we will have present
sections part of a zone but the zone has no present_pages - screams like
a re factoring - leftover from ZONE_DEVICE introduction).

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-12 20:05     ` David Hildenbrand
@ 2019-08-13 10:30       ` Nitesh Narayan Lal
  2019-08-13 10:34         ` David Hildenbrand
  0 siblings, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-13 10:30 UTC (permalink / raw)
  To: David Hildenbrand, Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck


On 8/12/19 4:05 PM, David Hildenbrand wrote:
>>> ---
>>>  include/linux/mmzone.h         |  11 ++
>>>  include/linux/page_reporting.h |  63 +++++++
>>>  mm/Kconfig                     |   6 +
>>>  mm/Makefile                    |   1 +
>>>  mm/page_alloc.c                |  42 ++++-
>>>  mm/page_reporting.c            | 332 +++++++++++++++++++++++++++++++++
>>>  6 files changed, 448 insertions(+), 7 deletions(-)
>>>  create mode 100644 include/linux/page_reporting.h
>>>  create mode 100644 mm/page_reporting.c
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index d77d717c620c..ba5f5b508f25 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -559,6 +559,17 @@ struct zone {
>>>         /* Zone statistics */
>>>         atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
>>>         atomic_long_t           vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>>> +#ifdef CONFIG_PAGE_REPORTING
>>> +       /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
>>> +       unsigned long *bitmap;
>>> +       /* Preserve start and end PFN in case they change due to hotplug */
>>> +       unsigned long base_pfn;
>>> +       unsigned long end_pfn;
>>> +       /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
>>> +       atomic_t free_pages;
>>> +       /* Number of bits required in the bitmap */
>>> +       unsigned long nbits;
>>> +#endif
>>>  } ____cacheline_internodealigned_in_smp;
>> Okay, so the original thing this patch set had going for it was that
>> it was non-invasive. However, now you are adding a bunch of stuff to
>> the zone. That kind of loses the non-invasive argument for this patch
>> set compared to mine.
>>
> Adding something to "struct zone" is certainly less invasive than core
> buddy modifications, just saying (I agree that this is suboptimal. I
> would have guessed that all that's needed is a pointer to some private
> structure here). 


I think having just a pointer to a private structure makes sense here.
If I am not wrong then I can probably make an allocation for it for each
populated zone at the time I enable page reporting.

> However, the migratetype thingy below looks fishy to me.
>
>> If we are going to continue further with this patch set then it might
>> be worth looking into dynamically allocating the space you need for
>> this block. At a minimum you could probably look at making the bitmap
>> an RCU based setup so you could define the base and end along with the
>> bitmap. It would probably help to resolve the hotplug issues you still
>> need to address.
> Yeah, I guess that makes sense.
>
> [...]
>>> +
>>> +static int process_free_page(struct page *page,
>>> +                            struct page_reporting_config *phconf, int count)
>>> +{
>>> +       int mt, order, ret = 0;
>>> +
>>> +       mt = get_pageblock_migratetype(page);
>>> +       order = page_private(page);
>>> +       ret = __isolate_free_page(page, order);
>>> +
> I just started looking into the wonderful world of
> isolation/compaction/migration.
>
> I don't think saving/restoring the migratetype is correct here. AFAIK,
> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
> movable pages and up in UNMOVABLE or ordinary kernel allocations on
> MOVABLE. So that shouldn't be an issue - I guess.
>
> 1. You should never allocate something that is no
> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
> CMA here. There should at least be a !is_migrate_isolate_page() check
> somewhere
>
> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
> with isolation code, you have to hold the zone lock. Your code seems to
> do that, so at least you cannot race against isolation.
>
> 3. You could end up temporarily allocating something in the
> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
> would have to be a way to make alloc_contig_range()/offlining code
> properly wait until the pages have been processed. Not sure about the
> real implications, though - too many details in the code (I wonder if
> Alex' series has a way of dealing with that)
>
> When you restore the migratetype, you could suddenly overwrite e.g.,
> ISOLATE, which feels wrong.


I was triggering an occasional CPU stall bug earlier, with saving and restoring
the migratetype I was able to fix it.
But I will further look into this to figure out if it is really required.

> [...]
>> So as per your comments in the cover page, the two functions above
>> should also probably be plugged into the zone resizing logic somewhere
>> so if a zone is resized the bitmap is adjusted.
>>
>>> +/**
>>> + * zone_reporting_init - For each zone initializes the page reporting fields
>>> + * and allocates the respective bitmap.
>>> + *
>>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>>> + */
>>> +static int zone_reporting_init(void)
>>> +{
>>> +       struct zone *zone;
>>> +       int ret;
>>> +
>>> +       for_each_populated_zone(zone) {
>>> +#ifdef CONFIG_ZONE_DEVICE
>>> +               /* we can not report pages which are not in the buddy */
>>> +               if (zone_idx(zone) == ZONE_DEVICE)
>>> +                       continue;
>>> +#endif
>> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
>> zone will be considered "populated".
>>
> I think you are right (although it's confusing, we will have present
> sections part of a zone but the zone has no present_pages - screams like
> a re factoring - leftover from ZONE_DEVICE introduction).


I think in that case it is safe to have this check here.
What do you guys suggest?


>
-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-13 10:30       ` Nitesh Narayan Lal
@ 2019-08-13 10:34         ` David Hildenbrand
  2019-08-13 10:42           ` Nitesh Narayan Lal
  2019-08-13 23:14           ` Alexander Duyck
  0 siblings, 2 replies; 35+ messages in thread
From: David Hildenbrand @ 2019-08-13 10:34 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck

>>>> +static int process_free_page(struct page *page,
>>>> +                            struct page_reporting_config *phconf, int count)
>>>> +{
>>>> +       int mt, order, ret = 0;
>>>> +
>>>> +       mt = get_pageblock_migratetype(page);
>>>> +       order = page_private(page);
>>>> +       ret = __isolate_free_page(page, order);
>>>> +
>> I just started looking into the wonderful world of
>> isolation/compaction/migration.
>>
>> I don't think saving/restoring the migratetype is correct here. AFAIK,
>> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
>> movable pages and up in UNMOVABLE or ordinary kernel allocations on
>> MOVABLE. So that shouldn't be an issue - I guess.
>>
>> 1. You should never allocate something that is no
>> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
>> CMA here. There should at least be a !is_migrate_isolate_page() check
>> somewhere
>>
>> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
>> with isolation code, you have to hold the zone lock. Your code seems to
>> do that, so at least you cannot race against isolation.
>>
>> 3. You could end up temporarily allocating something in the
>> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
>> would have to be a way to make alloc_contig_range()/offlining code
>> properly wait until the pages have been processed. Not sure about the
>> real implications, though - too many details in the code (I wonder if
>> Alex' series has a way of dealing with that)
>>
>> When you restore the migratetype, you could suddenly overwrite e.g.,
>> ISOLATE, which feels wrong.
> 
> 
> I was triggering an occasional CPU stall bug earlier, with saving and restoring
> the migratetype I was able to fix it.
> But I will further look into this to figure out if it is really required.
> 

You should especially look into handling isolated/cma pages. Maybe that
was the original issue. Alex seems to have added that in his latest
series (skipping isolated/cma pageblocks completely) as well.

>> [...]
>>> So as per your comments in the cover page, the two functions above
>>> should also probably be plugged into the zone resizing logic somewhere
>>> so if a zone is resized the bitmap is adjusted.
>>>
>>>> +/**
>>>> + * zone_reporting_init - For each zone initializes the page reporting fields
>>>> + * and allocates the respective bitmap.
>>>> + *
>>>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>>>> + */
>>>> +static int zone_reporting_init(void)
>>>> +{
>>>> +       struct zone *zone;
>>>> +       int ret;
>>>> +
>>>> +       for_each_populated_zone(zone) {
>>>> +#ifdef CONFIG_ZONE_DEVICE
>>>> +               /* we can not report pages which are not in the buddy */
>>>> +               if (zone_idx(zone) == ZONE_DEVICE)
>>>> +                       continue;
>>>> +#endif
>>> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
>>> zone will be considered "populated".
>>>
>> I think you are right (although it's confusing, we will have present
>> sections part of a zone but the zone has no present_pages - screams like
>> a re factoring - leftover from ZONE_DEVICE introduction).
> 
> 
> I think in that case it is safe to have this check here.
> What do you guys suggest?

If it's not needed, I'd say drop it (eventually add a comment).


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-13 10:34         ` David Hildenbrand
@ 2019-08-13 10:42           ` Nitesh Narayan Lal
  2019-08-13 10:44             ` David Hildenbrand
  2019-08-13 23:14           ` Alexander Duyck
  1 sibling, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-13 10:42 UTC (permalink / raw)
  To: David Hildenbrand, Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck


On 8/13/19 6:34 AM, David Hildenbrand wrote:
>>>>> +static int process_free_page(struct page *page,
>>>>> +                            struct page_reporting_config *phconf, int count)
>>>>> +{
>>>>> +       int mt, order, ret = 0;
[...]
>>>>> +/**
>>>>> + * zone_reporting_init - For each zone initializes the page reporting fields
>>>>> + * and allocates the respective bitmap.
>>>>> + *
>>>>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>>>>> + */
>>>>> +static int zone_reporting_init(void)
>>>>> +{
>>>>> +       struct zone *zone;
>>>>> +       int ret;
>>>>> +
>>>>> +       for_each_populated_zone(zone) {
>>>>> +#ifdef CONFIG_ZONE_DEVICE
>>>>> +               /* we can not report pages which are not in the buddy */
>>>>> +               if (zone_idx(zone) == ZONE_DEVICE)
>>>>> +                       continue;
>>>>> +#endif
>>>> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
>>>> zone will be considered "populated".
>>>>
>>> I think you are right (although it's confusing, we will have present
>>> sections part of a zone but the zone has no present_pages - screams like
>>> a re factoring - leftover from ZONE_DEVICE introduction).
>>
>> I think in that case it is safe to have this check here.
>> What do you guys suggest?
> If it's not needed, I'd say drop it (eventually add a comment).


Comment to mention that we do not expect a zone with non-buddy page to be
initialized here?

>
>
-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-13 10:42           ` Nitesh Narayan Lal
@ 2019-08-13 10:44             ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2019-08-13 10:44 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck

On 13.08.19 12:42, Nitesh Narayan Lal wrote:
> 
> On 8/13/19 6:34 AM, David Hildenbrand wrote:
>>>>>> +static int process_free_page(struct page *page,
>>>>>> +                            struct page_reporting_config *phconf, int count)
>>>>>> +{
>>>>>> +       int mt, order, ret = 0;
> [...]
>>>>>> +/**
>>>>>> + * zone_reporting_init - For each zone initializes the page reporting fields
>>>>>> + * and allocates the respective bitmap.
>>>>>> + *
>>>>>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>>>>>> + */
>>>>>> +static int zone_reporting_init(void)
>>>>>> +{
>>>>>> +       struct zone *zone;
>>>>>> +       int ret;
>>>>>> +
>>>>>> +       for_each_populated_zone(zone) {
>>>>>> +#ifdef CONFIG_ZONE_DEVICE
>>>>>> +               /* we can not report pages which are not in the buddy */
>>>>>> +               if (zone_idx(zone) == ZONE_DEVICE)
>>>>>> +                       continue;
>>>>>> +#endif
>>>>> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
>>>>> zone will be considered "populated".
>>>>>
>>>> I think you are right (although it's confusing, we will have present
>>>> sections part of a zone but the zone has no present_pages - screams like
>>>> a re factoring - leftover from ZONE_DEVICE introduction).
>>>
>>> I think in that case it is safe to have this check here.
>>> What do you guys suggest?
>> If it's not needed, I'd say drop it (eventually add a comment).
> 
> 
> Comment to mention that we do not expect a zone with non-buddy page to be
> initialized here?

Something along these lines, or something like

/* ZONE_DEVICE is never considered populated */

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-13 10:34         ` David Hildenbrand
  2019-08-13 10:42           ` Nitesh Narayan Lal
@ 2019-08-13 23:14           ` Alexander Duyck
  2019-08-14  7:07             ` David Hildenbrand
  1 sibling, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2019-08-13 23:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm list, LKML, linux-mm, virtio-dev,
	Paolo Bonzini, lcapitulino, Pankaj Gupta, Wang, Wei W,
	Yang Zhang, Rik van Riel, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, john.starks,
	Dave Hansen, Michal Hocko, cohuck

On Tue, Aug 13, 2019 at 3:34 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>>> +static int process_free_page(struct page *page,
> >>>> +                            struct page_reporting_config *phconf, int count)
> >>>> +{
> >>>> +       int mt, order, ret = 0;
> >>>> +
> >>>> +       mt = get_pageblock_migratetype(page);
> >>>> +       order = page_private(page);
> >>>> +       ret = __isolate_free_page(page, order);
> >>>> +
> >> I just started looking into the wonderful world of
> >> isolation/compaction/migration.
> >>
> >> I don't think saving/restoring the migratetype is correct here. AFAIK,
> >> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
> >> movable pages and up in UNMOVABLE or ordinary kernel allocations on
> >> MOVABLE. So that shouldn't be an issue - I guess.
> >>
> >> 1. You should never allocate something that is no
> >> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
> >> CMA here. There should at least be a !is_migrate_isolate_page() check
> >> somewhere
> >>
> >> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
> >> with isolation code, you have to hold the zone lock. Your code seems to
> >> do that, so at least you cannot race against isolation.
> >>
> >> 3. You could end up temporarily allocating something in the
> >> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
> >> would have to be a way to make alloc_contig_range()/offlining code
> >> properly wait until the pages have been processed. Not sure about the
> >> real implications, though - too many details in the code (I wonder if
> >> Alex' series has a way of dealing with that)
> >>
> >> When you restore the migratetype, you could suddenly overwrite e.g.,
> >> ISOLATE, which feels wrong.
> >
> >
> > I was triggering an occasional CPU stall bug earlier, with saving and restoring
> > the migratetype I was able to fix it.
> > But I will further look into this to figure out if it is really required.
> >
>
> You should especially look into handling isolated/cma pages. Maybe that
> was the original issue. Alex seems to have added that in his latest
> series (skipping isolated/cma pageblocks completely) as well.

So as far as skipping isolated pageblocks, I get the reason for
skipping isolated, but why would we need to skip CMA? I had made the
change I did based on comments you had made earlier. But while working
on some of the changes to address isolation better and looking over
several spots in the code it seems like CMA is already being used as
an allocation fallback for MIGRATE_MOVABLE. If that is the case
wouldn't it make sense to allow pulling pages and reporting them while
they are in the free_list?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-13 23:14           ` Alexander Duyck
@ 2019-08-14  7:07             ` David Hildenbrand
  2019-08-14 12:49               ` [virtio-dev] " Nitesh Narayan Lal
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2019-08-14  7:07 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Nitesh Narayan Lal, kvm list, LKML, linux-mm, virtio-dev,
	Paolo Bonzini, lcapitulino, Pankaj Gupta, Wang, Wei W,
	Yang Zhang, Rik van Riel, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, john.starks,
	Dave Hansen, Michal Hocko, cohuck

On 14.08.19 01:14, Alexander Duyck wrote:
> On Tue, Aug 13, 2019 at 3:34 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>>>> +static int process_free_page(struct page *page,
>>>>>> +                            struct page_reporting_config *phconf, int count)
>>>>>> +{
>>>>>> +       int mt, order, ret = 0;
>>>>>> +
>>>>>> +       mt = get_pageblock_migratetype(page);
>>>>>> +       order = page_private(page);
>>>>>> +       ret = __isolate_free_page(page, order);
>>>>>> +
>>>> I just started looking into the wonderful world of
>>>> isolation/compaction/migration.
>>>>
>>>> I don't think saving/restoring the migratetype is correct here. AFAIK,
>>>> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
>>>> movable pages and up in UNMOVABLE or ordinary kernel allocations on
>>>> MOVABLE. So that shouldn't be an issue - I guess.
>>>>
>>>> 1. You should never allocate something that is no
>>>> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
>>>> CMA here. There should at least be a !is_migrate_isolate_page() check
>>>> somewhere
>>>>
>>>> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
>>>> with isolation code, you have to hold the zone lock. Your code seems to
>>>> do that, so at least you cannot race against isolation.
>>>>
>>>> 3. You could end up temporarily allocating something in the
>>>> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
>>>> would have to be a way to make alloc_contig_range()/offlining code
>>>> properly wait until the pages have been processed. Not sure about the
>>>> real implications, though - too many details in the code (I wonder if
>>>> Alex' series has a way of dealing with that)
>>>>
>>>> When you restore the migratetype, you could suddenly overwrite e.g.,
>>>> ISOLATE, which feels wrong.
>>>
>>>
>>> I was triggering an occasional CPU stall bug earlier, with saving and restoring
>>> the migratetype I was able to fix it.
>>> But I will further look into this to figure out if it is really required.
>>>
>>
>> You should especially look into handling isolated/cma pages. Maybe that
>> was the original issue. Alex seems to have added that in his latest
>> series (skipping isolated/cma pageblocks completely) as well.
> 
> So as far as skipping isolated pageblocks, I get the reason for
> skipping isolated, but why would we need to skip CMA? I had made the
> change I did based on comments you had made earlier. But while working
> on some of the changes to address isolation better and looking over
> several spots in the code it seems like CMA is already being used as
> an allocation fallback for MIGRATE_MOVABLE. If that is the case
> wouldn't it make sense to allow pulling pages and reporting them while
> they are in the free_list?

I was assuming that CMA is also to be skipped because "static int
fallbacks[MIGRATE_TYPES][4]" in mm/page_alloc.c doesn't handle CMA at
all, meaning we should never fallback to CMA or from CMA to another type
- at least when stealing pages from another migratetype. So it smells
like MIGRATE_CMA is static -> the area is marked once and will never be
converted to something else (except MIGRATE_ISOLATE temporarily).

I assume you are talking about gfp_to_alloc_flags()/prepare_alloc_pages():

#ifdef CONFIG_CMA
	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
		alloc_flags |= ALLOC_CMA;
#endif

Yeah, this looks like MOVABLE allocations can fallback to CMA
pageblocks. And from what I read, "CMA may use its own migratetype
(MIGRATE_CMA) which behaves similarly to ZONE_MOVABLE but can be put in
arbitrary places."

So I think you are right, it could be that it is safe to temporarily
pull out CMA pages (in contrast to isolated pages) - assuming it is fine
to have temporary unmovable allocations on them (different discussion).

(I am learning about the details as we discuss :) )

The important part would then be to never allocate from the isolated
pageblocks and to never overwrite MIGRATE_ISOLATE.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 2/2] virtio-balloon: interface to support free page reporting
  2019-08-12 13:12 ` [RFC][Patch v12 2/2] virtio-balloon: interface to support free page reporting Nitesh Narayan Lal
@ 2019-08-14 10:29   ` Cornelia Huck
  2019-08-14 11:47     ` Nitesh Narayan Lal
  0 siblings, 1 reply; 35+ messages in thread
From: Cornelia Huck @ 2019-08-14 10:29 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, linux-mm, virtio-dev, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	dave.hansen, mhocko

On Mon, 12 Aug 2019 09:12:35 -0400
Nitesh Narayan Lal <nitesh@redhat.com> wrote:

> Enables the kernel to negotiate VIRTIO_BALLOON_F_REPORTING feature with
> the host. If it is available and page_reporting_flag is set to true,
> page_reporting is enabled and its callback is configured along with
> the max_pages count which indicates the maximum number of pages that
> can be isolated and reported at a time. Currently, only free pages of
> order >= (MAX_ORDER - 2) are reported. To prevent any false OOM
> max_pages count is set to 16.
> 
> By default page_reporting feature is enabled and gets loaded as soon
> as the virtio-balloon driver is loaded. However, it could be disabled
> by writing the page_reporting_flag which is a virtio-balloon parameter.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/virtio/Kconfig              |  1 +
>  drivers/virtio/virtio_balloon.c     | 64 ++++++++++++++++++++++++++++-
>  include/uapi/linux/virtio_balloon.h |  1 +
>  3 files changed, 65 insertions(+), 1 deletion(-)
> 

> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 226fbb995fb0..defec00d4ee2 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c

(...)

> +static void virtballoon_page_reporting_setup(struct virtio_balloon *vb)
> +{
> +	struct device *dev = &vb->vdev->dev;
> +	int err;
> +
> +	vb->page_reporting_conf.report = virtballoon_report_pages;
> +	vb->page_reporting_conf.max_pages = PAGE_REPORTING_MAX_PAGES;
> +	err = page_reporting_enable(&vb->page_reporting_conf);
> +	if (err < 0) {
> +		dev_err(dev, "Failed to enable reporting, err = %d\n", err);
> +		page_reporting_flag = false;

Should we clear the feature bit in this case as well?

> +		vb->page_reporting_conf.report = NULL;
> +		vb->page_reporting_conf.max_pages = 0;
> +		return;
> +	}
> +}
> +
>  static void set_page_pfns(struct virtio_balloon *vb,
>  			  __virtio32 pfns[], struct page *page)
>  {
> @@ -476,6 +524,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>  	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>  	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> +	names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
>  
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>  		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> @@ -487,11 +536,18 @@ static int init_vqs(struct virtio_balloon *vb)
>  		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>  	}
>  
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> +		names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
> +		callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;

Do we even want to try to set up the reporting queue if reporting has
been disabled via module parameter? Might make more sense to not even
negotiate the feature bit in that case.

> +	}
>  	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>  					 vqs, callbacks, names, NULL, NULL);
>  	if (err)
>  		return err;
>  
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> +		vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
> +
>  	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>  	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> @@ -924,6 +980,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		if (err)
>  			goto out_del_balloon_wq;
>  	}
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING) &&
> +	    page_reporting_flag)
> +		virtballoon_page_reporting_setup(vb);

In that case, you'd only need to check for the feature bit here.

>  	virtio_device_ready(vdev);
>  
>  	if (towards_target(vb))
> @@ -971,6 +1030,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  		destroy_workqueue(vb->balloon_wq);
>  	}
>  
> +	if (page_reporting_flag)
> +		page_reporting_disable(&vb->page_reporting_conf);

I think you should not disable it if the feature bit was never offered
in the first place, right?

>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
>  	if (vb->vb_dev_info.inode)

(...)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 2/2] virtio-balloon: interface to support free page reporting
  2019-08-14 10:29   ` Cornelia Huck
@ 2019-08-14 11:47     ` Nitesh Narayan Lal
  2019-08-14 13:42       ` Cornelia Huck
  0 siblings, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-14 11:47 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kvm, linux-kernel, linux-mm, virtio-dev, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	dave.hansen, mhocko


On 8/14/19 6:29 AM, Cornelia Huck wrote:
> On Mon, 12 Aug 2019 09:12:35 -0400
> Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>> Enables the kernel to negotiate VIRTIO_BALLOON_F_REPORTING feature with
>> the host. If it is available and page_reporting_flag is set to true,
>> page_reporting is enabled and its callback is configured along with
>> the max_pages count which indicates the maximum number of pages that
>> can be isolated and reported at a time. Currently, only free pages of
>> order >= (MAX_ORDER - 2) are reported. To prevent any false OOM
>> max_pages count is set to 16.
>>
>> By default page_reporting feature is enabled and gets loaded as soon
>> as the virtio-balloon driver is loaded. However, it could be disabled
>> by writing the page_reporting_flag which is a virtio-balloon parameter.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/virtio/Kconfig              |  1 +
>>  drivers/virtio/virtio_balloon.c     | 64 ++++++++++++++++++++++++++++-
>>  include/uapi/linux/virtio_balloon.h |  1 +
>>  3 files changed, 65 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 226fbb995fb0..defec00d4ee2 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
> (...)
>
>> +static void virtballoon_page_reporting_setup(struct virtio_balloon *vb)
>> +{
>> +	struct device *dev = &vb->vdev->dev;
>> +	int err;
>> +
>> +	vb->page_reporting_conf.report = virtballoon_report_pages;
>> +	vb->page_reporting_conf.max_pages = PAGE_REPORTING_MAX_PAGES;
>> +	err = page_reporting_enable(&vb->page_reporting_conf);
>> +	if (err < 0) {
>> +		dev_err(dev, "Failed to enable reporting, err = %d\n", err);
>> +		page_reporting_flag = false;
> Should we clear the feature bit in this case as well?

I think yes.
If I am not wrong then in a case where page reporting setup fails for some
reason and at a later point the user wants to re-enable it then for that balloon
driver has to be reloaded.
Which would mean re-negotiation of the feature bit.

>
>> +		vb->page_reporting_conf.report = NULL;
>> +		vb->page_reporting_conf.max_pages = 0;
>> +		return;
>> +	}
>> +}
>> +
>>  static void set_page_pfns(struct virtio_balloon *vb,
>>  			  __virtio32 pfns[], struct page *page)
>>  {
>> @@ -476,6 +524,7 @@ static int init_vqs(struct virtio_balloon *vb)
>>  	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>>  	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>>  	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>> +	names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
>>  
>>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>>  		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
>> @@ -487,11 +536,18 @@ static int init_vqs(struct virtio_balloon *vb)
>>  		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>>  	}
>>  
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
>> +		names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
>> +		callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
> Do we even want to try to set up the reporting queue if reporting has
> been disabled via module parameter? Might make more sense to not even
> negotiate the feature bit in that case.

True.
I think this should be replaced with something like (page_reporting_flag &&
virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)).

>
>> +	}
>>  	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>>  					 vqs, callbacks, names, NULL, NULL);
>>  	if (err)
>>  		return err;
>>  
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
>> +		vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
>> +
>>  	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>>  	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>> @@ -924,6 +980,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>>  		if (err)
>>  			goto out_del_balloon_wq;
>>  	}
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING) &&
>> +	    page_reporting_flag)
>> +		virtballoon_page_reporting_setup(vb);
> In that case, you'd only need to check for the feature bit here.

Why is that?
I think both the checks should be present here as we need both the conditions to
be true to enable page reporting.
However, the order should be reversed because of the reason you mentioned earlier.

>
>>  	virtio_device_ready(vdev);
>>  
>>  	if (towards_target(vb))
>> @@ -971,6 +1030,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
>>  		destroy_workqueue(vb->balloon_wq);
>>  	}
>>  
>> +	if (page_reporting_flag)
>> +		page_reporting_disable(&vb->page_reporting_conf);
> I think you should not disable it if the feature bit was never offered
> in the first place, right?

Yes.
I should check both feature bit and module parameter here.

>
>>  	remove_common(vb);
>>  #ifdef CONFIG_BALLOON_COMPACTION
>>  	if (vb->vb_dev_info.inode)
> (...)
-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-14  7:07             ` David Hildenbrand
@ 2019-08-14 12:49               ` Nitesh Narayan Lal
  0 siblings, 0 replies; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-14 12:49 UTC (permalink / raw)
  To: David Hildenbrand, Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	Pankaj Gupta, Wang, Wei W, Yang Zhang, Rik van Riel,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck


On 8/14/19 3:07 AM, David Hildenbrand wrote:
> On 14.08.19 01:14, Alexander Duyck wrote:
>> On Tue, Aug 13, 2019 at 3:34 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>> +static int process_free_page(struct page *page,
>>>>>>> +                            struct page_reporting_config *phconf, int count)
>>>>>>> +{
>>>>>>> +       int mt, order, ret = 0;
>>>>>>> +
>>>>>>> +       mt = get_pageblock_migratetype(page);
>>>>>>> +       order = page_private(page);
>>>>>>> +       ret = __isolate_free_page(page, order);
>>>>>>> +
>>>>> I just started looking into the wonderful world of
>>>>> isolation/compaction/migration.
>>>>>
>>>>> I don't think saving/restoring the migratetype is correct here. AFAIK,
>>>>> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
>>>>> movable pages and up in UNMOVABLE or ordinary kernel allocations on
>>>>> MOVABLE. So that shouldn't be an issue - I guess.
>>>>>
>>>>> 1. You should never allocate something that is no
>>>>> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
>>>>> CMA here. There should at least be a !is_migrate_isolate_page() check
>>>>> somewhere
>>>>>
>>>>> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
>>>>> with isolation code, you have to hold the zone lock. Your code seems to
>>>>> do that, so at least you cannot race against isolation.
>>>>>
>>>>> 3. You could end up temporarily allocating something in the
>>>>> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
>>>>> would have to be a way to make alloc_contig_range()/offlining code
>>>>> properly wait until the pages have been processed. Not sure about the
>>>>> real implications, though - too many details in the code (I wonder if
>>>>> Alex' series has a way of dealing with that)
>>>>>
>>>>> When you restore the migratetype, you could suddenly overwrite e.g.,
>>>>> ISOLATE, which feels wrong.
>>>>
>>>> I was triggering an occasional CPU stall bug earlier, with saving and restoring
>>>> the migratetype I was able to fix it.
>>>> But I will further look into this to figure out if it is really required.
>>>>
>>> You should especially look into handling isolated/cma pages. Maybe that
>>> was the original issue. Alex seems to have added that in his latest
>>> series (skipping isolated/cma pageblocks completely) as well.
>> So as far as skipping isolated pageblocks, I get the reason for
>> skipping isolated, but why would we need to skip CMA? I had made the
>> change I did based on comments you had made earlier. But while working
>> on some of the changes to address isolation better and looking over
>> several spots in the code it seems like CMA is already being used as
>> an allocation fallback for MIGRATE_MOVABLE. If that is the case
>> wouldn't it make sense to allow pulling pages and reporting them while
>> they are in the free_list?
> I was assuming that CMA is also to be skipped because "static int
> fallbacks[MIGRATE_TYPES][4]" in mm/page_alloc.c doesn't handle CMA at
> all, meaning we should never fallback to CMA or from CMA to another type
> - at least when stealing pages from another migratetype. So it smells
> like MIGRATE_CMA is static -> the area is marked once and will never be
> converted to something else (except MIGRATE_ISOLATE temporarily).
>
> I assume you are talking about gfp_to_alloc_flags()/prepare_alloc_pages():


I am also trying to look into this to get more understanding of it.
Another thing which I am looking into right now is the difference between
get/set_pcppage_migratetype() and ge/set_pageblock_migratetype().
To an extent, I do understand what is the benefit of using
get/set_pcppage_migratetype() by reading the comments. However, I am not sure
how it gets along with MIGRATE_CMA.
Hopefully, I will have an understanding of it soon.

> #ifdef CONFIG_CMA
> 	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> 		alloc_flags |= ALLOC_CMA;
> #endif
>
> Yeah, this looks like MOVABLE allocations can fallback to CMA
> pageblocks. And from what I read, "CMA may use its own migratetype
> (MIGRATE_CMA) which behaves similarly to ZONE_MOVABLE but can be put in
> arbitrary places."
>
> So I think you are right, it could be that it is safe to temporarily
> pull out CMA pages (in contrast to isolated pages) - assuming it is fine
> to have temporary unmovable allocations on them (different discussion).
>
> (I am learning about the details as we discuss :) )
>
> The important part would then be to never allocate from the isolated
> pageblocks and to never overwrite MIGRATE_ISOLATE.


Agreed. I think I should just avoid isolating pages with migratetype
MIGRATE_ISOLATE.
Adding a check with is_migrate_isolate_page() before isolating the page should
do it.


>
-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 2/2] virtio-balloon: interface to support free page reporting
  2019-08-14 11:47     ` Nitesh Narayan Lal
@ 2019-08-14 13:42       ` Cornelia Huck
  2019-08-14 14:01         ` Nitesh Narayan Lal
  0 siblings, 1 reply; 35+ messages in thread
From: Cornelia Huck @ 2019-08-14 13:42 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, linux-mm, virtio-dev, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	dave.hansen, mhocko

On Wed, 14 Aug 2019 07:47:40 -0400
Nitesh Narayan Lal <nitesh@redhat.com> wrote:

> On 8/14/19 6:29 AM, Cornelia Huck wrote:
> > On Mon, 12 Aug 2019 09:12:35 -0400
> > Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >  
> >> Enables the kernel to negotiate VIRTIO_BALLOON_F_REPORTING feature with
> >> the host. If it is available and page_reporting_flag is set to true,
> >> page_reporting is enabled and its callback is configured along with
> >> the max_pages count which indicates the maximum number of pages that
> >> can be isolated and reported at a time. Currently, only free pages of
> >> order >= (MAX_ORDER - 2) are reported. To prevent any false OOM
> >> max_pages count is set to 16.
> >>
> >> By default page_reporting feature is enabled and gets loaded as soon
> >> as the virtio-balloon driver is loaded. However, it could be disabled
> >> by writing the page_reporting_flag which is a virtio-balloon parameter.
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >> ---
> >>  drivers/virtio/Kconfig              |  1 +
> >>  drivers/virtio/virtio_balloon.c     | 64 ++++++++++++++++++++++++++++-
> >>  include/uapi/linux/virtio_balloon.h |  1 +
> >>  3 files changed, 65 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> >> index 226fbb995fb0..defec00d4ee2 100644
> >> --- a/drivers/virtio/virtio_balloon.c
> >> +++ b/drivers/virtio/virtio_balloon.c  
> > (...)
> >  
> >> +static void virtballoon_page_reporting_setup(struct virtio_balloon *vb)
> >> +{
> >> +	struct device *dev = &vb->vdev->dev;
> >> +	int err;
> >> +
> >> +	vb->page_reporting_conf.report = virtballoon_report_pages;
> >> +	vb->page_reporting_conf.max_pages = PAGE_REPORTING_MAX_PAGES;
> >> +	err = page_reporting_enable(&vb->page_reporting_conf);
> >> +	if (err < 0) {
> >> +		dev_err(dev, "Failed to enable reporting, err = %d\n", err);
> >> +		page_reporting_flag = false;  
> > Should we clear the feature bit in this case as well?  
> 
> I think yes.

Eww, I didn't recall that we don't call the ->probe callback until
after feature negotiation has finished, so scratch that particular idea.

For what reasons may page_reporting_enable() fail? Does it make sense
to fail probing the device in that case? And does it make sense to
re-try later (i.e. leave page_reporting_flag set)?

> If I am not wrong then in a case where page reporting setup fails for some
> reason and at a later point the user wants to re-enable it then for that balloon
> driver has to be reloaded.
> Which would mean re-negotiation of the feature bit.

Re-negotiation actually already happens if a driver is unbound and
rebound.

> 
> >  
> >> +		vb->page_reporting_conf.report = NULL;
> >> +		vb->page_reporting_conf.max_pages = 0;
> >> +		return;
> >> +	}
> >> +}
> >> +
> >>  static void set_page_pfns(struct virtio_balloon *vb,
> >>  			  __virtio32 pfns[], struct page *page)
> >>  {
> >> @@ -476,6 +524,7 @@ static int init_vqs(struct virtio_balloon *vb)
> >>  	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> >>  	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> >>  	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> >> +	names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
> >>  
> >>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> >>  		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> >> @@ -487,11 +536,18 @@ static int init_vqs(struct virtio_balloon *vb)
> >>  		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> >>  	}
> >>  
> >> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> >> +		names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
> >> +		callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;  
> > Do we even want to try to set up the reporting queue if reporting has
> > been disabled via module parameter? Might make more sense to not even
> > negotiate the feature bit in that case.  
> 
> True.
> I think this should be replaced with something like (page_reporting_flag &&
> virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)).

Yes.

Is page_reporting_flag supposed to be changeable on the fly? The only
way to really turn off the feature bit from the driver is to not pass
in the feature in the features table; we could provide two different
tables depending on the flag if it were static.

> 
> >  
> >> +	}
> >>  	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> >>  					 vqs, callbacks, names, NULL, NULL);
> >>  	if (err)
> >>  		return err;
> >>  
> >> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> >> +		vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
> >> +
> >>  	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> >>  	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> >>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> >> @@ -924,6 +980,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
> >>  		if (err)
> >>  			goto out_del_balloon_wq;
> >>  	}
> >> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING) &&
> >> +	    page_reporting_flag)
> >> +		virtballoon_page_reporting_setup(vb);  
> > In that case, you'd only need to check for the feature bit here.  
> 
> Why is that?
> I think both the checks should be present here as we need both the conditions to
> be true to enable page reporting.

Yeah, because we can't clear the feature bit if the flag is not set.

> However, the order should be reversed because of the reason you mentioned earlier.
> 
> >  
> >>  	virtio_device_ready(vdev);
> >>  
> >>  	if (towards_target(vb))


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 2/2] virtio-balloon: interface to support free page reporting
  2019-08-14 13:42       ` Cornelia Huck
@ 2019-08-14 14:01         ` Nitesh Narayan Lal
  0 siblings, 0 replies; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-14 14:01 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kvm, linux-kernel, linux-mm, virtio-dev, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	dave.hansen, mhocko


On 8/14/19 9:42 AM, Cornelia Huck wrote:
> On Wed, 14 Aug 2019 07:47:40 -0400
> Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>> On 8/14/19 6:29 AM, Cornelia Huck wrote:
>>> On Mon, 12 Aug 2019 09:12:35 -0400
>>> Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>  
>>>> Enables the kernel to negotiate VIRTIO_BALLOON_F_REPORTING feature with
>>>> the host. If it is available and page_reporting_flag is set to true,
>>>> page_reporting is enabled and its callback is configured along with
>>>> the max_pages count which indicates the maximum number of pages that
>>>> can be isolated and reported at a time. Currently, only free pages of
>>>> order >= (MAX_ORDER - 2) are reported. To prevent any false OOM
>>>> max_pages count is set to 16.
>>>>
>>>> By default page_reporting feature is enabled and gets loaded as soon
>>>> as the virtio-balloon driver is loaded. However, it could be disabled
>>>> by writing the page_reporting_flag which is a virtio-balloon parameter.
>>>>
>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>>> ---
>>>>  drivers/virtio/Kconfig              |  1 +
>>>>  drivers/virtio/virtio_balloon.c     | 64 ++++++++++++++++++++++++++++-
>>>>  include/uapi/linux/virtio_balloon.h |  1 +
>>>>  3 files changed, 65 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>>> index 226fbb995fb0..defec00d4ee2 100644
>>>> --- a/drivers/virtio/virtio_balloon.c
>>>> +++ b/drivers/virtio/virtio_balloon.c  
>>> (...)
>>>  
>>>> +static void virtballoon_page_reporting_setup(struct virtio_balloon *vb)
>>>> +{
>>>> +	struct device *dev = &vb->vdev->dev;
>>>> +	int err;
>>>> +
>>>> +	vb->page_reporting_conf.report = virtballoon_report_pages;
>>>> +	vb->page_reporting_conf.max_pages = PAGE_REPORTING_MAX_PAGES;
>>>> +	err = page_reporting_enable(&vb->page_reporting_conf);
>>>> +	if (err < 0) {
>>>> +		dev_err(dev, "Failed to enable reporting, err = %d\n", err);
>>>> +		page_reporting_flag = false;  
>>> Should we clear the feature bit in this case as well?  
>> I think yes.
> Eww, I didn't recall that we don't call the ->probe callback until
> after feature negotiation has finished, so scratch that particular idea.
>
> For what reasons may page_reporting_enable() fail?

If the guest is low in memory and some allocation required for page reporting
setup fails.

>  Does it make sense
> to fail probing the device in that case? And does it make sense to
> re-try later (i.e. leave page_reporting_flag set)?


Re-trying to setup page reporting will mean that virtballoon_probe has to be
called again.
For which the driver has to be re-loaded, isn't?

>
>> If I am not wrong then in a case where page reporting setup fails for some
>> reason and at a later point the user wants to re-enable it then for that balloon
>> driver has to be reloaded.
>> Which would mean re-negotiation of the feature bit.
> Re-negotiation actually already happens if a driver is unbound and
> rebound.
>
>>>  
>>>> +		vb->page_reporting_conf.report = NULL;
>>>> +		vb->page_reporting_conf.max_pages = 0;
>>>> +		return;
>>>> +	}
>>>> +}
>>>> +
>>>>  static void set_page_pfns(struct virtio_balloon *vb,
>>>>  			  __virtio32 pfns[], struct page *page)
>>>>  {
>>>> @@ -476,6 +524,7 @@ static int init_vqs(struct virtio_balloon *vb)
>>>>  	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>>>>  	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>>>>  	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>>>> +	names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
>>>>  
>>>>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>>>>  		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
>>>> @@ -487,11 +536,18 @@ static int init_vqs(struct virtio_balloon *vb)
>>>>  		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>>>>  	}
>>>>  
>>>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
>>>> +		names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
>>>> +		callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;  
>>> Do we even want to try to set up the reporting queue if reporting has
>>> been disabled via module parameter? Might make more sense to not even
>>> negotiate the feature bit in that case.  
>> True.
>> I think this should be replaced with something like (page_reporting_flag &&
>> virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)).
> Yes.
>
> Is page_reporting_flag supposed to be changeable on the fly?


Yes.

>  The only
> way to really turn off the feature bit from the driver is to not pass
> in the feature in the features table; we could provide two different
> tables depending on the flag if it were static.

I did have a plan of moving to static keys eventually instead of using module
parameters for this purpose. :)
That way I will be able to just control the kernel side of things on the fly
without changing the balloon-page-reporting framework.
The objective is to allow the user to enable/disable page tracking on the fly.

>
>>>  
>>>> +	}
>>>>  	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>>>>  					 vqs, callbacks, names, NULL, NULL);
>>>>  	if (err)
>>>>  		return err;
>>>>  
>>>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
>>>> +		vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
>>>> +
>>>>  	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>>>>  	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>>>>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>>>> @@ -924,6 +980,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>>>>  		if (err)
>>>>  			goto out_del_balloon_wq;
>>>>  	}
>>>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING) &&
>>>> +	    page_reporting_flag)
>>>> +		virtballoon_page_reporting_setup(vb);  
>>> In that case, you'd only need to check for the feature bit here.  
>> Why is that?
>> I think both the checks should be present here as we need both the conditions to
>> be true to enable page reporting.
> Yeah, because we can't clear the feature bit if the flag is not set.


+1

>
>> However, the order should be reversed because of the reason you mentioned earlier.
>>
>>>  
>>>>  	virtio_device_ready(vdev);
>>>>  
>>>>  	if (towards_target(vb))
-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-12 18:47   ` Alexander Duyck
  2019-08-12 20:04     ` Nitesh Narayan Lal
  2019-08-12 20:05     ` David Hildenbrand
@ 2019-08-14 15:49     ` Nitesh Narayan Lal
  2019-08-14 16:11       ` Alexander Duyck
  2019-08-30 15:15     ` Nitesh Narayan Lal
  3 siblings, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-14 15:49 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck


On 8/12/19 2:47 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch introduces the core infrastructure for free page reporting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be reported to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>
[...]
>> +}
>> +
>> +/**
>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
>> + */
>> +void __page_reporting_enqueue(struct page *page)
>> +{
>> +       struct page_reporting_config *phconf;
>> +       struct zone *zone;
>> +
>> +       rcu_read_lock();
>> +       /*
>> +        * We should not process this page if either page reporting is not
>> +        * yet completely enabled or it has been disabled by the backend.
>> +        */
>> +       phconf = rcu_dereference(page_reporting_conf);
>> +       if (!phconf)
>> +               return;
>> +
>> +       zone = page_zone(page);
>> +       bitmap_set_bit(page, zone);
>> +
>> +       /*
>> +        * We should not enqueue a job if a previously enqueued reporting work
>> +        * is in progress or we don't have enough free pages in the zone.
>> +        */
>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
> This doesn't make any sense to me. Why are you only incrementing the
> refcount if it is zero? Combining this with the assignment above, this
> isn't really a refcnt. It is just an oversized bitflag.


The intent for having an extra variable was to ensure that at a time only one
reporting job is enqueued. I do agree that for that purpose I really don't need
a reference counter and I should have used something like bool
'page_hinting_active'. But with bool, I think there could be a possible chance
of race. Maybe I should rename this variable and keep it as atomic.
Any thoughts?


>
> Also I am pretty sure this results in the opportunity to miss pages
> because there is nothing to prevent you from possibly missing a ton of
> pages you could hint on if a large number of pages are pushed out all
> at once and then the system goes idle in terms of memory allocation
> and freeing.


I was looking at how you are enqueuing/processing reporting jobs for each zone.
I am wondering if I should also consider something on similar lines as having
that I might be able to address the concern which you have raised above. But it
would also mean that I have to add an additional flag in the zone_flags. :)

>
[...]
>
>> +EXPORT_SYMBOL_GPL(page_reporting_enable);
>> --
>> 2.21.0
>>
-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-14 15:49     ` Nitesh Narayan Lal
@ 2019-08-14 16:11       ` Alexander Duyck
  2019-08-15 13:15         ` Nitesh Narayan Lal
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2019-08-14 16:11 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	Pankaj Gupta, Wang, Wei W, Yang Zhang, Rik van Riel,
	David Hildenbrand, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, john.starks,
	Dave Hansen, Michal Hocko, cohuck

On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 8/12/19 2:47 PM, Alexander Duyck wrote:
> > On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >> This patch introduces the core infrastructure for free page reporting in
> >> virtual environments. It enables the kernel to track the free pages which
> >> can be reported to its hypervisor so that the hypervisor could
> >> free and reuse that memory as per its requirement.
> >>
> >> While the pages are getting processed in the hypervisor (e.g.,
> >> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> >> would be possible. To avoid such a situation, these pages are
> >> temporarily removed from the buddy. The amount of pages removed
> >> temporarily from the buddy is governed by the backend(virtio-balloon
> >> in our case).
> >>
> >> To efficiently identify free pages that can to be reported to the
> >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >> chunks are reported to the hypervisor - especially, to not break up THP
> >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >> in the bitmap are an indication whether a page *might* be free, not a
> >> guarantee. A new hook after buddy merging sets the bits.
> >>
> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >> asynchronously processes the bitmaps, trying to isolate and report pages
> >> that are still free. The backend (virtio-balloon) is responsible for
> >> reporting these batched pages to the host synchronously. Once reporting/
> >> freeing is complete, isolated pages are returned back to the buddy.
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >
> [...]
> >> +}
> >> +
> >> +/**
> >> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
> >> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
> >> + */
> >> +void __page_reporting_enqueue(struct page *page)
> >> +{
> >> +       struct page_reporting_config *phconf;
> >> +       struct zone *zone;
> >> +
> >> +       rcu_read_lock();
> >> +       /*
> >> +        * We should not process this page if either page reporting is not
> >> +        * yet completely enabled or it has been disabled by the backend.
> >> +        */
> >> +       phconf = rcu_dereference(page_reporting_conf);
> >> +       if (!phconf)
> >> +               return;
> >> +
> >> +       zone = page_zone(page);
> >> +       bitmap_set_bit(page, zone);
> >> +
> >> +       /*
> >> +        * We should not enqueue a job if a previously enqueued reporting work
> >> +        * is in progress or we don't have enough free pages in the zone.
> >> +        */
> >> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
> >> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
> > This doesn't make any sense to me. Why are you only incrementing the
> > refcount if it is zero? Combining this with the assignment above, this
> > isn't really a refcnt. It is just an oversized bitflag.
>
>
> The intent for having an extra variable was to ensure that at a time only one
> reporting job is enqueued. I do agree that for that purpose I really don't need
> a reference counter and I should have used something like bool
> 'page_hinting_active'. But with bool, I think there could be a possible chance
> of race. Maybe I should rename this variable and keep it as atomic.
> Any thoughts?

You could just use a bitflag to achieve what you are doing here. That
is the primary use case for many of the test_and_set_bit type
operations. However one issue with doing it as a bitflag is that you
have no way of telling that you took care of all requesters. That is
where having an actual reference count comes in handy as you know
exactly how many zones are requesting to be reported on.

> >
> > Also I am pretty sure this results in the opportunity to miss pages
> > because there is nothing to prevent you from possibly missing a ton of
> > pages you could hint on if a large number of pages are pushed out all
> > at once and then the system goes idle in terms of memory allocation
> > and freeing.
>
>
> I was looking at how you are enqueuing/processing reporting jobs for each zone.
> I am wondering if I should also consider something on similar lines as having
> that I might be able to address the concern which you have raised above. But it
> would also mean that I have to add an additional flag in the zone_flags. :)

You could do it either in the zone or outside the zone as yet another
bitmap. I decided to put the flags inside the zone because there was a
number of free bits there and it should be faster since we were
already using the zone structure.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-14 16:11       ` Alexander Duyck
@ 2019-08-15 13:15         ` Nitesh Narayan Lal
  2019-08-15 19:22           ` Nitesh Narayan Lal
  0 siblings, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-15 13:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	Pankaj Gupta, Wang, Wei W, Yang Zhang, Rik van Riel,
	David Hildenbrand, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, john.starks,
	Dave Hansen, Michal Hocko, cohuck


On 8/14/19 12:11 PM, Alexander Duyck wrote:
> On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> This patch introduces the core infrastructure for free page reporting in
>>>> virtual environments. It enables the kernel to track the free pages which
>>>> can be reported to its hypervisor so that the hypervisor could
>>>> free and reuse that memory as per its requirement.
>>>>
>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>>>> would be possible. To avoid such a situation, these pages are
>>>> temporarily removed from the buddy. The amount of pages removed
>>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>>> in our case).
>>>>
>>>> To efficiently identify free pages that can to be reported to the
>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>
>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>
>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> [...]
>>>> +}
>>>> +
>>>> +/**
>>>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>>>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
>>>> + */
>>>> +void __page_reporting_enqueue(struct page *page)
>>>> +{
>>>> +       struct page_reporting_config *phconf;
>>>> +       struct zone *zone;
>>>> +
>>>> +       rcu_read_lock();
>>>> +       /*
>>>> +        * We should not process this page if either page reporting is not
>>>> +        * yet completely enabled or it has been disabled by the backend.
>>>> +        */
>>>> +       phconf = rcu_dereference(page_reporting_conf);
>>>> +       if (!phconf)
>>>> +               return;
>>>> +
>>>> +       zone = page_zone(page);
>>>> +       bitmap_set_bit(page, zone);
>>>> +
>>>> +       /*
>>>> +        * We should not enqueue a job if a previously enqueued reporting work
>>>> +        * is in progress or we don't have enough free pages in the zone.
>>>> +        */
>>>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
>>>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
>>> This doesn't make any sense to me. Why are you only incrementing the
>>> refcount if it is zero? Combining this with the assignment above, this
>>> isn't really a refcnt. It is just an oversized bitflag.
>>
>> The intent for having an extra variable was to ensure that at a time only one
>> reporting job is enqueued. I do agree that for that purpose I really don't need
>> a reference counter and I should have used something like bool
>> 'page_hinting_active'. But with bool, I think there could be a possible chance
>> of race. Maybe I should rename this variable and keep it as atomic.
>> Any thoughts?
> You could just use a bitflag to achieve what you are doing here. That
> is the primary use case for many of the test_and_set_bit type
> operations. However one issue with doing it as a bitflag is that you
> have no way of telling that you took care of all requesters.

I think you are right, I might end up missing on certain reporting
opportunities in some special cases. Specifically when the pages which are
part of this new reporting request belongs to a section of the bitmap which
has already been scanned. Although, I have failed to reproduce this kind of
situation in an actual setup.

>  That is
> where having an actual reference count comes in handy as you know
> exactly how many zones are requesting to be reported on.


True.

>
>>> Also I am pretty sure this results in the opportunity to miss pages
>>> because there is nothing to prevent you from possibly missing a ton of
>>> pages you could hint on if a large number of pages are pushed out all
>>> at once and then the system goes idle in terms of memory allocation
>>> and freeing.
>>
>> I was looking at how you are enqueuing/processing reporting jobs for each zone.
>> I am wondering if I should also consider something on similar lines as having
>> that I might be able to address the concern which you have raised above. But it
>> would also mean that I have to add an additional flag in the zone_flags. :)
> You could do it either in the zone or outside the zone as yet another
> bitmap. I decided to put the flags inside the zone because there was a
> number of free bits there and it should be faster since we were
> already using the zone structure.

There are two possibilities which could happen while I am reporting:
1. Another request might come in for a different zone.
2. Another request could come in for the same zone and the pages belong to a
    section of the bitmap which has already been scanned.

Having a per zone flag to indicate reporting status will solve the first
issue and to an extent the second as well. Having refcnt will possibly solve
both of them. What I am wondering about is that in my case I could easily
impact the performance negatively by performing more bitmap scanning.


-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-15 13:15         ` Nitesh Narayan Lal
@ 2019-08-15 19:22           ` Nitesh Narayan Lal
  2019-08-15 23:00             ` Alexander Duyck
  0 siblings, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-15 19:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	Pankaj Gupta, Wang, Wei W, Yang Zhang, Rik van Riel,
	David Hildenbrand, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, john.starks,
	Dave Hansen, Michal Hocko, cohuck


On 8/15/19 9:15 AM, Nitesh Narayan Lal wrote:
> On 8/14/19 12:11 PM, Alexander Duyck wrote:
>> On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>>>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>>> This patch introduces the core infrastructure for free page reporting in
>>>>> virtual environments. It enables the kernel to track the free pages which
>>>>> can be reported to its hypervisor so that the hypervisor could
>>>>> free and reuse that memory as per its requirement.
>>>>>
>>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>>>>> would be possible. To avoid such a situation, these pages are
>>>>> temporarily removed from the buddy. The amount of pages removed
>>>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>>>> in our case).
>>>>>
>>>>> To efficiently identify free pages that can to be reported to the
>>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>>
>>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>>
>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>> [...]
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>>>>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
>>>>> + */
>>>>> +void __page_reporting_enqueue(struct page *page)
>>>>> +{
>>>>> +       struct page_reporting_config *phconf;
>>>>> +       struct zone *zone;
>>>>> +
>>>>> +       rcu_read_lock();
>>>>> +       /*
>>>>> +        * We should not process this page if either page reporting is not
>>>>> +        * yet completely enabled or it has been disabled by the backend.
>>>>> +        */
>>>>> +       phconf = rcu_dereference(page_reporting_conf);
>>>>> +       if (!phconf)
>>>>> +               return;
>>>>> +
>>>>> +       zone = page_zone(page);
>>>>> +       bitmap_set_bit(page, zone);
>>>>> +
>>>>> +       /*
>>>>> +        * We should not enqueue a job if a previously enqueued reporting work
>>>>> +        * is in progress or we don't have enough free pages in the zone.
>>>>> +        */
>>>>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
>>>>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
>>>> This doesn't make any sense to me. Why are you only incrementing the
>>>> refcount if it is zero? Combining this with the assignment above, this
>>>> isn't really a refcnt. It is just an oversized bitflag.
>>> The intent for having an extra variable was to ensure that at a time only one
>>> reporting job is enqueued. I do agree that for that purpose I really don't need
>>> a reference counter and I should have used something like bool
>>> 'page_hinting_active'. But with bool, I think there could be a possible chance
>>> of race. Maybe I should rename this variable and keep it as atomic.
>>> Any thoughts?
>> You could just use a bitflag to achieve what you are doing here. That
>> is the primary use case for many of the test_and_set_bit type
>> operations. However one issue with doing it as a bitflag is that you
>> have no way of telling that you took care of all requesters.
> I think you are right, I might end up missing on certain reporting
> opportunities in some special cases. Specifically when the pages which are
> part of this new reporting request belongs to a section of the bitmap which
> has already been scanned. Although, I have failed to reproduce this kind of
> situation in an actual setup.
>
>>  That is
>> where having an actual reference count comes in handy as you know
>> exactly how many zones are requesting to be reported on.
>
> True.
>
>>>> Also I am pretty sure this results in the opportunity to miss pages
>>>> because there is nothing to prevent you from possibly missing a ton of
>>>> pages you could hint on if a large number of pages are pushed out all
>>>> at once and then the system goes idle in terms of memory allocation
>>>> and freeing.
>>> I was looking at how you are enqueuing/processing reporting jobs for each zone.
>>> I am wondering if I should also consider something on similar lines as having
>>> that I might be able to address the concern which you have raised above. But it
>>> would also mean that I have to add an additional flag in the zone_flags. :)
>> You could do it either in the zone or outside the zone as yet another
>> bitmap. I decided to put the flags inside the zone because there was a
>> number of free bits there and it should be faster since we were
>> already using the zone structure.
> There are two possibilities which could happen while I am reporting:
> 1. Another request might come in for a different zone.
> 2. Another request could come in for the same zone and the pages belong to a
>     section of the bitmap which has already been scanned.
>
> Having a per zone flag to indicate reporting status will solve the first
> issue and to an extent the second as well. Having refcnt will possibly solve
> both of them. What I am wondering about is that in my case I could easily
> impact the performance negatively by performing more bitmap scanning.
>
>

I realized that it may not be possible for me to directly adopt either refcnt
or zone flags just because of the way I have page reporting setup right now.

For now, I will just replace the refcnt with a bitflag as that should work
for most of the cases.  Nevertheless, I will also keep looking for a better way.

-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-15 19:22           ` Nitesh Narayan Lal
@ 2019-08-15 23:00             ` Alexander Duyck
  2019-08-16 18:35               ` Nitesh Narayan Lal
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2019-08-15 23:00 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	Pankaj Gupta, Wang, Wei W, Yang Zhang, Rik van Riel,
	David Hildenbrand, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, john.starks,
	Dave Hansen, Michal Hocko, cohuck

On Thu, Aug 15, 2019 at 12:23 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 8/15/19 9:15 AM, Nitesh Narayan Lal wrote:
> > On 8/14/19 12:11 PM, Alexander Duyck wrote:
> >> On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
> >>>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>>>> This patch introduces the core infrastructure for free page reporting in
> >>>>> virtual environments. It enables the kernel to track the free pages which
> >>>>> can be reported to its hypervisor so that the hypervisor could
> >>>>> free and reuse that memory as per its requirement.
> >>>>>
> >>>>> While the pages are getting processed in the hypervisor (e.g.,
> >>>>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> >>>>> would be possible. To avoid such a situation, these pages are
> >>>>> temporarily removed from the buddy. The amount of pages removed
> >>>>> temporarily from the buddy is governed by the backend(virtio-balloon
> >>>>> in our case).
> >>>>>
> >>>>> To efficiently identify free pages that can to be reported to the
> >>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >>>>> chunks are reported to the hypervisor - especially, to not break up THP
> >>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >>>>> in the bitmap are an indication whether a page *might* be free, not a
> >>>>> guarantee. A new hook after buddy merging sets the bits.
> >>>>>
> >>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >>>>> asynchronously processes the bitmaps, trying to isolate and report pages
> >>>>> that are still free. The backend (virtio-balloon) is responsible for
> >>>>> reporting these batched pages to the host synchronously. Once reporting/
> >>>>> freeing is complete, isolated pages are returned back to the buddy.
> >>>>>
> >>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >>> [...]
> >>>>> +}
> >>>>> +
> >>>>> +/**
> >>>>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
> >>>>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
> >>>>> + */
> >>>>> +void __page_reporting_enqueue(struct page *page)
> >>>>> +{
> >>>>> +       struct page_reporting_config *phconf;
> >>>>> +       struct zone *zone;
> >>>>> +
> >>>>> +       rcu_read_lock();
> >>>>> +       /*
> >>>>> +        * We should not process this page if either page reporting is not
> >>>>> +        * yet completely enabled or it has been disabled by the backend.
> >>>>> +        */
> >>>>> +       phconf = rcu_dereference(page_reporting_conf);
> >>>>> +       if (!phconf)
> >>>>> +               return;
> >>>>> +
> >>>>> +       zone = page_zone(page);
> >>>>> +       bitmap_set_bit(page, zone);
> >>>>> +
> >>>>> +       /*
> >>>>> +        * We should not enqueue a job if a previously enqueued reporting work
> >>>>> +        * is in progress or we don't have enough free pages in the zone.
> >>>>> +        */
> >>>>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
> >>>>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
> >>>> This doesn't make any sense to me. Why are you only incrementing the
> >>>> refcount if it is zero? Combining this with the assignment above, this
> >>>> isn't really a refcnt. It is just an oversized bitflag.
> >>> The intent for having an extra variable was to ensure that at a time only one
> >>> reporting job is enqueued. I do agree that for that purpose I really don't need
> >>> a reference counter and I should have used something like bool
> >>> 'page_hinting_active'. But with bool, I think there could be a possible chance
> >>> of race. Maybe I should rename this variable and keep it as atomic.
> >>> Any thoughts?
> >> You could just use a bitflag to achieve what you are doing here. That
> >> is the primary use case for many of the test_and_set_bit type
> >> operations. However one issue with doing it as a bitflag is that you
> >> have no way of telling that you took care of all requesters.
> > I think you are right, I might end up missing on certain reporting
> > opportunities in some special cases. Specifically when the pages which are
> > part of this new reporting request belongs to a section of the bitmap which
> > has already been scanned. Although, I have failed to reproduce this kind of
> > situation in an actual setup.
> >
> >>  That is
> >> where having an actual reference count comes in handy as you know
> >> exactly how many zones are requesting to be reported on.
> >
> > True.
> >
> >>>> Also I am pretty sure this results in the opportunity to miss pages
> >>>> because there is nothing to prevent you from possibly missing a ton of
> >>>> pages you could hint on if a large number of pages are pushed out all
> >>>> at once and then the system goes idle in terms of memory allocation
> >>>> and freeing.
> >>> I was looking at how you are enqueuing/processing reporting jobs for each zone.
> >>> I am wondering if I should also consider something on similar lines as having
> >>> that I might be able to address the concern which you have raised above. But it
> >>> would also mean that I have to add an additional flag in the zone_flags. :)
> >> You could do it either in the zone or outside the zone as yet another
> >> bitmap. I decided to put the flags inside the zone because there was a
> >> number of free bits there and it should be faster since we were
> >> already using the zone structure.
> > There are two possibilities which could happen while I am reporting:
> > 1. Another request might come in for a different zone.
> > 2. Another request could come in for the same zone and the pages belong to a
> >     section of the bitmap which has already been scanned.
> >
> > Having a per zone flag to indicate reporting status will solve the first
> > issue and to an extent the second as well. Having refcnt will possibly solve
> > both of them. What I am wondering about is that in my case I could easily
> > impact the performance negatively by performing more bitmap scanning.
> >
> >
>
> I realized that it may not be possible for me to directly adopt either refcnt
> or zone flags just because of the way I have page reporting setup right now.
>
> For now, I will just replace the refcnt with a bitflag as that should work
> for most of the cases.  Nevertheless, I will also keep looking for a better way.

If nothing else something you could consider is a refcnt for the
number of bits you have set in your bitfield. Then all you would need
to be doing is replace the cmpxchg with just a atomic_fetch_inc and
what you would need to do is have your worker thread track how many
bits it has cleared and subtract that from the refcnt at the end.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-15 23:00             ` Alexander Duyck
@ 2019-08-16 18:35               ` Nitesh Narayan Lal
  0 siblings, 0 replies; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-16 18:35 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	Pankaj Gupta, Wang, Wei W, Yang Zhang, Rik van Riel,
	David Hildenbrand, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, john.starks,
	Dave Hansen, Michal Hocko, cohuck


On 8/15/19 7:00 PM, Alexander Duyck wrote:
> On Thu, Aug 15, 2019 at 12:23 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
[...]
>>>>>>> +}
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>>>>>>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
>>>>>>> + */
>>>>>>> +void __page_reporting_enqueue(struct page *page)
>>>>>>> +{
>>>>>>> +       struct page_reporting_config *phconf;
>>>>>>> +       struct zone *zone;
>>>>>>> +
>>>>>>> +       rcu_read_lock();
>>>>>>> +       /*
>>>>>>> +        * We should not process this page if either page reporting is not
>>>>>>> +        * yet completely enabled or it has been disabled by the backend.
>>>>>>> +        */
>>>>>>> +       phconf = rcu_dereference(page_reporting_conf);
>>>>>>> +       if (!phconf)
>>>>>>> +               return;
>>>>>>> +
>>>>>>> +       zone = page_zone(page);
>>>>>>> +       bitmap_set_bit(page, zone);
>>>>>>> +
>>>>>>> +       /*
>>>>>>> +        * We should not enqueue a job if a previously enqueued reporting work
>>>>>>> +        * is in progress or we don't have enough free pages in the zone.
>>>>>>> +        */
>>>>>>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
>>>>>>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
>>>>>> This doesn't make any sense to me. Why are you only incrementing the
>>>>>> refcount if it is zero? Combining this with the assignment above, this
>>>>>> isn't really a refcnt. It is just an oversized bitflag.
>>>>> The intent for having an extra variable was to ensure that at a time only one
>>>>> reporting job is enqueued. I do agree that for that purpose I really don't need
>>>>> a reference counter and I should have used something like bool
>>>>> 'page_hinting_active'. But with bool, I think there could be a possible chance
>>>>> of race. Maybe I should rename this variable and keep it as atomic.
>>>>> Any thoughts?
>>>> You could just use a bitflag to achieve what you are doing here. That
>>>> is the primary use case for many of the test_and_set_bit type
>>>> operations. However one issue with doing it as a bitflag is that you
>>>> have no way of telling that you took care of all requesters.
>>> I think you are right, I might end up missing on certain reporting
>>> opportunities in some special cases. Specifically when the pages which are
>>> part of this new reporting request belongs to a section of the bitmap which
>>> has already been scanned. Although, I have failed to reproduce this kind of
>>> situation in an actual setup.
>>>
>>>>  That is
>>>> where having an actual reference count comes in handy as you know
>>>> exactly how many zones are requesting to be reported on.
>>> True.
>>>
>>>>>> Also I am pretty sure this results in the opportunity to miss pages
>>>>>> because there is nothing to prevent you from possibly missing a ton of
>>>>>> pages you could hint on if a large number of pages are pushed out all
>>>>>> at once and then the system goes idle in terms of memory allocation
>>>>>> and freeing.
>>>>> I was looking at how you are enqueuing/processing reporting jobs for each zone.
>>>>> I am wondering if I should also consider something on similar lines as having
>>>>> that I might be able to address the concern which you have raised above. But it
>>>>> would also mean that I have to add an additional flag in the zone_flags. :)
>>>> You could do it either in the zone or outside the zone as yet another
>>>> bitmap. I decided to put the flags inside the zone because there was a
>>>> number of free bits there and it should be faster since we were
>>>> already using the zone structure.
>>> There are two possibilities which could happen while I am reporting:
>>> 1. Another request might come in for a different zone.
>>> 2. Another request could come in for the same zone and the pages belong to a
>>>     section of the bitmap which has already been scanned.
>>>
>>> Having a per zone flag to indicate reporting status will solve the first
>>> issue and to an extent the second as well. Having refcnt will possibly solve
>>> both of them. What I am wondering about is that in my case I could easily
>>> impact the performance negatively by performing more bitmap scanning.
>>>
>>>
>> I realized that it may not be possible for me to directly adopt either refcnt
>> or zone flags just because of the way I have page reporting setup right now.
>>
>> For now, I will just replace the refcnt with a bitflag as that should work
>> for most of the cases.  Nevertheless, I will also keep looking for a better way.
> If nothing else something you could consider is a refcnt for the
> number of bits you have set in your bitfield. Then all you would need
> to be doing is replace the cmpxchg with just a atomic_fetch_inc and
> what you would need to do is have your worker thread track how many
> bits it has cleared and subtract that from the refcnt at the end.

Thanks, I will think about this suggestion as well.

Based on your previous suggestion and what you have already proposed in your
series I can think of a way to atleast ensure reporting for zones freeing pages
after getting scanned in wq.
(In case I decide to go ahead with this approach I will mention that this change
is based on your series. Please do let me know if there is a better way to give
credit)

However, a situation where the same zone is reporting pages from the bitmap
section already scanned with zero freeing activity on other zones, may not
be entirely handled.

In any case, I think what I have in my mind will be better than what I have
right now. I will try to implement and test it to see if it can actually work.

-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-12 20:04     ` Nitesh Narayan Lal
@ 2019-08-20 14:11       ` Nitesh Narayan Lal
  0 siblings, 0 replies; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-20 14:11 UTC (permalink / raw)
  To: Alexander Duyck, David Hildenbrand
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck


On 8/12/19 4:04 PM, Nitesh Narayan Lal wrote:
> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>> This patch introduces the core infrastructure for free page reporting in
>>> virtual environments. It enables the kernel to track the free pages which
>>> can be reported to its hypervisor so that the hypervisor could
>>> free and reuse that memory as per its requirement.
>>>
>>> While the pages are getting processed in the hypervisor (e.g.,
>>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>>> would be possible. To avoid such a situation, these pages are
>>> temporarily removed from the buddy. The amount of pages removed
>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>> in our case).
>>>
>>> To efficiently identify free pages that can to be reported to the
>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>> chunks are reported to the hypervisor - especially, to not break up THP
>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>> in the bitmap are an indication whether a page *might* be free, not a
>>> guarantee. A new hook after buddy merging sets the bits.
>>>
>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>> that are still free. The backend (virtio-balloon) is responsible for
>>> reporting these batched pages to the host synchronously. Once reporting/
>>> freeing is complete, isolated pages are returned back to the buddy.
>>>
>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> So if I understand correctly the hotplug support for this is still not
>> included correct? 
>
> That is correct, I have it as an ongoing-item in my cover-email.
> In case, we decide to go with this approach do you think it is a blocker?

I am planning to defer memory hotplug/hotremove support for the future. Due to
following reasons:
* I would like to first get a basic framework ready and merged (in case we
  decide to go ahead with this approach) and then build on top of it.
* Memory hotplug/hotremove is not a primary use case in our mind right now and
  hence I am not considering this as a blocker for the first step.

Following are the items which I intend to address before my next submission:
* Use zone flag and reference counter to track the number of zones requesting
  page reporting.
* Move the bitmap and its respective fields into a structure whose rcu-protected
  pointer object is maintained on a per-zone basis.
* Pick Alexander's patch for page poisoning support and test them with my patch
  set. (@Alexander: I will keep your signed-off for these patches to indicate
  you are the original author, please do let me know there is a better way to
  give credit).
* Address other suggestions/comments received on v12.

Looking forward to any suggestions/comments.


[...]
> +
> +       /* assign the configuration object provided by the backend */
> +       rcu_assign_pointer(page_reporting_conf, phconf);
> +
> +out:
> +       mutex_unlock(&page_reporting_mutex);
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(page_reporting_enable);
> --
> 2.21.0
>
-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-12 18:47   ` Alexander Duyck
                       ` (2 preceding siblings ...)
  2019-08-14 15:49     ` Nitesh Narayan Lal
@ 2019-08-30 15:15     ` Nitesh Narayan Lal
  2019-08-30 15:31       ` Alexander Duyck
  3 siblings, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-30 15:15 UTC (permalink / raw)
  To: Alexander Duyck, David Hildenbrand
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck


On 8/12/19 2:47 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch introduces the core infrastructure for free page reporting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be reported to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
[...]
>> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
>> +                            struct zone *zone)
>> +{
>> +       unsigned long setbit;
>> +       struct page *page;
>> +       int count = 0;
>> +
>> +       sg_init_table(phconf->sg, phconf->max_pages);
>> +
>> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
>> +               /* Process only if the page is still online */
>> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
>> +                                         zone->base_pfn);
>> +               if (!page)
>> +                       continue;
>> +
> Shouldn't you be clearing the bit and dropping the reference to
> free_pages before you move on to the next bit? Otherwise you are going
> to be stuck with those aren't you?
>
>> +               spin_lock(&zone->lock);
>> +
>> +               /* Ensure page is still free and can be processed */
>> +               if (PageBuddy(page) && page_private(page) >=
>> +                   PAGE_REPORTING_MIN_ORDER)
>> +                       count = process_free_page(page, phconf, count);
>> +
>> +               spin_unlock(&zone->lock);
> So I kind of wonder just how much overhead you are taking for bouncing
> the zone lock once per page here. Especially since it can result in
> you not actually making any progress since the page may have already
> been reallocated.
>

I am wondering if there is a way to measure this overhead?
After thinking about this, I do understand your point.
One possible way which I can think of to address this is by having a
page_reporting_dequeue() hook somewhere in the allocation path.
    
For some reason, I am not seeing this work as I would have expected
but I don't have solid reasoning to share yet. It could be simply
because I am putting my hook at the wrong place. I will continue
investigating this.

In any case, I may be over complicating things here, so please let me
if there is a better way to do this.

If this overhead is not significant we can probably live with it.

-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-30 15:15     ` Nitesh Narayan Lal
@ 2019-08-30 15:31       ` Alexander Duyck
  2019-08-30 16:05         ` Nitesh Narayan Lal
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2019-08-30 15:31 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: David Hildenbrand, kvm list, LKML, linux-mm, virtio-dev,
	Paolo Bonzini, lcapitulino, Pankaj Gupta, Wang, Wei W,
	Yang Zhang, Rik van Riel, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, john.starks,
	Dave Hansen, Michal Hocko, cohuck

On Fri, Aug 30, 2019 at 8:15 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 8/12/19 2:47 PM, Alexander Duyck wrote:
> > On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >> This patch introduces the core infrastructure for free page reporting in
> >> virtual environments. It enables the kernel to track the free pages which
> >> can be reported to its hypervisor so that the hypervisor could
> >> free and reuse that memory as per its requirement.
> >>
> >> While the pages are getting processed in the hypervisor (e.g.,
> >> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> >> would be possible. To avoid such a situation, these pages are
> >> temporarily removed from the buddy. The amount of pages removed
> >> temporarily from the buddy is governed by the backend(virtio-balloon
> >> in our case).
> >>
> >> To efficiently identify free pages that can to be reported to the
> >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >> chunks are reported to the hypervisor - especially, to not break up THP
> >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >> in the bitmap are an indication whether a page *might* be free, not a
> >> guarantee. A new hook after buddy merging sets the bits.
> >>
> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >> asynchronously processes the bitmaps, trying to isolate and report pages
> >> that are still free. The backend (virtio-balloon) is responsible for
> >> reporting these batched pages to the host synchronously. Once reporting/
> >> freeing is complete, isolated pages are returned back to the buddy.
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> [...]
> >> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
> >> +                            struct zone *zone)
> >> +{
> >> +       unsigned long setbit;
> >> +       struct page *page;
> >> +       int count = 0;
> >> +
> >> +       sg_init_table(phconf->sg, phconf->max_pages);
> >> +
> >> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
> >> +               /* Process only if the page is still online */
> >> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
> >> +                                         zone->base_pfn);
> >> +               if (!page)
> >> +                       continue;
> >> +
> > Shouldn't you be clearing the bit and dropping the reference to
> > free_pages before you move on to the next bit? Otherwise you are going
> > to be stuck with those aren't you?
> >
> >> +               spin_lock(&zone->lock);
> >> +
> >> +               /* Ensure page is still free and can be processed */
> >> +               if (PageBuddy(page) && page_private(page) >=
> >> +                   PAGE_REPORTING_MIN_ORDER)
> >> +                       count = process_free_page(page, phconf, count);
> >> +
> >> +               spin_unlock(&zone->lock);
> > So I kind of wonder just how much overhead you are taking for bouncing
> > the zone lock once per page here. Especially since it can result in
> > you not actually making any progress since the page may have already
> > been reallocated.
> >
>
> I am wondering if there is a way to measure this overhead?
> After thinking about this, I do understand your point.
> One possible way which I can think of to address this is by having a
> page_reporting_dequeue() hook somewhere in the allocation path.

Really in order to stress this you probably need to have a lot of
CPUs, a lot of memory, and something that forces a lot of pages to get
hit such as the memory shuffling feature.

> For some reason, I am not seeing this work as I would have expected
> but I don't have solid reasoning to share yet. It could be simply
> because I am putting my hook at the wrong place. I will continue
> investigating this.
>
> In any case, I may be over complicating things here, so please let me
> if there is a better way to do this.

I have already been demonstrating the "better way" I think there is to
do this. I will push v7 of it early next week unless there is some
other feedback. By putting the bit in the page and controlling what
comes into and out of the lists it makes most of this quite a bit
easier. The only limitation is you have to modify where things get
placed in the lists so you don't create a "vapor lock" that would
stall the feed of pages into the reporting engine.

> If this overhead is not significant we can probably live with it.

You have bigger issues you still have to overcome as I recall. Didn't
you still need to sort out hotplug and a sparse map with a wide span
in a zone? Without those resolved the bitmap approach is still a no-go
regardless of performance.

- Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-30 15:31       ` Alexander Duyck
@ 2019-08-30 16:05         ` Nitesh Narayan Lal
  2019-09-04  8:40           ` [virtio-dev] " David Hildenbrand
  0 siblings, 1 reply; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-30 16:05 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, kvm list, LKML, linux-mm, virtio-dev,
	Paolo Bonzini, lcapitulino, Pankaj Gupta, Wang, Wei W,
	Yang Zhang, Rik van Riel, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, john.starks,
	Dave Hansen, Michal Hocko, cohuck


On 8/30/19 11:31 AM, Alexander Duyck wrote:
> On Fri, Aug 30, 2019 at 8:15 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> This patch introduces the core infrastructure for free page reporting in
>>>> virtual environments. It enables the kernel to track the free pages which
>>>> can be reported to its hypervisor so that the hypervisor could
>>>> free and reuse that memory as per its requirement.
>>>>
>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>>>> would be possible. To avoid such a situation, these pages are
>>>> temporarily removed from the buddy. The amount of pages removed
>>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>>> in our case).
>>>>
>>>> To efficiently identify free pages that can to be reported to the
>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>
>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>
>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> [...]
>>>> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
>>>> +                            struct zone *zone)
>>>> +{
>>>> +       unsigned long setbit;
>>>> +       struct page *page;
>>>> +       int count = 0;
>>>> +
>>>> +       sg_init_table(phconf->sg, phconf->max_pages);
>>>> +
>>>> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
>>>> +               /* Process only if the page is still online */
>>>> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
>>>> +                                         zone->base_pfn);
>>>> +               if (!page)
>>>> +                       continue;
>>>> +
>>> Shouldn't you be clearing the bit and dropping the reference to
>>> free_pages before you move on to the next bit? Otherwise you are going
>>> to be stuck with those aren't you?
>>>
>>>> +               spin_lock(&zone->lock);
>>>> +
>>>> +               /* Ensure page is still free and can be processed */
>>>> +               if (PageBuddy(page) && page_private(page) >=
>>>> +                   PAGE_REPORTING_MIN_ORDER)
>>>> +                       count = process_free_page(page, phconf, count);
>>>> +
>>>> +               spin_unlock(&zone->lock);
>>> So I kind of wonder just how much overhead you are taking for bouncing
>>> the zone lock once per page here. Especially since it can result in
>>> you not actually making any progress since the page may have already
>>> been reallocated.
>>>
>> I am wondering if there is a way to measure this overhead?
>> After thinking about this, I do understand your point.
>> One possible way which I can think of to address this is by having a
>> page_reporting_dequeue() hook somewhere in the allocation path.
> Really in order to stress this you probably need to have a lot of
> CPUs, a lot of memory, and something that forces a lot of pages to get
> hit such as the memory shuffling feature.

I will think about it, thanks for the suggestion.

>
>> For some reason, I am not seeing this work as I would have expected
>> but I don't have solid reasoning to share yet. It could be simply
>> because I am putting my hook at the wrong place. I will continue
>> investigating this.
>>
>> In any case, I may be over complicating things here, so please let me
>> if there is a better way to do this.
> I have already been demonstrating the "better way" I think there is to
> do this. I will push v7 of it early next week unless there is some
> other feedback. By putting the bit in the page and controlling what
> comes into and out of the lists it makes most of this quite a bit
> easier. The only limitation is you have to modify where things get
> placed in the lists so you don't create a "vapor lock" that would
> stall the feed of pages into the reporting engine.
>
>> If this overhead is not significant we can probably live with it.
> You have bigger issues you still have to overcome as I recall. Didn't
> you still need to sort out hotplu

For memory hotplug, my impression is that it should
not be a blocker for taking the first step (in case we do decide to
go ahead with this approach). Another reason why I am considering
this as future work is that memory hot(un)plug is still under
development and requires fixing. (Specifically, issue such as zone
shrinking which will directly impact the bitmap approach is still
under discussion).

> g and a sparse map with a wide span
> in a zone? Without those resolved the bitmap approach is still a no-go
> regardless of performance.

For sparsity, the memory wastage should not be significant as I
am tracking pages on the granularity of (MAX_ORDER - 2) and maintaining
the bitmaps on a per-zone basis (which was not the case earlier).

However, if you do consider this as a block I will think about it and try to fix it.
In the worst case, if I don't find a solution I will add this as a known limitation
for this approach in my cover.

> - Alex
-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-30 16:05         ` Nitesh Narayan Lal
@ 2019-09-04  8:40           ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2019-09-04  8:40 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	Pankaj Gupta, Wang, Wei W, Yang Zhang, Rik van Riel,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko, cohuck

>>> For some reason, I am not seeing this work as I would have expected
>>> but I don't have solid reasoning to share yet. It could be simply
>>> because I am putting my hook at the wrong place. I will continue
>>> investigating this.
>>>
>>> In any case, I may be over complicating things here, so please let me
>>> if there is a better way to do this.
>> I have already been demonstrating the "better way" I think there is to
>> do this. I will push v7 of it early next week unless there is some
>> other feedback. By putting the bit in the page and controlling what
>> comes into and out of the lists it makes most of this quite a bit
>> easier. The only limitation is you have to modify where things get
>> placed in the lists so you don't create a "vapor lock" that would
>> stall the feed of pages into the reporting engine.
>>
>>> If this overhead is not significant we can probably live with it.
>> You have bigger issues you still have to overcome as I recall. Didn't
>> you still need to sort out hotplu
> 
> For memory hotplug, my impression is that it should
> not be a blocker for taking the first step (in case we do decide to
> go ahead with this approach). Another reason why I am considering
> this as future work is that memory hot(un)plug is still under
> development and requires fixing. (Specifically, issue such as zone
> shrinking which will directly impact the bitmap approach is still
> under discussion).

Memory hotplug is way more reliable than memory unplug - however, but
both are used in production. E.g. the zone shrinking you mention is
something that is not "optimal", but works in most scenarios. So both
features are rather in a "production/fixing" stage than a pure
"development" stage.

However, I do agree that memory hot(un)plug is not something
high-priority for free page hinting in the first shot. If it works out
of the box (Alexander's approach) - great - if not it is just one
additional scenario where free page hinting might not be optimal yet
(like vfio where we can't use it completely right now). After all, most
virtual environment (under KVM) I am aware of don't use DIMM-base memory
hot(un)plug at all.

The important part is that it will not break when memory is added/removed.

> 
>> g and a sparse map with a wide span
>> in a zone? Without those resolved the bitmap approach is still a no-go
>> regardless of performance.
> 
> For sparsity, the memory wastage should not be significant as I
> am tracking pages on the granularity of (MAX_ORDER - 2) and maintaining
> the bitmaps on a per-zone basis (which was not the case earlier).

To handle sparsity one would have to have separate bitmaps for each
section. Roughly 64 bit per 128MB section. Scanning the scattered bitmap
would get more expensive. Of course, one could have a hierarchical
bitmap to speed up the search etc.

But again, I think we should have a look how much of an issue sparse
zones/nodes actually are in virtual machines (KVM). The setups I am
aware of minimize sparsity (e.g., QEMU will place DIMMs consecutively in
memory, only aligning to e.g, 128MB on x86 if required). Usually all
memory in VMs is onlined to the NORMAL zone (except special cases of
course). The only "issue" would be if you start mixing DIMMs of
different nodes when hotplugging memory. The overhead for 1bit per 2MB
could be almost neglectable in most setups.

But I do agree that for the bitmap-based approach there might be more
scenarios where you'd rather not want to use free page hinting and
instead disable it.

> 
> However, if you do consider this as a block I will think about it and try to fix it.
> In the worst case, if I don't find a solution I will add this as a known limitation
> for this approach in my cover.
> 
>> - Alex


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH v12 0/2] mm: Support for page reporting
  2019-08-12 13:12 [RFC][PATCH v12 0/2] mm: Support for page reporting Nitesh Narayan Lal
                   ` (2 preceding siblings ...)
  2019-08-12 13:13 ` [QEMU Patch 1/2] virtio-balloon: adding bit for page reporting support Nitesh Narayan Lal
@ 2019-09-11 12:30 ` David Hildenbrand
  3 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2019-09-11 12:30 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, virtio-dev,
	pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel,
	mst, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck,
	john.starks, dave.hansen, mhocko, cohuck

On 12.08.19 15:12, Nitesh Narayan Lal wrote:
> This patch series proposes an efficient mechanism for reporting free memory
> from a guest to its hypervisor. It especially enables guests with no page cache
> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
> rapidly hand back free memory to the hypervisor.
> This approach has a minimal impact on the existing core-mm infrastructure.
> 
> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
> A new hook after buddy merging is used to set the bits in the bitmap for a freed 
> page. Each set bit is cleared after they are processed/checked for
> re-allocation.
> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
> threshold is met, trying to isolate and report pages that are still free.
> The isolated pages are stored in a scatterlist and are reported via
> virtio-balloon, which is responsible for sending batched pages to the
> hypervisor. Once the hypervisor processed the reporting request, the isolated
> pages are returned back to the buddy.
> The thershold which defines the number of pages which will be isolated and
> reported to the hypervisor at a time is currently hardcoded to 16 in the guest.
> 
> Benefit analysis:
> Number of 5 GB guests (each touching 4 to 5 GB memory) that can be launched on a
> 15 GB single NUMA system without using swap space in the host.
> 
> 	    Guest kernel-->	Unmodified		with v12 page reporting
> 	Number of guests-->	    2				7
> 
> Conclusion: In a page-reporting enabled kernel, the guest is able to report
> most of its unused memory back to the host. Due to this on the same host, I was
> able to launch 7 guests without touching any swap compared to 2 which were
> launched with an unmodified kernel.
> 
> Performance Analysis:
> In order to measure the performance impact of this patch-series over an
> unmodified kernel, I am using will-it-scale/page_fault1 on a 30 GB, 24 vcpus
> single NUMA guest which is affined to a single node in the host. Over several
> runs, I observed that with this patch-series there is a degradation of around
> 1-3% for certain cases. This degradation could be a result of page-zeroing
> overhead which comes with every page-fault in the guest.
> I also tried this test on a 2 NUMA node host running page reporting
> enabled 60GB guest also having 2 NUMA nodes and 24 vcpus. I observed a similar
> degradation of around 1-3% in most of the cases.
> For certain cases, the variability even with an unmodified kernel was around
> 4-6% with every fresh boot. I will continue to investigate this further to find
> the reason behind it.
> 
> Ongoing work-items:
> * I have a working prototype for supporting memory hotplug/hotremove with page
>   reporting. However, it still requires more testing and fixes specifically on
>   the hotremove side.
>   Right now, for any memory hotplug or hotremove request bitmap or its
>   respective fields are not changed. Hence, memory added via hotplug is not
>   tracked in the bitmap. Similarly, removed memory is not reported to the
>   hypervisor by using an online memory check. 
> * I will also have to look into the details about how to handle page poisoning
>   scenarios and test with directly assigned devices.
> 
> 
> Changes from v11:
> https://lkml.org/lkml/2019/7/10/742
> * Moved the fields required to manage bitmap of free pages to 'struct zone'.
> * Replaced the list which was used to hold and report the free pages with
>   scatterlist.
> * Tried to fix the anti-kernel patterns and improve overall code quality.
> * Fixed a few bugs in the code which were reported in the last posting.
> * Moved to use MADV_DONTNEED from MADV_FREE.
> * Replaced page hinting in favor of page reporting.
> * Addressed other comments which I received in the last posting.	
> 
> 
> Changes from v10:
> https://lkml.org/lkml/2019/6/3/943
> * Added logic to take care of multiple NUMA nodes scenarios.
> * Simplified the logic for reporting isolated pages to the host. (Eg. replaced
>   dynamically allocated arrays with static ones, introduced wait event instead
>   of the loop in order to wait for a response from the host)
> * Added a mutex to prevent race condition when page reporting is enabled by
>   multiple drivers.
> * Simplified the logic responsible for decrementing free page counter for each
>   zone.
> * Simplified code structuring/naming.
>  

Some current limitations of this patchset seem to be

1. Sparse zones eventually wasting memory (1bit per 2MB).

As I already set, I consider this in most virtual environments a special
case (especially a lot of sparsity). You can simply compare the spanned
vs. present pages and don't allocate a bitmap in case it's too sparse
("currently unsupported environment"). These pieces won't be considered
for free page reporting, however free page reporting is a pure
optimization already either way. We can be smarter in the future (split
up bitmap into sub-bitmaps ...)

2. Memory hot(un)plug support

Memory hotplug should be easy with the memory hotplug notifier. Resize
bitmaps after hotplug if required. Hotunplug is tricky, as it depends on
zone shrinking (shrink bitmaps after offlining). You could scan for
actually online section manually. But with minor modifications after
"[PATCH v4 0/8] mm/memory_hotplug: Shrink zones before removing memory",
at least some cases could also be handled. (sparse handling similar to
1). Of course, initially, you could also simply not try to shrink the
bitmap on unplug ...

3. Scanning speed

I have no idea if that is actually an issue. But there are different
options if it is, for example, a hierarchical bitmap.


Besides these, I think there were other review comments that should be
addressed, but they don't seem to target the concept but rather
implementation details.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-08-12 13:12 ` [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure Nitesh Narayan Lal
  2019-08-12 18:47   ` Alexander Duyck
@ 2019-10-10 20:36   ` Alexander Duyck
  2019-10-11 11:02     ` Nitesh Narayan Lal
  1 sibling, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2019-10-10 20:36 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	Pankaj Gupta, Wang, Wei W, Yang Zhang, Rik van Riel,
	David Hildenbrand, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, john.starks,
	Dave Hansen, Michal Hocko, cohuck

On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>

<snip>

> +static int process_free_page(struct page *page,
> +                            struct page_reporting_config *phconf, int count)
> +{
> +       int mt, order, ret = 0;
> +
> +       mt = get_pageblock_migratetype(page);
> +       order = page_private(page);
> +       ret = __isolate_free_page(page, order);
> +
> +       if (ret) {
> +               /*
> +                * Preserving order and migratetype for reuse while
> +                * releasing the pages back to the buddy.
> +                */
> +               set_pageblock_migratetype(page, mt);
> +               set_page_private(page, order);
> +
> +               sg_set_page(&phconf->sg[count++], page,
> +                           PAGE_SIZE << order, 0);
> +       }
> +
> +       return count;
> +}
> +
> +/**
> + * scan_zone_bitmap - scans the bitmap for the requested zone.
> + * @phconf: page reporting configuration object initialized by the backend.
> + * @zone: zone for which page reporting is requested.
> + *
> + * For every page marked in the bitmap it checks if it is still free if so it
> + * isolates and adds them to a scatterlist. As soon as the number of isolated
> + * pages reach the threshold set by the backend, they are reported to the
> + * hypervisor by the backend. Once the hypervisor responds after processing
> + * they are returned back to the buddy for reuse.
> + */
> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
> +                            struct zone *zone)
> +{
> +       unsigned long setbit;
> +       struct page *page;
> +       int count = 0;
> +
> +       sg_init_table(phconf->sg, phconf->max_pages);
> +
> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
> +               /* Process only if the page is still online */
> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
> +                                         zone->base_pfn);
> +               if (!page)
> +                       continue;
> +
> +               spin_lock(&zone->lock);
> +
> +               /* Ensure page is still free and can be processed */
> +               if (PageBuddy(page) && page_private(page) >=
> +                   PAGE_REPORTING_MIN_ORDER)
> +                       count = process_free_page(page, phconf, count);
> +
> +               spin_unlock(&zone->lock);
> +               /* Page has been processed, adjust its bit and zone counter */
> +               clear_bit(setbit, zone->bitmap);
> +               atomic_dec(&zone->free_pages);
> +
> +               if (count == phconf->max_pages) {
> +                       /* Report isolated pages to the hypervisor */
> +                       phconf->report(phconf, count);
> +
> +                       /* Return processed pages back to the buddy */
> +                       return_isolated_page(zone, phconf);
> +
> +                       /* Reset for next reporting */
> +                       sg_init_table(phconf->sg, phconf->max_pages);
> +                       count = 0;
> +               }
> +       }
> +       /*
> +        * If the number of isolated pages does not meet the max_pages
> +        * threshold, we would still prefer to report them as we have already
> +        * isolated them.
> +        */
> +       if (count) {
> +               sg_mark_end(&phconf->sg[count - 1]);
> +               phconf->report(phconf, count);
> +
> +               return_isolated_page(zone, phconf);
> +       }
> +}
> +

So one thing that occurred to me is that this code is missing checks
so that it doesn't try to hint isolated pages. With the bitmap
approach you need an additional check so that you aren't pulling
isolated pages out and reporting them.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
  2019-10-10 20:36   ` Alexander Duyck
@ 2019-10-11 11:02     ` Nitesh Narayan Lal
  0 siblings, 0 replies; 35+ messages in thread
From: Nitesh Narayan Lal @ 2019-10-11 11:02 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, virtio-dev, Paolo Bonzini, lcapitulino,
	Pankaj Gupta, Wang, Wei W, Yang Zhang, Rik van Riel,
	David Hildenbrand, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, john.starks,
	Dave Hansen, Michal Hocko, cohuck

On 10/10/19 4:36 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> <snip>
>
>> +static int process_free_page(struct page *page,
>> +                            struct page_reporting_config *phconf, int count)
>> +{
>> +       int mt, order, ret = 0;
>> +
>> +       mt = get_pageblock_migratetype(page);
>> +       order = page_private(page);
>> +       ret = __isolate_free_page(page, order);
>> +
>> +       if (ret) {
>> +               /*
>> +                * Preserving order and migratetype for reuse while
>> +                * releasing the pages back to the buddy.
>> +                */
>> +               set_pageblock_migratetype(page, mt);
>> +               set_page_private(page, order);
>> +
>> +               sg_set_page(&phconf->sg[count++], page,
>> +                           PAGE_SIZE << order, 0);
>> +       }
>> +
>> +       return count;
>> +}
>> +
>> +/**
>> + * scan_zone_bitmap - scans the bitmap for the requested zone.
>> + * @phconf: page reporting configuration object initialized by the backend.
>> + * @zone: zone for which page reporting is requested.
>> + *
>> + * For every page marked in the bitmap it checks if it is still free if so it
>> + * isolates and adds them to a scatterlist. As soon as the number of isolated
>> + * pages reach the threshold set by the backend, they are reported to the
>> + * hypervisor by the backend. Once the hypervisor responds after processing
>> + * they are returned back to the buddy for reuse.
>> + */
>> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
>> +                            struct zone *zone)
>> +{
>> +       unsigned long setbit;
>> +       struct page *page;
>> +       int count = 0;
>> +
>> +       sg_init_table(phconf->sg, phconf->max_pages);
>> +
>> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
>> +               /* Process only if the page is still online */
>> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
>> +                                         zone->base_pfn);
>> +               if (!page)
>> +                       continue;
>> +
>> +               spin_lock(&zone->lock);
>> +
>> +               /* Ensure page is still free and can be processed */
>> +               if (PageBuddy(page) && page_private(page) >=
>> +                   PAGE_REPORTING_MIN_ORDER)
>> +                       count = process_free_page(page, phconf, count);
>> +
>> +               spin_unlock(&zone->lock);
>> +               /* Page has been processed, adjust its bit and zone counter */
>> +               clear_bit(setbit, zone->bitmap);
>> +               atomic_dec(&zone->free_pages);
>> +
>> +               if (count == phconf->max_pages) {
>> +                       /* Report isolated pages to the hypervisor */
>> +                       phconf->report(phconf, count);
>> +
>> +                       /* Return processed pages back to the buddy */
>> +                       return_isolated_page(zone, phconf);
>> +
>> +                       /* Reset for next reporting */
>> +                       sg_init_table(phconf->sg, phconf->max_pages);
>> +                       count = 0;
>> +               }
>> +       }
>> +       /*
>> +        * If the number of isolated pages does not meet the max_pages
>> +        * threshold, we would still prefer to report them as we have already
>> +        * isolated them.
>> +        */
>> +       if (count) {
>> +               sg_mark_end(&phconf->sg[count - 1]);
>> +               phconf->report(phconf, count);
>> +
>> +               return_isolated_page(zone, phconf);
>> +       }
>> +}
>> +
> So one thing that occurred to me is that this code is missing checks
> so that it doesn't try to hint isolated pages. With the bitmap
> approach you need an additional check so that you aren't pulling
> isolated pages out and reporting them.

I think you mean that we should not report pages of type MIGRATE_ISOLATE.

The current code on which I am working, I have added the
is_migrate_isolate_page() check
to ensure that I am not processing these pages.

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2019-10-11 11:03 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-12 13:12 [RFC][PATCH v12 0/2] mm: Support for page reporting Nitesh Narayan Lal
2019-08-12 13:12 ` [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure Nitesh Narayan Lal
2019-08-12 18:47   ` Alexander Duyck
2019-08-12 20:04     ` Nitesh Narayan Lal
2019-08-20 14:11       ` Nitesh Narayan Lal
2019-08-12 20:05     ` David Hildenbrand
2019-08-13 10:30       ` Nitesh Narayan Lal
2019-08-13 10:34         ` David Hildenbrand
2019-08-13 10:42           ` Nitesh Narayan Lal
2019-08-13 10:44             ` David Hildenbrand
2019-08-13 23:14           ` Alexander Duyck
2019-08-14  7:07             ` David Hildenbrand
2019-08-14 12:49               ` [virtio-dev] " Nitesh Narayan Lal
2019-08-14 15:49     ` Nitesh Narayan Lal
2019-08-14 16:11       ` Alexander Duyck
2019-08-15 13:15         ` Nitesh Narayan Lal
2019-08-15 19:22           ` Nitesh Narayan Lal
2019-08-15 23:00             ` Alexander Duyck
2019-08-16 18:35               ` Nitesh Narayan Lal
2019-08-30 15:15     ` Nitesh Narayan Lal
2019-08-30 15:31       ` Alexander Duyck
2019-08-30 16:05         ` Nitesh Narayan Lal
2019-09-04  8:40           ` [virtio-dev] " David Hildenbrand
2019-10-10 20:36   ` Alexander Duyck
2019-10-11 11:02     ` Nitesh Narayan Lal
2019-08-12 13:12 ` [RFC][Patch v12 2/2] virtio-balloon: interface to support free page reporting Nitesh Narayan Lal
2019-08-14 10:29   ` Cornelia Huck
2019-08-14 11:47     ` Nitesh Narayan Lal
2019-08-14 13:42       ` Cornelia Huck
2019-08-14 14:01         ` Nitesh Narayan Lal
2019-08-12 13:13 ` [QEMU Patch 1/2] virtio-balloon: adding bit for page reporting support Nitesh Narayan Lal
2019-08-12 13:13   ` [QEMU Patch 2/2] virtio-balloon: support for handling page reporting Nitesh Narayan Lal
2019-08-12 15:18     ` Alexander Duyck
2019-08-12 15:26       ` Nitesh Narayan Lal
2019-09-11 12:30 ` [RFC][PATCH v12 0/2] mm: Support for " David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).