On 6/3/19 3:57 PM, David Hildenbrand wrote: > On 03.06.19 19:03, Nitesh Narayan Lal wrote: >> This patch introduces the core infrastructure for free page hinting in >> virtual environments. It enables the kernel to track the free pages which >> can be reported to its hypervisor so that the hypervisor could >> free and reuse that memory as per its requirement. >> >> While the pages are getting processed in the hypervisor (e.g., >> via MADV_FREE), the guest must not use them, otherwise, data loss >> would be possible. To avoid such a situation, these pages are >> temporarily removed from the buddy. The amount of pages removed >> temporarily from the buddy is governed by the backend(virtio-balloon >> in our case). >> >> To efficiently identify free pages that can to be hinted to the >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big >> chunks are reported to the hypervisor - especially, to not break up THP >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits >> in the bitmap are an indication whether a page *might* be free, not a >> guarantee. A new hook after buddy merging sets the bits. >> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue >> asynchronously processes the bitmaps, trying to isolate and report pages >> that are still free. The backend (virtio-balloon) is responsible for >> reporting these batched pages to the host synchronously. Once reporting/ >> freeing is complete, isolated pages are returned back to the buddy. >> >> There are still various things to look into (e.g., memory hotplug, more >> efficient locking, possible races when disabling). >> >> Signed-off-by: Nitesh Narayan Lal >> --- >> drivers/virtio/Kconfig | 1 + >> include/linux/page_hinting.h | 46 +++++++ >> mm/Kconfig | 6 + >> mm/Makefile | 2 + >> mm/page_alloc.c | 17 +-- >> mm/page_hinting.c | 236 +++++++++++++++++++++++++++++++++++ >> 6 files changed, 301 insertions(+), 7 deletions(-) >> create mode 100644 include/linux/page_hinting.h >> create mode 100644 mm/page_hinting.c >> >> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig >> index 35897649c24f..5a96b7a2ed1e 100644 >> --- a/drivers/virtio/Kconfig >> +++ b/drivers/virtio/Kconfig >> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON >> tristate "Virtio balloon driver" >> depends on VIRTIO >> select MEMORY_BALLOON >> + select PAGE_HINTING >> ---help--- >> This driver supports increasing and decreasing the amount >> of memory within a KVM guest. >> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h >> new file mode 100644 >> index 000000000000..e65188fe1e6b >> --- /dev/null >> +++ b/include/linux/page_hinting.h >> @@ -0,0 +1,46 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> +#ifndef _LINUX_PAGE_HINTING_H >> +#define _LINUX_PAGE_HINTING_H >> + >> +/* >> + * Minimum page order required for a page to be hinted to the host. >> + */ >> +#define PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2) >> + >> +/* >> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup >> + * isolated pages. >> + * @prepare: Callback responsible for allocating an array to hold >> + * the isolated pages. >> + * @hint_pages: Callback which reports the isolated pages synchornously >> + * to the host. >> + * @cleanup: Callback to free the the array used for reporting the >> + * isolated pages. >> + * @max_pages: Maxmimum pages that are going to be hinted to the host >> + * at a time of granularity >= PAGE_HINTING_MIN_ORDER. >> + */ >> +struct page_hinting_cb { >> + int (*prepare)(void); >> + void (*hint_pages)(struct list_head *list); >> + void (*cleanup)(void); >> + int max_pages; > If we allocate the array in virtio-balloon differently (e.g. similar to > bulk inflation/deflation of pfn's right now), we can most probably get > rid of prepare() and cleanup(), simplifying the code further. Yeap, as Alexander suggested. I will go for static allocation and remove this prepare() and cleanup(). >> +}; >> + >> +#ifdef CONFIG_PAGE_HINTING >> +void page_hinting_enqueue(struct page *page, int order); >> +void page_hinting_enable(const struct page_hinting_cb *cb); >> +void page_hinting_disable(void); >> +#else >> +static inline void page_hinting_enqueue(struct page *page, int order) >> +{ >> +} >> + >> +static inline void page_hinting_enable(struct page_hinting_cb *cb) >> +{ >> +} >> + >> +static inline void page_hinting_disable(void) >> +{ >> +} >> +#endif >> +#endif /* _LINUX_PAGE_HINTING_H */ >> diff --git a/mm/Kconfig b/mm/Kconfig >> index ee8d1f311858..177d858de758 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -764,4 +764,10 @@ config GUP_BENCHMARK >> config ARCH_HAS_PTE_SPECIAL >> bool >> >> +# PAGE_HINTING will allow the guest to report the free pages to the >> +# host in regular interval of time. >> +config PAGE_HINTING >> + bool >> + def_bool n >> + depends on X86_64 >> endmenu >> diff --git a/mm/Makefile b/mm/Makefile >> index ac5e5ba78874..bec456dfee34 100644 >> --- a/mm/Makefile >> +++ b/mm/Makefile >> @@ -41,6 +41,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ >> interval_tree.o list_lru.o workingset.o \ >> debug.o $(mmu-y) >> >> + >> # Give 'page_alloc' its own module-parameter namespace >> page-alloc-y := page_alloc.o >> page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o >> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD) += z3fold.o >> obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o >> obj-$(CONFIG_CMA) += cma.o >> obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o >> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o >> obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o >> obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o >> obj-$(CONFIG_USERFAULTFD) += userfaultfd.o >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 3b13d3914176..d12f69e0e402 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -68,6 +68,7 @@ >> #include >> #include >> #include >> +#include >> >> #include >> #include >> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page, >> * -- nyc >> */ >> >> -static inline void __free_one_page(struct page *page, >> +inline void __free_one_page(struct page *page, >> unsigned long pfn, >> struct zone *zone, unsigned int order, >> - int migratetype) >> + int migratetype, bool hint) >> { >> unsigned long combined_pfn; >> unsigned long uninitialized_var(buddy_pfn); >> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page, >> done_merging: >> set_page_order(page, order); >> >> + if (hint) >> + page_hinting_enqueue(page, order); >> /* >> * If this is not the largest possible page, check if the buddy >> * of the next-highest order is free. If it is, it's possible >> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, >> if (unlikely(isolated_pageblocks)) >> mt = get_pageblock_migratetype(page); >> >> - __free_one_page(page, page_to_pfn(page), zone, 0, mt); >> + __free_one_page(page, page_to_pfn(page), zone, 0, mt, true); >> trace_mm_page_pcpu_drain(page, 0, mt); >> } >> spin_unlock(&zone->lock); >> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count, >> static void free_one_page(struct zone *zone, >> struct page *page, unsigned long pfn, >> unsigned int order, >> - int migratetype) >> + int migratetype, bool hint) >> { >> spin_lock(&zone->lock); >> if (unlikely(has_isolate_pageblock(zone) || >> is_migrate_isolate(migratetype))) { >> migratetype = get_pfnblock_migratetype(page, pfn); >> } >> - __free_one_page(page, pfn, zone, order, migratetype); >> + __free_one_page(page, pfn, zone, order, migratetype, hint); >> spin_unlock(&zone->lock); >> } >> >> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order) >> migratetype = get_pfnblock_migratetype(page, pfn); >> local_irq_save(flags); >> __count_vm_events(PGFREE, 1 << order); >> - free_one_page(page_zone(page), page, pfn, order, migratetype); >> + free_one_page(page_zone(page), page, pfn, order, migratetype, true); >> local_irq_restore(flags); >> } >> >> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) >> */ >> if (migratetype >= MIGRATE_PCPTYPES) { >> if (unlikely(is_migrate_isolate(migratetype))) { >> - free_one_page(zone, page, pfn, 0, migratetype); >> + free_one_page(zone, page, pfn, 0, migratetype, true); >> return; >> } >> migratetype = MIGRATE_MOVABLE; >> diff --git a/mm/page_hinting.c b/mm/page_hinting.c >> new file mode 100644 >> index 000000000000..7341c6462de2 >> --- /dev/null >> +++ b/mm/page_hinting.c >> @@ -0,0 +1,236 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Page hinting support to enable a VM to report the freed pages back >> + * to the host. >> + * >> + * Copyright Red Hat, Inc. 2019 >> + * >> + * Author(s): Nitesh Narayan Lal >> + */ >> + >> +#include >> +#include >> +#include >> +#include >> + >> +/* >> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs >> + * and other required parameters which could help in retrieving the original >> + * PFN value using the bitmap. >> + * @bitmap: Pointer to the bitmap of free PFN. >> + * @base_pfn: Starting PFN value for the zone whose bitmap is stored. >> + * @free_pages: Tracks the number of free pages of granularity >> + * PAGE_HINTING_MIN_ORDER. >> + * @nbits: Indicates the total size of the bitmap in bits allocated >> + * at the time of initialization. >> + */ >> +struct hinting_bitmap { >> + unsigned long *bitmap; >> + unsigned long base_pfn; >> + atomic_t free_pages; >> + unsigned long nbits; >> +} bm_zone[MAX_NR_ZONES]; >> + >> +static void init_hinting_wq(struct work_struct *work); >> +extern int __isolate_free_page(struct page *page, unsigned int order); >> +extern void __free_one_page(struct page *page, unsigned long pfn, >> + struct zone *zone, unsigned int order, >> + int migratetype, bool hint); >> +const struct page_hinting_cb *hcb; >> +struct work_struct hinting_work; >> + >> +static unsigned long find_bitmap_size(struct zone *zone) >> +{ >> + unsigned long nbits = ALIGN(zone->spanned_pages, >> + PAGE_HINTING_MIN_ORDER); >> + >> + nbits = nbits >> PAGE_HINTING_MIN_ORDER; >> + return nbits; > I think we can simplify this to > > return (zone->spanned_pages >> PAGE_HINTING_MIN_ORDER) + 1; > I will check this. >> +} >> + >> +void page_hinting_enable(const struct page_hinting_cb *callback) >> +{ >> + struct zone *zone; >> + int idx = 0; >> + unsigned long bitmap_size = 0; > You should probably protect enabling via a mutex and return -EINVAL or > similar if we already have a callback set (if we ever have different > drivers). But this has very little priority :) I will have to look into this. >> + >> + for_each_populated_zone(zone) { >> + spin_lock(&zone->lock); >> + bitmap_size = find_bitmap_size(zone); >> + bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL); >> + if (!bm_zone[idx].bitmap) >> + return; >> + bm_zone[idx].nbits = bitmap_size; >> + bm_zone[idx].base_pfn = zone->zone_start_pfn; >> + spin_unlock(&zone->lock); >> + idx++; >> + } >> + hcb = callback; >> + INIT_WORK(&hinting_work, init_hinting_wq); > There are also possible races when enabling, you will have to take care > of at one point. No page will be enqueued until hcb is not set. I can probably move it below INIT_WORK. >> +} >> +EXPORT_SYMBOL_GPL(page_hinting_enable); >> + >> +void page_hinting_disable(void) >> +{ >> + struct zone *zone; >> + int idx = 0; >> + >> + cancel_work_sync(&hinting_work); >> + hcb = NULL; >> + for_each_populated_zone(zone) { >> + spin_lock(&zone->lock); >> + bitmap_free(bm_zone[idx].bitmap); >> + bm_zone[idx].base_pfn = 0; >> + bm_zone[idx].nbits = 0; >> + atomic_set(&bm_zone[idx].free_pages, 0); >> + spin_unlock(&zone->lock); >> + idx++; >> + } >> +} >> +EXPORT_SYMBOL_GPL(page_hinting_disable); >> + >> +static unsigned long pfn_to_bit(struct page *page, int zonenum) >> +{ >> + unsigned long bitnr; >> + >> + bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn) >> + >> PAGE_HINTING_MIN_ORDER; >> + return bitnr; >> +} >> + >> +static void release_buddy_pages(struct list_head *pages) > maybe "release_isolated_pages", not sure. > >> +{ >> + int mt = 0, zonenum, order; >> + struct page *page, *next; >> + struct zone *zone; >> + unsigned long bitnr; >> + >> + list_for_each_entry_safe(page, next, pages, lru) { >> + zonenum = page_zonenum(page); >> + zone = page_zone(page); >> + bitnr = pfn_to_bit(page, zonenum); >> + spin_lock(&zone->lock); >> + list_del(&page->lru); >> + order = page_private(page); >> + set_page_private(page, 0); >> + mt = get_pageblock_migratetype(page); >> + __free_one_page(page, page_to_pfn(page), zone, >> + order, mt, false); >> + spin_unlock(&zone->lock); >> + } >> +} >> + >> +static void bm_set_pfn(struct page *page) >> +{ >> + unsigned long bitnr = 0; >> + int zonenum = page_zonenum(page); >> + struct zone *zone = page_zone(page); >> + >> + lockdep_assert_held(&zone->lock); >> + bitnr = pfn_to_bit(page, zonenum); >> + if (bm_zone[zonenum].bitmap && >> + bitnr < bm_zone[zonenum].nbits && >> + !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap)) >> + atomic_inc(&bm_zone[zonenum].free_pages); >> +} >> + >> +static void scan_hinting_bitmap(int zonenum, int free_pages) >> +{ >> + unsigned long set_bit, start = 0; >> + struct page *page; >> + struct zone *zone; >> + int scanned_pages = 0, ret = 0, order, isolated_cnt = 0; >> + LIST_HEAD(isolated_pages); >> + >> + ret = hcb->prepare(); >> + if (ret < 0) >> + return; >> + for (;;) { >> + ret = 0; >> + set_bit = find_next_bit(bm_zone[zonenum].bitmap, >> + bm_zone[zonenum].nbits, start); >> + if (set_bit >= bm_zone[zonenum].nbits) >> + break; >> + page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) + >> + bm_zone[zonenum].base_pfn); >> + if (!page) >> + continue; > You are not clearing the bit / decrementing the counter. > >> + zone = page_zone(page); >> + spin_lock(&zone->lock); >> + >> + if (PageBuddy(page) && page_private(page) >= >> + PAGE_HINTING_MIN_ORDER) { >> + order = page_private(page); >> + ret = __isolate_free_page(page, order); >> + } >> + clear_bit(set_bit, bm_zone[zonenum].bitmap); >> + spin_unlock(&zone->lock); >> + if (ret) { >> + /* >> + * restoring page order to use it while releasing >> + * the pages back to the buddy. >> + */ >> + set_page_private(page, order); >> + list_add_tail(&page->lru, &isolated_pages); >> + isolated_cnt++; >> + if (isolated_cnt == hcb->max_pages) { >> + hcb->hint_pages(&isolated_pages); >> + release_buddy_pages(&isolated_pages); >> + isolated_cnt = 0; >> + } >> + } >> + start = set_bit + 1; >> + scanned_pages++; >> + } >> + if (isolated_cnt) { >> + hcb->hint_pages(&isolated_pages); >> + release_buddy_pages(&isolated_pages); >> + } >> + hcb->cleanup(); >> + if (scanned_pages > free_pages) >> + atomic_sub((scanned_pages - free_pages), >> + &bm_zone[zonenum].free_pages); > This looks overly complicated. Can't we somehow simply decrement when > clearing a bit? > >> +} >> + >> +static bool check_hinting_threshold(void) >> +{ >> + int zonenum = 0; >> + >> + for (; zonenum < MAX_NR_ZONES; zonenum++) { >> + if (atomic_read(&bm_zone[zonenum].free_pages) >= >> + hcb->max_pages) >> + return true; >> + } >> + return false; >> +} >> + >> +static void init_hinting_wq(struct work_struct *work) >> +{ >> + int zonenum = 0, free_pages = 0; >> + >> + for (; zonenum < MAX_NR_ZONES; zonenum++) { >> + free_pages = atomic_read(&bm_zone[zonenum].free_pages); >> + if (free_pages >= hcb->max_pages) { >> + /* Find a better way to synchronize per zone >> + * free_pages. >> + */ >> + atomic_sub(free_pages, >> + &bm_zone[zonenum].free_pages); > I can't follow yet why we need that information. We don't want to enqueue multiple jobs just because we are delaying the decrementing of free_pages. I agree, even I am not convinced with the approach which I have right now. I will try to come up with a better way. > Wouldn't it be enough > to just track the number of set bits in the bitmap and start hinting > depending on that count? (there are false positives, but do we really care?) In an attempt to minimize the false positives, I might have overly complicated this part. I will try to simplify this in the next posting. > >> + scan_hinting_bitmap(zonenum, free_pages); >> + } >> + } >> +} >> + >> +void page_hinting_enqueue(struct page *page, int order) >> +{ >> + if (hcb && order >= PAGE_HINTING_MIN_ORDER) >> + bm_set_pfn(page); >> + else >> + return; >> + >> + if (check_hinting_threshold()) { >> + int cpu = smp_processor_id(); >> + >> + queue_work_on(cpu, system_wq, &hinting_work); >> + } >> +} >> > -- Regards Nitesh