Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

From: Nitesh Narayan Lal <nitesh@redhat.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: kvm list <kvm@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	virtio-dev@lists.oasis-open.org,
	Paolo Bonzini <pbonzini@redhat.com>,
	lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com,
	Yang Zhang <yang.zhang.wz@gmail.com>,
	Rik van Riel <riel@surriel.com>,
	David Hildenbrand <david@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	dodgen@google.com, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	dhildenb@redhat.com, Andrea Arcangeli <aarcange@redhat.com>,
	john.starks@microsoft.com, Dave Hansen <dave.hansen@intel.com>,
	Michal Hocko <mhocko@suse.com>,
	cohuck@redhat.com
Subject: Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure
Date: Mon, 12 Aug 2019 16:04:08 -0400	[thread overview]
Message-ID: <b5d4ee25-ae3e-f012-d7f2-7a27d7bbcc54@redhat.com> (raw)
In-Reply-To: <CAKgT0UcSabyrO=jUwq10KpJKLSuzorHDnKAGrtWVigKVgvD-6Q@mail.gmail.com>

On 8/12/19 2:47 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch introduces the core infrastructure for free page reporting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be reported to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> So if I understand correctly the hotplug support for this is still not
> included correct? 

That is correct, I have it as an ongoing-item in my cover-email.
In case, we decide to go with this approach do you think it is a blocker?

> I assume that is the case since I don't see any
> logic for zone resizing.
>
> Also I don't see how this dealt with the sparse issue that was pointed
> out earlier. Specifically how would you deal with a zone that has a
> wide range between the base and the end and a huge gap somewhere
> in-between?

It doesn't. However, considering we are tracking page on order of (MAX_ORDER -
2) I don't think the loss will be significant.
In any case, this is certainly a drawback of this approach and I should add this
in my cover.
The one thing which I did change in this version is that now I started
maintaining bitmap for each zone.

>
>> ---
>>  include/linux/mmzone.h         |  11 ++
>>  include/linux/page_reporting.h |  63 +++++++
>>  mm/Kconfig                     |   6 +
>>  mm/Makefile                    |   1 +
>>  mm/page_alloc.c                |  42 ++++-
>>  mm/page_reporting.c            | 332 +++++++++++++++++++++++++++++++++
>>  6 files changed, 448 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_reporting.h
>>  create mode 100644 mm/page_reporting.c
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index d77d717c620c..ba5f5b508f25 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -559,6 +559,17 @@ struct zone {
>>         /* Zone statistics */
>>         atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
>>         atomic_long_t           vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>> +#ifdef CONFIG_PAGE_REPORTING
>> +       /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
>> +       unsigned long *bitmap;
>> +       /* Preserve start and end PFN in case they change due to hotplug */
>> +       unsigned long base_pfn;
>> +       unsigned long end_pfn;
>> +       /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
>> +       atomic_t free_pages;
>> +       /* Number of bits required in the bitmap */
>> +       unsigned long nbits;
>> +#endif
>>  } ____cacheline_internodealigned_in_smp;
> Okay, so the original thing this patch set had going for it was that
> it was non-invasive. However, now you are adding a bunch of stuff to
> the zone. That kind of loses the non-invasive argument for this patch
> set compared to mine.

I think it is fair to add that it not as invasive as yours. :) (But that has its
own pros and cons)
In any case, I do understand your point.

>
>
> If we are going to continue further with this patch set then it might
> be worth looking into dynamically allocating the space you need for
> this block.

Not sure if I understood this part. Dynamic allocation for the structure which
you are suggesting below?

>  At a minimum you could probably look at making the bitmap
> an RCU based setup so you could define the base and end along with the
> bitmap. It would probably help to resolve the hotplug issues you still
> need to address.

Thanks for the suggestion. I will look into it.

>
>>  enum pgdat_flags {
>> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
>> new file mode 100644
>> index 000000000000..37a39589939d
>> --- /dev/null
>> +++ b/include/linux/page_reporting.h
>> @@ -0,0 +1,63 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_PAGE_REPORTING_H
>> +#define _LINUX_PAGE_REPORTING_H
>> +
>> +#define PAGE_REPORTING_MIN_ORDER               (MAX_ORDER - 2)
>> +#define PAGE_REPORTING_MAX_PAGES               16
>> +
>> +#ifdef CONFIG_PAGE_REPORTING
>> +struct page_reporting_config {
>> +       /* function to hint batch of isolated pages */
>> +       void (*report)(struct page_reporting_config *phconf,
>> +                      unsigned int num_pages);
>> +
>> +       /* scatterlist to hold the isolated pages to be hinted */
>> +       struct scatterlist *sg;
>> +
>> +       /*
>> +        * Maxmimum pages that are going to be hinted to the hypervisor at a
>> +        * time of granularity >= PAGE_REPORTING_MIN_ORDER.
>> +        */
>> +       int max_pages;
>> +
>> +       /* work object to process page reporting rqeuests */
>> +       struct work_struct reporting_work;
>> +
>> +       /* tracks the number of reporting request processed at a time */
>> +       atomic_t refcnt;
>> +};
>> +
>> +void __page_reporting_enqueue(struct page *page);
>> +void __return_isolated_page(struct zone *zone, struct page *page);
>> +void set_pageblock_migratetype(struct page *page, int migratetype);
>> +
>> +/**
>> + * page_reporting_enqueue - checks the eligibility of the freed page based on
>> + * its order for further page reporting processing.
>> + * @page: page which has been freed.
>> + * @order: order for the the free page.
>> + */
>> +static inline void page_reporting_enqueue(struct page *page, int order)
>> +{
>> +       if (order < PAGE_REPORTING_MIN_ORDER)
>> +               return;
>> +       __page_reporting_enqueue(page);
>> +}
>> +
>> +int page_reporting_enable(struct page_reporting_config *phconf);
>> +void page_reporting_disable(struct page_reporting_config *phconf);
>> +#else
>> +static inline void page_reporting_enqueue(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline int page_reporting_enable(struct page_reporting_config *phconf)
>> +{
>> +       return -EOPNOTSUPP;
>> +}
>> +
>> +static inline void page_reporting_disable(struct page_reporting_config *phconf)
>> +{
>> +}
>> +#endif /* CONFIG_PAGE_REPORTING */
>> +#endif /* _LINUX_PAGE_REPORTING_H */
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 56cec636a1fc..6a35a9dad350 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -736,4 +736,10 @@ config ARCH_HAS_PTE_SPECIAL
>>  config ARCH_HAS_HUGEPD
>>         bool
>>
>> +# PAGE_REPORTING will allow the guest to report the free pages to the
>> +# host in fixed chunks as soon as a fixed threshold is reached.
>> +config PAGE_REPORTING
>> +       bool
>> +       def_bool n
>> +       depends on X86_64
>>  endmenu
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 338e528ad436..6a32cdfa61c2 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -104,3 +104,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
>>  obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
>>  obj-$(CONFIG_HMM_MIRROR) += hmm.o
>>  obj-$(CONFIG_MEMFD_CREATE) += memfd.o
>> +obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 272c6de1bf4e..aa7c89d50c85 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -68,6 +68,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/nmi.h>
>>  #include <linux/psi.h>
>> +#include <linux/page_reporting.h>
>>
>>  #include <asm/sections.h>
>>  #include <asm/tlbflush.h>
>> @@ -903,7 +904,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
>>  static inline void __free_one_page(struct page *page,
>>                 unsigned long pfn,
>>                 struct zone *zone, unsigned int order,
>> -               int migratetype)
>> +               int migratetype, bool needs_reporting)
>>  {
>>         unsigned long combined_pfn;
>>         unsigned long uninitialized_var(buddy_pfn);
>> @@ -1006,7 +1007,8 @@ static inline void __free_one_page(struct page *page,
>>                                 migratetype);
>>         else
>>                 add_to_free_area(page, &zone->free_area[order], migratetype);
>> -
>> +       if (needs_reporting)
>> +               page_reporting_enqueue(page, order);
>>  }
>>
>>  /*
>> @@ -1317,7 +1319,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>                 if (unlikely(isolated_pageblocks))
>>                         mt = get_pageblock_migratetype(page);
>>
>> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>>                 trace_mm_page_pcpu_drain(page, 0, mt);
>>         }
>>         spin_unlock(&zone->lock);
>> @@ -1326,14 +1328,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  static void free_one_page(struct zone *zone,
>>                                 struct page *page, unsigned long pfn,
>>                                 unsigned int order,
>> -                               int migratetype)
>> +                               int migratetype, bool needs_reporting)
>>  {
>>         spin_lock(&zone->lock);
>>         if (unlikely(has_isolate_pageblock(zone) ||
>>                 is_migrate_isolate(migratetype))) {
>>                 migratetype = get_pfnblock_migratetype(page, pfn);
>>         }
>> -       __free_one_page(page, pfn, zone, order, migratetype);
>> +       __free_one_page(page, pfn, zone, order, migratetype, needs_reporting);
>>         spin_unlock(&zone->lock);
>>  }
>>
>> @@ -1423,7 +1425,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>>         migratetype = get_pfnblock_migratetype(page, pfn);
>>         local_irq_save(flags);
>>         __count_vm_events(PGFREE, 1 << order);
>> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
>> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>>         local_irq_restore(flags);
>>  }
>>
>> @@ -2197,6 +2199,32 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>>         return NULL;
>>  }
>>
>> +#ifdef CONFIG_PAGE_REPORTING
>> +/**
>> + * return_isolated_page - returns a reported page back to the buddy.
>> + * @zone: zone from where the page was isolated.
>> + * @page: page which will be returned.
>> + */
>> +void __return_isolated_page(struct zone *zone, struct page *page)
>> +{
>> +       unsigned int order, mt;
>> +       unsigned long pfn;
>> +
>> +       /* zone lock should be held when this function is called */
>> +       lockdep_assert_held(&zone->lock);
>> +
>> +       mt = get_pageblock_migratetype(page);
>> +       pfn = page_to_pfn(page);
>> +
>> +       if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt)))
>> +               mt = get_pfnblock_migratetype(page, pfn);
>> +
>> +       order = page_private(page);
>> +       set_page_private(page, 0);
>> +
>> +       __free_one_page(page, pfn, zone, order, mt, false);
>> +}
>> +#endif /* CONFIG_PAGE_REPORTING */
>>
>>  /*
>>   * This array describes the order lists are fallen back to when
>> @@ -3041,7 +3069,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>>          */
>>         if (migratetype >= MIGRATE_PCPTYPES) {
>>                 if (unlikely(is_migrate_isolate(migratetype))) {
>> -                       free_one_page(zone, page, pfn, 0, migratetype);
>> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>>                         return;
>>                 }
>>                 migratetype = MIGRATE_MOVABLE;
>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
>> new file mode 100644
>> index 000000000000..4ee2c172cd9a
>> --- /dev/null
>> +++ b/mm/page_reporting.c
>> @@ -0,0 +1,332 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Page reporting core infrastructure to enable a VM to report free pages to its
>> + * hypervisor.
>> + *
>> + * Copyright Red Hat, Inc. 2019
>> + *
>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <linux/page_reporting.h>
>> +#include <linux/scatterlist.h>
>> +#include "internal.h"
>> +
>> +static struct page_reporting_config __rcu *page_reporting_conf __read_mostly;
>> +static DEFINE_MUTEX(page_reporting_mutex);
>> +
>> +static inline unsigned long pfn_to_bit(struct page *page, struct zone *zone)
>> +{
>> +       unsigned long bitnr;
>> +
>> +       bitnr = (page_to_pfn(page) - zone->base_pfn) >>
>> +               PAGE_REPORTING_MIN_ORDER;
>> +
>> +       return bitnr;
>> +}
>> +
>> +static void return_isolated_page(struct zone *zone,
>> +                                struct page_reporting_config *phconf)
>> +{
>> +       struct scatterlist *sg = phconf->sg;
>> +
>> +       spin_lock(&zone->lock);
>> +       do {
>> +               __return_isolated_page(zone, sg_page(sg));
>> +       } while (!sg_is_last(sg++));
>> +       spin_unlock(&zone->lock);
>> +}
>> +
>> +static void bitmap_set_bit(struct page *page, struct zone *zone)
>> +{
>> +       unsigned long bitnr = 0;
>> +
>> +       /* zone lock should be held when this function is called */
>> +       lockdep_assert_held(&zone->lock);
>> +
> So why does the zone lock need to be held? What could you possibly
> race against? If nothing else it might make more sense to look at
> moving the bitmap, base, end, and length values into one RCU
> allocation structure so that you could do without the requirement of
> the zone lock to manipulate the bitmap.

Makes sense to me.

>> +       bitnr = pfn_to_bit(page, zone);
>> +       /* set bit if it is not already set and is a valid bit */
>> +       if (zone->bitmap && bitnr < zone->nbits &&
>> +           !test_and_set_bit(bitnr, zone->bitmap))
>> +               atomic_inc(&zone->free_pages);
>> +}
>> +
>> +static int process_free_page(struct page *page,
>> +                            struct page_reporting_config *phconf, int count)
>> +{
>> +       int mt, order, ret = 0;
>> +
>> +       mt = get_pageblock_migratetype(page);
>> +       order = page_private(page);
>> +       ret = __isolate_free_page(page, order);
>> +
>> +       if (ret) {
>> +               /*
>> +                * Preserving order and migratetype for reuse while
>> +                * releasing the pages back to the buddy.
>> +                */
>> +               set_pageblock_migratetype(page, mt);
>> +               set_page_private(page, order);
>> +
>> +               sg_set_page(&phconf->sg[count++], page,
>> +                           PAGE_SIZE << order, 0);
>> +       }
>> +
>> +       return count;
>> +}
>> +
>> +/**
>> + * scan_zone_bitmap - scans the bitmap for the requested zone.
>> + * @phconf: page reporting configuration object initialized by the backend.
>> + * @zone: zone for which page reporting is requested.
>> + *
>> + * For every page marked in the bitmap it checks if it is still free if so it
>> + * isolates and adds them to a scatterlist. As soon as the number of isolated
>> + * pages reach the threshold set by the backend, they are reported to the
>> + * hypervisor by the backend. Once the hypervisor responds after processing
>> + * they are returned back to the buddy for reuse.
>> + */
>> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
>> +                            struct zone *zone)
>> +{
>> +       unsigned long setbit;
>> +       struct page *page;
>> +       int count = 0;
>> +
>> +       sg_init_table(phconf->sg, phconf->max_pages);
>> +
>> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
>> +               /* Process only if the page is still online */
>> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
>> +                                         zone->base_pfn);
>> +               if (!page)
>> +                       continue;
>> +
> Shouldn't you be clearing the bit and dropping the reference to
> free_pages before you move on to the next bit? Otherwise you are going
> to be stuck with those aren't you?

+1. Thanks for pointing this out.

>> +               spin_lock(&zone->lock);
>> +
>> +               /* Ensure page is still free and can be processed */
>> +               if (PageBuddy(page) && page_private(page) >=
>> +                   PAGE_REPORTING_MIN_ORDER)
>> +                       count = process_free_page(page, phconf, count);
>> +
>> +               spin_unlock(&zone->lock);
> So I kind of wonder just how much overhead you are taking for bouncing
> the zone lock once per page here. Especially since it can result in
> you not actually making any progress since the page may have already
> been reallocated.

Any suggestion about how can I see the overhead involved here?

>> +               /* Page has been processed, adjust its bit and zone counter */
>> +               clear_bit(setbit, zone->bitmap);
>> +               atomic_dec(&zone->free_pages);
> So earlier you were setting this bit and required that the zone->lock
> be held while doing so. Here you are clearing it without any
> zone->lock being held.

I should have held the zone->lock here while clearing the bit.

>
>> +               if (count == phconf->max_pages) {
>> +                       /* Report isolated pages to the hypervisor */
>> +                       phconf->report(phconf, count);
>> +
>> +                       /* Return processed pages back to the buddy */
>> +                       return_isolated_page(zone, phconf);
>> +
>> +                       /* Reset for next reporting */
>> +                       sg_init_table(phconf->sg, phconf->max_pages);
>> +                       count = 0;
>> +               }
>> +       }
>> +       /*
>> +        * If the number of isolated pages does not meet the max_pages
>> +        * threshold, we would still prefer to report them as we have already
>> +        * isolated them.
>> +        */
>> +       if (count) {
>> +               sg_mark_end(&phconf->sg[count - 1]);
>> +               phconf->report(phconf, count);
>> +
>> +               return_isolated_page(zone, phconf);
>> +       }
>> +}
>> +
>> +/**
>> + * page_reporting_wq - checks the number of free_pages in all the zones and
>> + * invokes a request to scan the respective bitmap if free_pages reaches or
>> + * exceeds the threshold specified by the backend.
>> + */
>> +static void page_reporting_wq(struct work_struct *work)
>> +{
>> +       struct page_reporting_config *phconf =
>> +               container_of(work, struct page_reporting_config,
>> +                            reporting_work);
>> +       struct zone *zone;
>> +
>> +       for_each_populated_zone(zone) {
>> +               if (atomic_read(&zone->free_pages) >= phconf->max_pages)
>> +                       scan_zone_bitmap(phconf, zone);
>> +       }
>> +       /*
>> +        * We have processed all the zones, we can process new page reporting
>> +        * request now.
>> +        */
>> +       atomic_set(&phconf->refcnt, 0);
> It doesn't make sense to reset the refcnt once you have made a single pass.
>
>> +}
>> +
>> +/**
>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
>> + */
>> +void __page_reporting_enqueue(struct page *page)
>> +{
>> +       struct page_reporting_config *phconf;
>> +       struct zone *zone;
>> +
>> +       rcu_read_lock();
>> +       /*
>> +        * We should not process this page if either page reporting is not
>> +        * yet completely enabled or it has been disabled by the backend.
>> +        */
>> +       phconf = rcu_dereference(page_reporting_conf);
>> +       if (!phconf)
>> +               return;
>> +
>> +       zone = page_zone(page);
>> +       bitmap_set_bit(page, zone);
>> +
>> +       /*
>> +        * We should not enqueue a job if a previously enqueued reporting work
>> +        * is in progress or we don't have enough free pages in the zone.
>> +        */
>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
> This doesn't make any sense to me. Why are you only incrementing the
> refcount if it is zero? Combining this with the assignment above, this
> isn't really a refcnt. It is just an oversized bitflag.
>

The reason why I have this here is that at a time I only want a single request
to be en-queued.
Now, every time free_page threshold for any zone is met I am checking each zone
for possible reporting.

I think there are two change required here:
1. rename this flag.
2. use refcnt to actually track the usage of page_hinting_config object separately.

> Also I am pretty sure this results in the opportunity to miss pages
> because there is nothing to prevent you from possibly missing a ton of
> pages you could hint on if a large number of pages are pushed out all
> at once and then the system goes idle in terms of memory allocation
> and freeing.

I have failed to reproduce this kind of situation.
I have a test app which simply allocates large memory chunk, touches it and then
exits. In this case, I get most of the memory back.

>
>> +               schedule_work(&phconf->reporting_work);
>> +
>> +       rcu_read_unlock();
>> +}
>> +
>> +/**
>> + * zone_reporting_cleanup - resets the page reporting fields and free the
>> + * bitmap for all the initialized zones.
>> + */
>> +static void zone_reporting_cleanup(void)
>> +{
>> +       struct zone *zone;
>> +
>> +       for_each_populated_zone(zone) {
>> +               /*
>> +                * Bitmap may not be allocated for all the zones if the
>> +                * initialization fails before reaching to the last one.
>> +                */
>> +               if (!zone->bitmap)
>> +                       continue;
>> +               bitmap_free(zone->bitmap);
>> +               zone->bitmap = NULL;
>> +               atomic_set(&zone->free_pages, 0);
>> +       }
>> +}
>> +
>> +static int zone_bitmap_alloc(struct zone *zone)
>> +{
>> +       unsigned long bitmap_size, pages;
>> +
>> +       pages = zone->end_pfn - zone->base_pfn;
>> +       bitmap_size = (pages >> PAGE_REPORTING_MIN_ORDER) + 1;
>> +
>> +       if (!bitmap_size)
>> +               return 0;
>> +
>> +       zone->bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
>> +       if (!zone->bitmap)
>> +               return -ENOMEM;
>> +
>> +       zone->nbits = bitmap_size;
>> +
>> +       return 0;
>> +}
>> +
> So as per your comments in the cover page, the two functions above
> should also probably be plugged into the zone resizing logic somewhere
> so if a zone is resized the bitmap is adjusted.

I think the right way will be to have a common interface which could be called
here and in memory hotplug/unplug case.
Right now, in my prototype, I have a different functions which adjusts the size
of the bitmap based on the memory notifier action.

>
>> +/**
>> + * zone_reporting_init - For each zone initializes the page reporting fields
>> + * and allocates the respective bitmap.
>> + *
>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>> + */
>> +static int zone_reporting_init(void)
>> +{
>> +       struct zone *zone;
>> +       int ret;
>> +
>> +       for_each_populated_zone(zone) {
>> +#ifdef CONFIG_ZONE_DEVICE
>> +               /* we can not report pages which are not in the buddy */
>> +               if (zone_idx(zone) == ZONE_DEVICE)
>> +                       continue;
>> +#endif
> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
> zone will be considered "populated".

hmm, I was not aware of it. I will dig in more about it.

>
>> +               spin_lock(&zone->lock);
>> +               zone->base_pfn = zone->zone_start_pfn;
>> +               zone->end_pfn = zone_end_pfn(zone);
>> +               spin_unlock(&zone->lock);
>> +
>> +               ret = zone_bitmap_alloc(zone);
>> +               if (ret < 0) {
>> +                       zone_reporting_cleanup();
>> +                       return ret;
>> +               }
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +void page_reporting_disable(struct page_reporting_config *phconf)
>> +{
>> +       mutex_lock(&page_reporting_mutex);
>> +
>> +       if (rcu_access_pointer(page_reporting_conf) != phconf)
>> +               return;
>> +
>> +       RCU_INIT_POINTER(page_reporting_conf, NULL);
>> +       synchronize_rcu();
>> +
>> +       /* Cancel any pending reporting request */
>> +       cancel_work_sync(&phconf->reporting_work);
>> +
>> +       /* Free the scatterlist used for isolated pages */
>> +       kfree(phconf->sg);
>> +       phconf->sg = NULL;
>> +
>> +       /* Cleanup the bitmaps and old tracking data */
>> +       zone_reporting_cleanup();
>> +
>> +       mutex_unlock(&page_reporting_mutex);
>> +}
>> +EXPORT_SYMBOL_GPL(page_reporting_disable);
>> +
>> +int page_reporting_enable(struct page_reporting_config *phconf)
>> +{
>> +       int ret = 0;
>> +
>> +       mutex_lock(&page_reporting_mutex);
>> +
>> +       /* check if someone is already using page reporting*/
>> +       if (rcu_access_pointer(page_reporting_conf)) {
>> +               ret = -EBUSY;
>> +               goto out;
>> +       }
>> +
>> +       /* allocate scatterlist to hold isolated pages */
>> +       phconf->sg = kcalloc(phconf->max_pages, sizeof(*phconf->sg),
>> +                            GFP_KERNEL);
>> +       if (!phconf->sg) {
>> +               ret = -ENOMEM;
>> +               goto out;
>> +       }
>> +
>> +       /* initialize each zone's fields required for page reporting */
>> +       ret = zone_reporting_init();
>> +       if (ret < 0) {
>> +               kfree(phconf->sg);
>> +               goto out;
>> +       }
>> +
>> +       atomic_set(&phconf->refcnt, 0);
>> +       INIT_WORK(&phconf->reporting_work, page_reporting_wq);
>> +
>> +       /* assign the configuration object provided by the backend */
>> +       rcu_assign_pointer(page_reporting_conf, phconf);
>> +
>> +out:
>> +       mutex_unlock(&page_reporting_mutex);
>> +       return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(page_reporting_enable);
>> --
>> 2.21.0
>>
-- 
Thanks
Nitesh