linux-kernel.vger.kernel.org archive mirror
From: Michal Hocko <mhocko@kernel.org>
To: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	virtio-dev@lists.oasis-open.org, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
	linux-mm@kvack.org, mst@redhat.com, pbonzini@redhat.com,
	liliang.opensource@gmail.com, yang.zhang.wz@gmail.com,
	quan.xu0@gmail.com, nilal@redhat.com, riel@redhat.com,
	huangzhichao@huawei.com
Subject: Re: [PATCH v29 1/4] mm: support reporting free page blocks
Date: Tue, 27 Mar 2018 08:33:22 +0200
Message-ID: <20180327063322.GW5652@dhcp22.suse.cz>
In-Reply-To: <5AB9E377.30900@intel.com>

On Tue 27-03-18 14:23:51, Wei Wang wrote:
> On 03/27/2018 05:22 AM, Andrew Morton wrote:
> > On Mon, 26 Mar 2018 10:39:51 +0800 Wei Wang <wei.w.wang@intel.com> wrote:
> > 
> > > This patch adds support to walk through the free page blocks in the
> > > system and report them via a callback function. Some page blocks may
> > > leave the free list after zone->lock is released, so it is the caller's
> > > responsibility to either detect or prevent the use of such pages.
> > > 
> > > One use example of this patch is to accelerate live migration by skipping
> > > the transfer of free pages reported from the guest. A popular method used
> > > by the hypervisor to track which part of memory is written during live
> > > migration is to write-protect all the guest memory. So, those pages that
> > > are reported as free pages but are written after the report function
> > > returns will be captured by the hypervisor, and they will be added to the
> > > next round of memory transfer.
> > > 
> > > ...
> > > 
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4912,6 +4912,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> > >   	show_swap_cache_info();
> > >   }
> > > +/*
> > > + * Walk through a free page list and report the found pfn range via the
> > > + * callback.
> > > + *
> > > + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> > > + * value returned from the callback.
> > > + */
> > > +static int walk_free_page_list(void *opaque,
> > > +			       struct zone *zone,
> > > +			       int order,
> > > +			       enum migratetype mt,
> > > +			       int (*report_pfn_range)(void *,
> > > +						       unsigned long,
> > > +						       unsigned long))
> > > +{
> > > +	struct page *page;
> > > +	struct list_head *list;
> > > +	unsigned long pfn, flags;
> > > +	int ret = 0;
> > > +
> > > +	spin_lock_irqsave(&zone->lock, flags);
> > > +	list = &zone->free_area[order].free_list[mt];
> > > +	list_for_each_entry(page, list, lru) {
> > > +		pfn = page_to_pfn(page);
> > > +		ret = report_pfn_range(opaque, pfn, 1 << order);
> > > +		if (ret)
> > > +			break;
> > > +	}
> > > +	spin_unlock_irqrestore(&zone->lock, flags);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * walk_free_mem_block - Walk through the free page blocks in the system
> > > + * @opaque: the context passed from the caller
> > > + * @min_order: the minimum order of free lists to check
> > > + * @report_pfn_range: the callback to report the pfn range of the free pages
> > > + *
> > > + * If the callback returns a non-zero value, stop iterating the list of free
> > > + * page blocks. Otherwise, continue to report.
> > > + *
> > > + * Please note that there are no locking guarantees for the callback and
> > > + * that the reported pfn range might be freed or disappear after the
> > > + * callback returns so the caller has to be very careful how it is used.
> > > + *
> > > + * The callback itself must not sleep or perform any operations which would
> > > + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> > > + * or via any lock dependency. It is generally advisable to implement
> > > + * the callback as simple as possible and defer any heavy lifting to a
> > > + * different context.
> > > + *
> > > + * There is no guarantee that each free range will be reported only once
> > > + * during one walk_free_mem_block invocation.
> > > + *
> > > + * pfn_to_page on the given range is strongly discouraged and if there is
> > > + * an absolute need for that make sure to contact MM people to discuss
> > > + * potential problems.
> > > + *
> > > + * The function itself might sleep so it cannot be called from atomic
> > > + * contexts.
> > I don't see how walk_free_mem_block() can sleep.
> 
> OK, it would be better to remove this sentence for the current version. But
> I think we could probably keep it if we decide to add cond_resched() below.

The point of this sentence was to make any user aware that the function
might sleep from the very beginning rather than chase existing callers
when we need to add cond_resched or sleep for any other reason. So I
would rather keep it.
-- 
Michal Hocko
SUSE Labs
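
The contract discussed above (the walker is documented as possibly
sleeping, while the callback runs under zone->lock and must neither sleep
nor allocate) can be illustrated with a small userspace model. This is
only a sketch under stated assumptions, not the patch's code: model_walk(),
model_might_sleep(), in_atomic_ctx, struct collector and collect_range()
are hypothetical names invented here. Only the callback signature mirrors
the quoted patch. In the kernel, an annotation such as might_sleep() is one
way to let debug builds catch atomic callers early, which is the kind of
up-front guarantee the kept sentence is meant to provide.

```c
/* Userspace model for illustration only; this is not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RANGES 8

struct pfn_range {
	unsigned long pfn;
	unsigned long nr_pages;
};

/* Hypothetical stand-in for the kernel's own atomic-context tracking. */
static bool in_atomic_ctx;

/* Hypothetical stand-in for might_sleep(): trip if the caller is atomic. */
static void model_might_sleep(void)
{
	assert(!in_atomic_ctx);
}

/* Preallocated storage, so the callback never needs to allocate. */
struct collector {
	struct pfn_range ranges[MAX_RANGES];
	size_t count;
};

/*
 * Callback with the same signature as report_pfn_range in the patch.
 * It only copies two integers; any heavy lifting is left to the caller
 * after the walk returns. A non-zero return stops the walk.
 */
static int collect_range(void *opaque, unsigned long pfn,
			 unsigned long nr_pages)
{
	struct collector *c = opaque;

	if (c->count == MAX_RANGES)
		return 1;
	c->ranges[c->count].pfn = pfn;
	c->ranges[c->count].nr_pages = nr_pages;
	c->count++;
	return 0;
}

/*
 * Model of walking one free list of a given order. The array of first
 * pfns replaces zone->free_area[order].free_list[mt].
 */
static int model_walk(void *opaque, const unsigned long *pfns, size_t n,
		      int order,
		      int (*report)(void *, unsigned long, unsigned long))
{
	size_t i;
	int ret = 0;

	model_might_sleep();	/* documented contract: may sleep */

	in_atomic_ctx = true;	/* models spin_lock_irqsave(&zone->lock) */
	for (i = 0; i < n; i++) {
		ret = report(opaque, pfns[i], 1UL << order);
		if (ret)
			break;
	}
	in_atomic_ctx = false;	/* models spin_unlock_irqrestore() */

	return ret;
}
```

A caller would pass a zeroed struct collector as the opaque pointer and
inspect it after model_walk() returns. Because the check sits at the top
of the walker, a later sleep point (the cond_resched() mentioned above,
for instance) would not change what callers are allowed to do.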

