From: Wei Wang <wei.w.wang@intel.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: virtio-dev@lists.oasis-open.org, linux-kernel@vger.kernel.org,
qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org,
kvm@vger.kernel.org, linux-mm@kvack.org, mst@redhat.com,
akpm@linux-foundation.org, mawilcox@microsoft.com,
david@redhat.com, cornelia.huck@de.ibm.com,
mgorman@techsingularity.net, aarcange@redhat.com,
amit.shah@redhat.com, pbonzini@redhat.com, willy@infradead.org,
liliang.opensource@gmail.com, yang.zhang.wz@gmail.com,
quan.xu@aliyun.com
Subject: Re: [PATCH v14 4/5] mm: support reporting free page blocks
Date: Mon, 21 Aug 2017 14:12:47 +0800 [thread overview]
Message-ID: <599A79DF.2000707@intel.com> (raw)
In-Reply-To: <20170818134650.GC18499@dhcp22.suse.cz>
On 08/18/2017 09:46 PM, Michal Hocko wrote:
> On Thu 17-08-17 11:26:55, Wei Wang wrote:
>> This patch adds support to walk through the free page blocks in the
>> system and report them via a callback function. Some page blocks may
>> leave the free list after zone->lock is released, so it is the caller's
>> responsibility to either detect or prevent the use of such pages.
> This could see more details to be honest. Especially the usecase you are
> going to use this for. This will help us to understand the motivation
> in future when the current user might be gone a new ones largely diverge
> into a different usage. This wouldn't be the first time I have seen
> something like that.
OK, I will more details here about how it's used to accelerate live
migration.
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> ---
>> include/linux/mm.h | 6 ++++++
>> mm/page_alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 50 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 46b9ac5..cd29b9f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>> unsigned long zone_start_pfn, unsigned long *zholes_size);
>> extern void free_initmem(void);
>>
>> +extern void walk_free_mem_block(void *opaque1,
>> + unsigned int min_order,
>> + void (*visit)(void *opaque2,
>> + unsigned long pfn,
>> + unsigned long nr_pages));
>> +
>> /*
>> * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>> * into the buddy system. The freed pages will be poisoned with pattern
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6d00f74..a721a35 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>> show_swap_cache_info();
>> }
>>
>> +/**
>> + * walk_free_mem_block - Walk through the free page blocks in the system
>> + * @opaque1: the context passed from the caller
>> + * @min_order: the minimum order of free lists to check
>> + * @visit: the callback function given by the caller
> The original suggestion for using visit was motivated by a visit design
> pattern but I can see how this can be confusing. Maybe a more explicit
> name wold be better. What about report_free_range.
I'm afraid that name would be too long to fit in nicely.
How about simply naming it "report"?
>
>> + *
>> + * The function is used to walk through the free page blocks in the system,
>> + * and each free page block is reported to the caller via the @visit callback.
>> + * Please note:
>> + * 1) The function is used to report hints of free pages, so the caller should
>> + * not use those reported pages after the callback returns.
>> + * 2) The callback is invoked with the zone->lock being held, so it should not
>> + * block and should finish as soon as possible.
> I think that the explicit note about zone->lock is not really need. This
> can change in future and I would even bet that somebody might rely on
> the lock being held for some purpose and silently get broken with the
> change. Instead I would much rather see something like the following:
> "
> Please note that there are no locking guarantees for the callback
Just a little confused with this one:
The callback is invoked within zone->lock, why would we claim it "no
locking guarantees for the callback"?
> and
> that the reported pfn range might be freed or disappear after the
> callback returns so the caller has to be very careful how it is used.
>
> The callback itself must not sleep or perform any operations which would
> require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> or via any lock dependency. It is generally advisable to implement
> the callback as simple as possible and defer any heavy lifting to a
> different context.
>
> There is no guarantee that each free range will be reported only once
> during one walk_free_mem_block invocation.
>
> pfn_to_page on the given range is strongly discouraged and if there is
> an absolute need for that make sure to contact MM people to discuss
> potential problems.
>
> The function itself might sleep so it cannot be called from atomic
> contexts.
>
> In general low orders tend to be very volatile and so it makes more
> sense to query larger ones for various optimizations which like
> ballooning etc... This will reduce the overhead as well.
> "
I think it looks quite comprehensive. Thanks.
Best,
Wei
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-08-21 6:09 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-08-17 3:26 [PATCH v14 0/5] Virtio-balloon Enhancement Wei Wang
2017-08-17 3:26 ` [PATCH v14 1/5] lib/xbitmap: Introduce xbitmap Wei Wang
2017-08-19 20:30 ` kbuild test robot
2017-08-17 3:26 ` [PATCH v14 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero() Wei Wang
2017-08-17 3:26 ` [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
2017-08-18 2:22 ` Michael S. Tsirkin
2017-08-18 7:39 ` Wei Wang
2017-08-21 20:22 ` Michael S. Tsirkin
2017-08-19 21:37 ` kbuild test robot
2017-08-17 3:26 ` [PATCH v14 4/5] mm: support reporting free page blocks Wei Wang
2017-08-18 13:46 ` Michal Hocko
2017-08-21 6:12 ` Wei Wang [this message]
2017-08-21 6:14 ` Michal Hocko
2017-08-18 17:23 ` Michael S. Tsirkin
2017-08-21 6:18 ` Michal Hocko
2017-08-17 3:26 ` [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ Wei Wang
2017-08-18 2:13 ` Michael S. Tsirkin
2017-08-18 8:41 ` Wei Wang
2017-08-18 18:26 ` Michael S. Tsirkin
2017-08-21 5:21 ` Wei Wang
2017-08-18 2:28 ` Michael S. Tsirkin
2017-08-18 8:36 ` Wei Wang
2017-08-18 18:10 ` Michael S. Tsirkin
2017-08-21 5:28 ` [virtio-dev] " Wei Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=599A79DF.2000707@intel.com \
--to=wei.w.wang@intel.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=amit.shah@redhat.com \
--cc=cornelia.huck@de.ibm.com \
--cc=david@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=liliang.opensource@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mawilcox@microsoft.com \
--cc=mgorman@techsingularity.net \
--cc=mhocko@kernel.org \
--cc=mst@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=quan.xu@aliyun.com \
--cc=virtio-dev@lists.oasis-open.org \
--cc=virtualization@lists.linux-foundation.org \
--cc=willy@infradead.org \
--cc=yang.zhang.wz@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).