From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752360AbcLFErg convert rfc822-to-8bit (ORCPT <rfc822;w@1wt.eu>);
        Mon, 5 Dec 2016 23:47:36 -0500
Received: from mga02.intel.com ([134.134.136.20]:28415 "EHLO mga02.intel.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751327AbcLFErd (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 5 Dec 2016 23:47:33 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.33,751,1477983600"; 
   d="scan'208";a="1068425004"
From: "Li, Liang Z" <liang.z.li@intel.com>
To: "Hansen, Dave" <dave.hansen@intel.com>,
        "kvm@vger.kernel.org" <kvm@vger.kernel.org>
CC: "virtualization@lists.linux-foundation.org" 
        <virtualization@lists.linux-foundation.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        "linux-mm@kvack.org" <linux-mm@kvack.org>,
        "virtio-dev@lists.oasis-open.org" <virtio-dev@lists.oasis-open.org>,
        "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
        "quintela@redhat.com" <quintela@redhat.com>,
        "dgilbert@redhat.com" <dgilbert@redhat.com>,
        "mst@redhat.com" <mst@redhat.com>,
        "jasowang@redhat.com" <jasowang@redhat.com>,
        "kirill.shutemov@linux.intel.com" <kirill.shutemov@linux.intel.com>,
        "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
        "mhocko@suse.com" <mhocko@suse.com>,
        "pbonzini@redhat.com" <pbonzini@redhat.com>,
        Mel Gorman <mgorman@techsingularity.net>,
        Cornelia Huck <cornelia.huck@de.ibm.com>,
        "Amit Shah" <amit.shah@redhat.com>
Subject: RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused
 page info
Thread-Topic: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused
 page info
Thread-Index: AQHSSuf7C044VX4sFkSRMSEWWM0WiqDxYEaAgAZgGRCAAVv/gIABHPvQ
Date: Tue, 6 Dec 2016 04:47:27 +0000
Message-ID: <F2CBF3009FA73547804AE4C663CAB28E3A12F831@shsmsx102.ccr.corp.intel.com>
References: <1480495397-23225-1-git-send-email-liang.z.li@intel.com>
 <1480495397-23225-6-git-send-email-liang.z.li@intel.com>
 <438dd41a-fdf1-2a77-ef9c-8c103f492b2f@intel.com>
 <F2CBF3009FA73547804AE4C663CAB28E3A12D814@shsmsx102.ccr.corp.intel.com>
 <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com>
In-Reply-To: <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com>
Accept-Language: zh-CN, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiZTk0ZTYwZTEtMDJhOC00ZTI3LTk5MWMtMGQ0OWI5MzI1NjY4IiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE1LjkuNi42IiwiVHJ1c3RlZExhYmVsSGFzaCI6IjE3ZUNlK2RRVkJFSStPMEFxcnl4NG9iSUpCWUFGdnRtMU1icEh3Q3BONnc9In0=
x-ctpclassification: CTP_IC
x-originating-ip: [10.239.127.40]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8BIT
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

> >>> +	mutex_lock(&vb->balloon_lock);
> >>> +
> >>> +	for (order = MAX_ORDER - 1; order >= 0; order--) {
> >>
> >> I scratched my head for a bit on this one.  Why are you walking over
> >> orders,
> >> *then* zones.  I *think* you're doing it because you can efficiently
> >> fill the bitmaps at a given order for all zones, then move to a new
> >> bitmap.  But, it would be interesting to document this.
> >
> > Yes, use the order is somewhat strange, but it's helpful to keep the API simple.
> > Do you think it's acceptable?
> 
> Yeah, it's fine.  Just comment it, please.
> 
Good!

> >>> +		if (ret == -ENOSPC) {
> >>> +			void *new_resp_data;
> >>> +
> >>> +			new_resp_data = kmalloc(2 * vb->resp_buf_size,
> >>> +						GFP_KERNEL);
> >>> +			if (new_resp_data) {
> >>> +				kfree(vb->resp_data);
> >>> +				vb->resp_data = new_resp_data;
> >>> +				vb->resp_buf_size *= 2;
> >>
> >> What happens to the data in ->resp_data at this point?  Doesn't this
> >> just throw it away?
> >
> > Yes, so we should make sure the data in resp_data is not inuse.
> 
> But doesn't it have valid data that we just collected and haven't told the
> hypervisor about yet?  Aren't we throwing away good data that cost us
> something to collect?

Indeed.  Some filled data may exist for the previous zone. Should we
change the API to 
'int get_unused_pages(unsigned long *unused_pages, unsigned long size,
		int order, unsigned long *pos, struct zone *zone)' ?

then we can use the 'zone' to record the zone to retry and not discard the
filled data.

> >> ...
> >>> +struct page_info_item {
> >>> +	__le64 start_pfn : 52; /* start pfn for the bitmap */
> >>> +	__le64 page_shift : 6; /* page shift width, in bytes */
> 
> What does a page_shift "in bytes" mean? :)

Obviously, you know. :o
I will try to make it clear.

> 
> >>> +	__le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> >>
> >> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> >
> > Currently, we just use the 8 bytes and 0 bytes bitmap, should we support
> more than 64 bytes?
> 
> It just means that with this format, you end up wasting at least ~1/8th of the
> space with metadata.  That's a bit unfortunate, but I guess it's not fatal.
> 
> I'd definitely call it out in the patch description and make sure other folks take
> a look at it.

OK.

> 
> There's a somewhat easy fix, but that would make the qemu implementation
> more complicated: You could just have bmap_len==0x3f imply that there's
> another field that contains an extended bitmap length for when you need long
> bitmaps.
> 
> But, as you note, there's no need for it, so it's a matter of trading the extra
> complexity versus the desire to not habing to change the ABI again for longer
> (hopefully).
> 

Your suggestion still works without changing the current code, just reserve
 ' bmap_len==0x3f' for future extension, and it's not used by the current code.

> >>> +static int  mark_unused_pages(struct zone *zone,
> >>> +		unsigned long *unused_pages, unsigned long size,
> >>> +		int order, unsigned long *pos)
> >>> +{
> >>> +	unsigned long pfn, flags;
> >>> +	unsigned int t;
> >>> +	struct list_head *curr;
> >>> +	struct page_info_item *info;
> >>> +
> >>> +	if (zone_is_empty(zone))
> >>> +		return 0;
> >>> +
> >>> +	spin_lock_irqsave(&zone->lock, flags);
> >>> +
> >>> +	if (*pos + zone->free_area[order].nr_free > size)
> >>> +		return -ENOSPC;
> >>
> >> Urg, so this won't partially fill?  So, what the nr_free pages limit
> >> where we no longer fit in the kmalloc()'d buffer where this simply won't
> work?
> >
> > Yes.  My initial implementation is partially fill, it's better for the worst case.
> > I thought the above code is more efficient for most case ...
> > Do you think partially fill the bitmap is better?
> 
> Could you please answer the question I asked?
> 

For your question:
-------------------------------------------------------------------------------------------------------
>So, what the nr_free pages limit where we no longer fit in the kmalloc()'d buffer
> where this simply won't work?
------------------------------------------------------------------------------------------------------
No, if the buffer is not big enough to save 'nr_free'  pages, get_unused_pages() will return
'-ENOSPC', and the following code will try to allocate a 2x times size buffer for retrying,
until the proper size buffer is allocated. The current order will not be skipped unless the
buffer allocation failed.

> Because if you don't get this right, it could mean that there are system that
> simply *fail* here.

From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Li, Liang Z" <liang.z.li@intel.com>
Subject: RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused
	page info
Date: Tue, 6 Dec 2016 04:47:27 +0000
Message-ID: <F2CBF3009FA73547804AE4C663CAB28E3A12F831@shsmsx102.ccr.corp.intel.com>
References: <1480495397-23225-1-git-send-email-liang.z.li@intel.com>
	<1480495397-23225-6-git-send-email-liang.z.li@intel.com>
	<438dd41a-fdf1-2a77-ef9c-8c103f492b2f@intel.com>
	<F2CBF3009FA73547804AE4C663CAB28E3A12D814@shsmsx102.ccr.corp.intel.com>
	<70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Cc: "virtio-dev@lists.oasis-open.org" <virtio-dev@lists.oasis-open.org>,
	"mhocko@suse.com" <mhocko@suse.com>, Amit Shah <amit.shah@redhat.com>,
	"mst@redhat.com" <mst@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"kirill.shutemov@linux.intel.com" <kirill.shutemov@linux.intel.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"virtualization@lists.linux-foundation.org"
	<virtualization@lists.linux-foundation.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>
To: "Hansen, Dave" <dave.hansen@intel.com>, "kvm@vger.kernel.org"
	<kvm@vger.kernel.org>
Return-path: <virtualization-bounces@lists.linux-foundation.org>
In-Reply-To: <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com>
Content-Language: en-US
List-Unsubscribe: <https://lists.linuxfoundation.org/mailman/options/virtualization>,
	<mailto:virtualization-request@lists.linux-foundation.org?subject=unsubscribe>
List-Archive: <http://lists.linuxfoundation.org/pipermail/virtualization/>
List-Post: <mailto:virtualization@lists.linux-foundation.org>
List-Help: <mailto:virtualization-request@lists.linux-foundation.org?subject=help>
List-Subscribe: <https://lists.linuxfoundation.org/mailman/listinfo/virtualization>,
	<mailto:virtualization-request@lists.linux-foundation.org?subject=subscribe>
Sender: virtualization-bounces@lists.linux-foundation.org
Errors-To: virtualization-bounces@lists.linux-foundation.org
List-Id: kvm.vger.kernel.org

> >>> +	mutex_lock(&vb->balloon_lock);
> >>> +
> >>> +	for (order = MAX_ORDER - 1; order >= 0; order--) {
> >>
> >> I scratched my head for a bit on this one.  Why are you walking over
> >> orders,
> >> *then* zones.  I *think* you're doing it because you can efficiently
> >> fill the bitmaps at a given order for all zones, then move to a new
> >> bitmap.  But, it would be interesting to document this.
> >
> > Yes, use the order is somewhat strange, but it's helpful to keep the API simple.
> > Do you think it's acceptable?
> 
> Yeah, it's fine.  Just comment it, please.
> 
Good!

> >>> +		if (ret == -ENOSPC) {
> >>> +			void *new_resp_data;
> >>> +
> >>> +			new_resp_data = kmalloc(2 * vb->resp_buf_size,
> >>> +						GFP_KERNEL);
> >>> +			if (new_resp_data) {
> >>> +				kfree(vb->resp_data);
> >>> +				vb->resp_data = new_resp_data;
> >>> +				vb->resp_buf_size *= 2;
> >>
> >> What happens to the data in ->resp_data at this point?  Doesn't this
> >> just throw it away?
> >
> > Yes, so we should make sure the data in resp_data is not inuse.
> 
> But doesn't it have valid data that we just collected and haven't told the
> hypervisor about yet?  Aren't we throwing away good data that cost us
> something to collect?

Indeed.  Some filled data may exist for the previous zone. Should we
change the API to 
'int get_unused_pages(unsigned long *unused_pages, unsigned long size,
		int order, unsigned long *pos, struct zone *zone)' ?

then we can use the 'zone' to record the zone to retry and not discard the
filled data.

> >> ...
> >>> +struct page_info_item {
> >>> +	__le64 start_pfn : 52; /* start pfn for the bitmap */
> >>> +	__le64 page_shift : 6; /* page shift width, in bytes */
> 
> What does a page_shift "in bytes" mean? :)

Obviously, you know. :o
I will try to make it clear.

> 
> >>> +	__le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> >>
> >> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> >
> > Currently, we just use the 8 bytes and 0 bytes bitmap, should we support
> more than 64 bytes?
> 
> It just means that with this format, you end up wasting at least ~1/8th of the
> space with metadata.  That's a bit unfortunate, but I guess it's not fatal.
> 
> I'd definitely call it out in the patch description and make sure other folks take
> a look at it.

OK.

> 
> There's a somewhat easy fix, but that would make the qemu implementation
> more complicated: You could just have bmap_len==0x3f imply that there's
> another field that contains an extended bitmap length for when you need long
> bitmaps.
> 
> But, as you note, there's no need for it, so it's a matter of trading the extra
> complexity versus the desire to not habing to change the ABI again for longer
> (hopefully).
> 

Your suggestion still works without changing the current code, just reserve
 ' bmap_len==0x3f' for future extension, and it's not used by the current code.

> >>> +static int  mark_unused_pages(struct zone *zone,
> >>> +		unsigned long *unused_pages, unsigned long size,
> >>> +		int order, unsigned long *pos)
> >>> +{
> >>> +	unsigned long pfn, flags;
> >>> +	unsigned int t;
> >>> +	struct list_head *curr;
> >>> +	struct page_info_item *info;
> >>> +
> >>> +	if (zone_is_empty(zone))
> >>> +		return 0;
> >>> +
> >>> +	spin_lock_irqsave(&zone->lock, flags);
> >>> +
> >>> +	if (*pos + zone->free_area[order].nr_free > size)
> >>> +		return -ENOSPC;
> >>
> >> Urg, so this won't partially fill?  So, what the nr_free pages limit
> >> where we no longer fit in the kmalloc()'d buffer where this simply won't
> work?
> >
> > Yes.  My initial implementation is partially fill, it's better for the worst case.
> > I thought the above code is more efficient for most case ...
> > Do you think partially fill the bitmap is better?
> 
> Could you please answer the question I asked?
> 

For your question:
-------------------------------------------------------------------------------------------------------
>So, what the nr_free pages limit where we no longer fit in the kmalloc()'d buffer
> where this simply won't work?
------------------------------------------------------------------------------------------------------
No, if the buffer is not big enough to save 'nr_free'  pages, get_unused_pages() will return
'-ENOSPC', and the following code will try to allocate a 2x times size buffer for retrying,
until the proper size buffer is allocated. The current order will not be skipped unless the
buffer allocation failed.

> Because if you don't get this right, it could mean that there are system that
> simply *fail* here.

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199])
	by kanga.kvack.org (Postfix) with ESMTP id 2D1F76B0069
	for <linux-mm@kvack.org>; Mon,  5 Dec 2016 23:47:34 -0500 (EST)
Received: by mail-pf0-f199.google.com with SMTP id a8so533791496pfg.0
        for <linux-mm@kvack.org>; Mon, 05 Dec 2016 20:47:34 -0800 (PST)
Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24])
        by mx.google.com with ESMTPS id t1si17629552plb.188.2016.12.05.20.47.33
        for <linux-mm@kvack.org>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Mon, 05 Dec 2016 20:47:33 -0800 (PST)
From: "Li, Liang Z" <liang.z.li@intel.com>
Subject: RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused
 page info
Date: Tue, 6 Dec 2016 04:47:27 +0000
Message-ID: <F2CBF3009FA73547804AE4C663CAB28E3A12F831@shsmsx102.ccr.corp.intel.com>
References: <1480495397-23225-1-git-send-email-liang.z.li@intel.com>
 <1480495397-23225-6-git-send-email-liang.z.li@intel.com>
 <438dd41a-fdf1-2a77-ef9c-8c103f492b2f@intel.com>
 <F2CBF3009FA73547804AE4C663CAB28E3A12D814@shsmsx102.ccr.corp.intel.com>
 <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com>
In-Reply-To: <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com>
Content-Language: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Sender: owner-linux-mm@kvack.org
List-ID: <linux-mm.kvack.org>
To: "Hansen, Dave" <dave.hansen@intel.com>, "kvm@vger.kernel.org" <kvm@vger.kernel.org>
Cc: "virtualization@lists.linux-foundation.org" <virtualization@lists.linux-foundation.org>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, "linux-mm@kvack.org" <linux-mm@kvack.org>, "virtio-dev@lists.oasis-open.org" <virtio-dev@lists.oasis-open.org>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, "quintela@redhat.com" <quintela@redhat.com>, "dgilbert@redhat.com" <dgilbert@redhat.com>, "mst@redhat.com" <mst@redhat.com>, "jasowang@redhat.com" <jasowang@redhat.com>, "kirill.shutemov@linux.intel.com" <kirill.shutemov@linux.intel.com>, "akpm@linux-foundation.org" <akpm@linux-foundation.org>, "mhocko@suse.com" <mhocko@suse.com>, "pbonzini@redhat.com" <pbonzini@redhat.com>, Mel Gorman <mgorman@techsingularity.net>, Cornelia Huck <cornelia.huck@de.ibm.com>, Amit Shah <amit.shah@redhat.com>

> >>> +	mutex_lock(&vb->balloon_lock);
> >>> +
> >>> +	for (order =3D MAX_ORDER - 1; order >=3D 0; order--) {
> >>
> >> I scratched my head for a bit on this one.  Why are you walking over
> >> orders,
> >> *then* zones.  I *think* you're doing it because you can efficiently
> >> fill the bitmaps at a given order for all zones, then move to a new
> >> bitmap.  But, it would be interesting to document this.
> >
> > Yes, use the order is somewhat strange, but it's helpful to keep the AP=
I simple.
> > Do you think it's acceptable?
>=20
> Yeah, it's fine.  Just comment it, please.
>=20
Good!

> >>> +		if (ret =3D=3D -ENOSPC) {
> >>> +			void *new_resp_data;
> >>> +
> >>> +			new_resp_data =3D kmalloc(2 * vb->resp_buf_size,
> >>> +						GFP_KERNEL);
> >>> +			if (new_resp_data) {
> >>> +				kfree(vb->resp_data);
> >>> +				vb->resp_data =3D new_resp_data;
> >>> +				vb->resp_buf_size *=3D 2;
> >>
> >> What happens to the data in ->resp_data at this point?  Doesn't this
> >> just throw it away?
> >
> > Yes, so we should make sure the data in resp_data is not inuse.
>=20
> But doesn't it have valid data that we just collected and haven't told th=
e
> hypervisor about yet?  Aren't we throwing away good data that cost us
> something to collect?

Indeed.  Some filled data may exist for the previous zone. Should we
change the API to=20
'int get_unused_pages(unsigned long *unused_pages, unsigned long size,
		int order, unsigned long *pos, struct zone *zone)' ?

then we can use the 'zone' to record the zone to retry and not discard the
filled data.

> >> ...
> >>> +struct page_info_item {
> >>> +	__le64 start_pfn : 52; /* start pfn for the bitmap */
> >>> +	__le64 page_shift : 6; /* page shift width, in bytes */
>=20
> What does a page_shift "in bytes" mean? :)

Obviously, you know. :o
I will try to make it clear.

>=20
> >>> +	__le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> >>
> >> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> >
> > Currently, we just use the 8 bytes and 0 bytes bitmap, should we suppor=
t
> more than 64 bytes?
>=20
> It just means that with this format, you end up wasting at least ~1/8th o=
f the
> space with metadata.  That's a bit unfortunate, but I guess it's not fata=
l.
>=20
> I'd definitely call it out in the patch description and make sure other f=
olks take
> a look at it.

OK.

>=20
> There's a somewhat easy fix, but that would make the qemu implementation
> more complicated: You could just have bmap_len=3D=3D0x3f imply that there=
's
> another field that contains an extended bitmap length for when you need l=
ong
> bitmaps.
>=20
> But, as you note, there's no need for it, so it's a matter of trading the=
 extra
> complexity versus the desire to not habing to change the ABI again for lo=
nger
> (hopefully).
>=20

Your suggestion still works without changing the current code, just reserve
 ' bmap_len=3D=3D0x3f' for future extension, and it's not used by the curre=
nt code.

> >>> +static int  mark_unused_pages(struct zone *zone,
> >>> +		unsigned long *unused_pages, unsigned long size,
> >>> +		int order, unsigned long *pos)
> >>> +{
> >>> +	unsigned long pfn, flags;
> >>> +	unsigned int t;
> >>> +	struct list_head *curr;
> >>> +	struct page_info_item *info;
> >>> +
> >>> +	if (zone_is_empty(zone))
> >>> +		return 0;
> >>> +
> >>> +	spin_lock_irqsave(&zone->lock, flags);
> >>> +
> >>> +	if (*pos + zone->free_area[order].nr_free > size)
> >>> +		return -ENOSPC;
> >>
> >> Urg, so this won't partially fill?  So, what the nr_free pages limit
> >> where we no longer fit in the kmalloc()'d buffer where this simply won=
't
> work?
> >
> > Yes.  My initial implementation is partially fill, it's better for the =
worst case.
> > I thought the above code is more efficient for most case ...
> > Do you think partially fill the bitmap is better?
>=20
> Could you please answer the question I asked?
>=20

For your question:
---------------------------------------------------------------------------=
----------------------------
>So, what the nr_free pages limit where we no longer fit in the kmalloc()'d=
 buffer
> where this simply won't work?
---------------------------------------------------------------------------=
---------------------------
No, if the buffer is not big enough to save 'nr_free'  pages, get_unused_pa=
ges() will return
'-ENOSPC', and the following code will try to allocate a 2x times size buff=
er for retrying,
until the proper size buffer is allocated. The current order will not be sk=
ipped unless the
buffer allocation failed.

> Because if you don't get this right, it could mean that there are system =
that
> simply *fail* here.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:60506)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <liang.z.li@intel.com>) id 1cE7fJ-0004fo-4C
	for qemu-devel@nongnu.org; Mon, 05 Dec 2016 23:47:42 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <liang.z.li@intel.com>) id 1cE7fG-0007jD-2B
	for qemu-devel@nongnu.org; Mon, 05 Dec 2016 23:47:41 -0500
Received: from mga03.intel.com ([134.134.136.65]:39567)
	by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
	(Exim 4.71) (envelope-from <liang.z.li@intel.com>)
	id 1cE7fF-0007iy-Mu
	for qemu-devel@nongnu.org; Mon, 05 Dec 2016 23:47:38 -0500
From: "Li, Liang Z" <liang.z.li@intel.com>
Date: Tue, 6 Dec 2016 04:47:27 +0000
Message-ID: <F2CBF3009FA73547804AE4C663CAB28E3A12F831@shsmsx102.ccr.corp.intel.com>
References: <1480495397-23225-1-git-send-email-liang.z.li@intel.com>
	<1480495397-23225-6-git-send-email-liang.z.li@intel.com>
	<438dd41a-fdf1-2a77-ef9c-8c103f492b2f@intel.com>
	<F2CBF3009FA73547804AE4C663CAB28E3A12D814@shsmsx102.ccr.corp.intel.com>
	<70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com>
In-Reply-To: <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com>
Content-Language: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [PATCH kernel v5 5/5] virtio-balloon: tell host
 vm's unused page info
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Hansen, Dave" <dave.hansen@intel.com>, "kvm@vger.kernel.org" <kvm@vger.kernel.org>
Cc: "virtualization@lists.linux-foundation.org" <virtualization@lists.linux-foundation.org>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, "linux-mm@kvack.org" <linux-mm@kvack.org>, "virtio-dev@lists.oasis-open.org" <virtio-dev@lists.oasis-open.org>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, "quintela@redhat.com" <quintela@redhat.com>, "dgilbert@redhat.com" <dgilbert@redhat.com>, "mst@redhat.com" <mst@redhat.com>, "jasowang@redhat.com" <jasowang@redhat.com>, "kirill.shutemov@linux.intel.com" <kirill.shutemov@linux.intel.com>, "akpm@linux-foundation.org" <akpm@linux-foundation.org>, "mhocko@suse.com" <mhocko@suse.com>, "pbonzini@redhat.com" <pbonzini@redhat.com>, Mel Gorman <mgorman@techsingularity.net>, Cornelia Huck <cornelia.huck@de.ibm.com>, Amit Shah <amit.shah@redhat.com>

> >>> +	mutex_lock(&vb->balloon_lock);
> >>> +
> >>> +	for (order =3D MAX_ORDER - 1; order >=3D 0; order--) {
> >>
> >> I scratched my head for a bit on this one.  Why are you walking over
> >> orders,
> >> *then* zones.  I *think* you're doing it because you can efficiently
> >> fill the bitmaps at a given order for all zones, then move to a new
> >> bitmap.  But, it would be interesting to document this.
> >
> > Yes, use the order is somewhat strange, but it's helpful to keep the AP=
I simple.
> > Do you think it's acceptable?
>=20
> Yeah, it's fine.  Just comment it, please.
>=20
Good!

> >>> +		if (ret =3D=3D -ENOSPC) {
> >>> +			void *new_resp_data;
> >>> +
> >>> +			new_resp_data =3D kmalloc(2 * vb->resp_buf_size,
> >>> +						GFP_KERNEL);
> >>> +			if (new_resp_data) {
> >>> +				kfree(vb->resp_data);
> >>> +				vb->resp_data =3D new_resp_data;
> >>> +				vb->resp_buf_size *=3D 2;
> >>
> >> What happens to the data in ->resp_data at this point?  Doesn't this
> >> just throw it away?
> >
> > Yes, so we should make sure the data in resp_data is not inuse.
>=20
> But doesn't it have valid data that we just collected and haven't told th=
e
> hypervisor about yet?  Aren't we throwing away good data that cost us
> something to collect?

Indeed.  Some filled data may exist for the previous zone. Should we
change the API to=20
'int get_unused_pages(unsigned long *unused_pages, unsigned long size,
		int order, unsigned long *pos, struct zone *zone)' ?

then we can use the 'zone' to record the zone to retry and not discard the
filled data.

> >> ...
> >>> +struct page_info_item {
> >>> +	__le64 start_pfn : 52; /* start pfn for the bitmap */
> >>> +	__le64 page_shift : 6; /* page shift width, in bytes */
>=20
> What does a page_shift "in bytes" mean? :)

Obviously, you know. :o
I will try to make it clear.

>=20
> >>> +	__le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> >>
> >> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> >
> > Currently, we just use the 8 bytes and 0 bytes bitmap, should we suppor=
t
> more than 64 bytes?
>=20
> It just means that with this format, you end up wasting at least ~1/8th o=
f the
> space with metadata.  That's a bit unfortunate, but I guess it's not fata=
l.
>=20
> I'd definitely call it out in the patch description and make sure other f=
olks take
> a look at it.

OK.

>=20
> There's a somewhat easy fix, but that would make the qemu implementation
> more complicated: You could just have bmap_len=3D=3D0x3f imply that there=
's
> another field that contains an extended bitmap length for when you need l=
ong
> bitmaps.
>=20
> But, as you note, there's no need for it, so it's a matter of trading the=
 extra
> complexity versus the desire to not habing to change the ABI again for lo=
nger
> (hopefully).
>=20

Your suggestion still works without changing the current code, just reserve
 ' bmap_len=3D=3D0x3f' for future extension, and it's not used by the curre=
nt code.

> >>> +static int  mark_unused_pages(struct zone *zone,
> >>> +		unsigned long *unused_pages, unsigned long size,
> >>> +		int order, unsigned long *pos)
> >>> +{
> >>> +	unsigned long pfn, flags;
> >>> +	unsigned int t;
> >>> +	struct list_head *curr;
> >>> +	struct page_info_item *info;
> >>> +
> >>> +	if (zone_is_empty(zone))
> >>> +		return 0;
> >>> +
> >>> +	spin_lock_irqsave(&zone->lock, flags);
> >>> +
> >>> +	if (*pos + zone->free_area[order].nr_free > size)
> >>> +		return -ENOSPC;
> >>
> >> Urg, so this won't partially fill?  So, what the nr_free pages limit
> >> where we no longer fit in the kmalloc()'d buffer where this simply won=
't
> work?
> >
> > Yes.  My initial implementation is partially fill, it's better for the =
worst case.
> > I thought the above code is more efficient for most case ...
> > Do you think partially fill the bitmap is better?
>=20
> Could you please answer the question I asked?
>=20

For your question:
---------------------------------------------------------------------------=
----------------------------
>So, what the nr_free pages limit where we no longer fit in the kmalloc()'d=
 buffer
> where this simply won't work?
---------------------------------------------------------------------------=
---------------------------
No, if the buffer is not big enough to save 'nr_free'  pages, get_unused_pa=
ges() will return
'-ENOSPC', and the following code will try to allocate a 2x times size buff=
er for retrying,
until the proper size buffer is allocated. The current order will not be sk=
ipped unless the
buffer allocation failed.

> Because if you don't get this right, it could mean that there are system =
that
> simply *fail* here.