From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752360AbcLFErg convert rfc822-to-8bit (ORCPT ); Mon, 5 Dec 2016 23:47:36 -0500 Received: from mga02.intel.com ([134.134.136.20]:28415 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751327AbcLFErd (ORCPT ); Mon, 5 Dec 2016 23:47:33 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.33,751,1477983600"; d="scan'208";a="1068425004" From: "Li, Liang Z" To: "Hansen, Dave" , "kvm@vger.kernel.org" CC: "virtualization@lists.linux-foundation.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "virtio-dev@lists.oasis-open.org" , "qemu-devel@nongnu.org" , "quintela@redhat.com" , "dgilbert@redhat.com" , "mst@redhat.com" , "jasowang@redhat.com" , "kirill.shutemov@linux.intel.com" , "akpm@linux-foundation.org" , "mhocko@suse.com" , "pbonzini@redhat.com" , Mel Gorman , Cornelia Huck , "Amit Shah" Subject: RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info Thread-Topic: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info Thread-Index: AQHSSuf7C044VX4sFkSRMSEWWM0WiqDxYEaAgAZgGRCAAVv/gIABHPvQ Date: Tue, 6 Dec 2016 04:47:27 +0000 Message-ID: References: <1480495397-23225-1-git-send-email-liang.z.li@intel.com> <1480495397-23225-6-git-send-email-liang.z.li@intel.com> <438dd41a-fdf1-2a77-ef9c-8c103f492b2f@intel.com> <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com> In-Reply-To: <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com> Accept-Language: zh-CN, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiZTk0ZTYwZTEtMDJhOC00ZTI3LTk5MWMtMGQ0OWI5MzI1NjY4IiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE1LjkuNi42IiwiVHJ1c3RlZExhYmVsSGFzaCI6IjE3ZUNlK2RRVkJFSStPMEFxcnl4NG9iSUpCWUFGdnRtMU1icEh3Q3BONnc9In0= x-ctpclassification: CTP_IC x-originating-ip: [10.239.127.40] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > >>> + mutex_lock(&vb->balloon_lock); > >>> + > >>> + for (order = MAX_ORDER - 1; order >= 0; order--) { > >> > >> I scratched my head for a bit on this one. Why are you walking over > >> orders, > >> *then* zones. I *think* you're doing it because you can efficiently > >> fill the bitmaps at a given order for all zones, then move to a new > >> bitmap. But, it would be interesting to document this. > > > > Yes, use the order is somewhat strange, but it's helpful to keep the API simple. > > Do you think it's acceptable? > > Yeah, it's fine. Just comment it, please. > Good! > >>> + if (ret == -ENOSPC) { > >>> + void *new_resp_data; > >>> + > >>> + new_resp_data = kmalloc(2 * vb->resp_buf_size, > >>> + GFP_KERNEL); > >>> + if (new_resp_data) { > >>> + kfree(vb->resp_data); > >>> + vb->resp_data = new_resp_data; > >>> + vb->resp_buf_size *= 2; > >> > >> What happens to the data in ->resp_data at this point? Doesn't this > >> just throw it away? > > > > Yes, so we should make sure the data in resp_data is not inuse. > > But doesn't it have valid data that we just collected and haven't told the > hypervisor about yet? Aren't we throwing away good data that cost us > something to collect? Indeed. Some filled data may exist for the previous zone. Should we change the API to 'int get_unused_pages(unsigned long *unused_pages, unsigned long size, int order, unsigned long *pos, struct zone *zone)' ? then we can use the 'zone' to record the zone to retry and not discard the filled data. > >> ... > >>> +struct page_info_item { > >>> + __le64 start_pfn : 52; /* start pfn for the bitmap */ > >>> + __le64 page_shift : 6; /* page shift width, in bytes */ > > What does a page_shift "in bytes" mean? :) Obviously, you know. :o I will try to make it clear. > > >>> + __le64 bmap_len : 6; /* bitmap length, in bytes */ }; > >> > >> Is 'bmap_len' too short? a 64-byte buffer is a bit tiny. Right? > > > > Currently, we just use the 8 bytes and 0 bytes bitmap, should we support > more than 64 bytes? > > It just means that with this format, you end up wasting at least ~1/8th of the > space with metadata. That's a bit unfortunate, but I guess it's not fatal. > > I'd definitely call it out in the patch description and make sure other folks take > a look at it. OK. > > There's a somewhat easy fix, but that would make the qemu implementation > more complicated: You could just have bmap_len==0x3f imply that there's > another field that contains an extended bitmap length for when you need long > bitmaps. > > But, as you note, there's no need for it, so it's a matter of trading the extra > complexity versus the desire to not habing to change the ABI again for longer > (hopefully). > Your suggestion still works without changing the current code, just reserve ' bmap_len==0x3f' for future extension, and it's not used by the current code. > >>> +static int mark_unused_pages(struct zone *zone, > >>> + unsigned long *unused_pages, unsigned long size, > >>> + int order, unsigned long *pos) > >>> +{ > >>> + unsigned long pfn, flags; > >>> + unsigned int t; > >>> + struct list_head *curr; > >>> + struct page_info_item *info; > >>> + > >>> + if (zone_is_empty(zone)) > >>> + return 0; > >>> + > >>> + spin_lock_irqsave(&zone->lock, flags); > >>> + > >>> + if (*pos + zone->free_area[order].nr_free > size) > >>> + return -ENOSPC; > >> > >> Urg, so this won't partially fill? So, what the nr_free pages limit > >> where we no longer fit in the kmalloc()'d buffer where this simply won't > work? > > > > Yes. My initial implementation is partially fill, it's better for the worst case. > > I thought the above code is more efficient for most case ... > > Do you think partially fill the bitmap is better? > > Could you please answer the question I asked? > For your question: ------------------------------------------------------------------------------------------------------- >So, what the nr_free pages limit where we no longer fit in the kmalloc()'d buffer > where this simply won't work? ------------------------------------------------------------------------------------------------------ No, if the buffer is not big enough to save 'nr_free' pages, get_unused_pages() will return '-ENOSPC', and the following code will try to allocate a 2x times size buffer for retrying, until the proper size buffer is allocated. The current order will not be skipped unless the buffer allocation failed. > Because if you don't get this right, it could mean that there are system that > simply *fail* here. From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Li, Liang Z" Subject: RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info Date: Tue, 6 Dec 2016 04:47:27 +0000 Message-ID: References: <1480495397-23225-1-git-send-email-liang.z.li@intel.com> <1480495397-23225-6-git-send-email-liang.z.li@intel.com> <438dd41a-fdf1-2a77-ef9c-8c103f492b2f@intel.com> <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Cc: "virtio-dev@lists.oasis-open.org" , "mhocko@suse.com" , Amit Shah , "mst@redhat.com" , "qemu-devel@nongnu.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "kirill.shutemov@linux.intel.com" , "pbonzini@redhat.com" , "akpm@linux-foundation.org" , "virtualization@lists.linux-foundation.org" , Mel Gorman , "dgilbert@redhat.com" To: "Hansen, Dave" , "kvm@vger.kernel.org" Return-path: In-Reply-To: <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com> Content-Language: en-US List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: virtualization-bounces@lists.linux-foundation.org Errors-To: virtualization-bounces@lists.linux-foundation.org List-Id: kvm.vger.kernel.org > >>> + mutex_lock(&vb->balloon_lock); > >>> + > >>> + for (order = MAX_ORDER - 1; order >= 0; order--) { > >> > >> I scratched my head for a bit on this one. Why are you walking over > >> orders, > >> *then* zones. I *think* you're doing it because you can efficiently > >> fill the bitmaps at a given order for all zones, then move to a new > >> bitmap. But, it would be interesting to document this. > > > > Yes, use the order is somewhat strange, but it's helpful to keep the API simple. > > Do you think it's acceptable? > > Yeah, it's fine. Just comment it, please. > Good! > >>> + if (ret == -ENOSPC) { > >>> + void *new_resp_data; > >>> + > >>> + new_resp_data = kmalloc(2 * vb->resp_buf_size, > >>> + GFP_KERNEL); > >>> + if (new_resp_data) { > >>> + kfree(vb->resp_data); > >>> + vb->resp_data = new_resp_data; > >>> + vb->resp_buf_size *= 2; > >> > >> What happens to the data in ->resp_data at this point? Doesn't this > >> just throw it away? > > > > Yes, so we should make sure the data in resp_data is not inuse. > > But doesn't it have valid data that we just collected and haven't told the > hypervisor about yet? Aren't we throwing away good data that cost us > something to collect? Indeed. Some filled data may exist for the previous zone. Should we change the API to 'int get_unused_pages(unsigned long *unused_pages, unsigned long size, int order, unsigned long *pos, struct zone *zone)' ? then we can use the 'zone' to record the zone to retry and not discard the filled data. > >> ... > >>> +struct page_info_item { > >>> + __le64 start_pfn : 52; /* start pfn for the bitmap */ > >>> + __le64 page_shift : 6; /* page shift width, in bytes */ > > What does a page_shift "in bytes" mean? :) Obviously, you know. :o I will try to make it clear. > > >>> + __le64 bmap_len : 6; /* bitmap length, in bytes */ }; > >> > >> Is 'bmap_len' too short? a 64-byte buffer is a bit tiny. Right? > > > > Currently, we just use the 8 bytes and 0 bytes bitmap, should we support > more than 64 bytes? > > It just means that with this format, you end up wasting at least ~1/8th of the > space with metadata. That's a bit unfortunate, but I guess it's not fatal. > > I'd definitely call it out in the patch description and make sure other folks take > a look at it. OK. > > There's a somewhat easy fix, but that would make the qemu implementation > more complicated: You could just have bmap_len==0x3f imply that there's > another field that contains an extended bitmap length for when you need long > bitmaps. > > But, as you note, there's no need for it, so it's a matter of trading the extra > complexity versus the desire to not habing to change the ABI again for longer > (hopefully). > Your suggestion still works without changing the current code, just reserve ' bmap_len==0x3f' for future extension, and it's not used by the current code. > >>> +static int mark_unused_pages(struct zone *zone, > >>> + unsigned long *unused_pages, unsigned long size, > >>> + int order, unsigned long *pos) > >>> +{ > >>> + unsigned long pfn, flags; > >>> + unsigned int t; > >>> + struct list_head *curr; > >>> + struct page_info_item *info; > >>> + > >>> + if (zone_is_empty(zone)) > >>> + return 0; > >>> + > >>> + spin_lock_irqsave(&zone->lock, flags); > >>> + > >>> + if (*pos + zone->free_area[order].nr_free > size) > >>> + return -ENOSPC; > >> > >> Urg, so this won't partially fill? So, what the nr_free pages limit > >> where we no longer fit in the kmalloc()'d buffer where this simply won't > work? > > > > Yes. My initial implementation is partially fill, it's better for the worst case. > > I thought the above code is more efficient for most case ... > > Do you think partially fill the bitmap is better? > > Could you please answer the question I asked? > For your question: ------------------------------------------------------------------------------------------------------- >So, what the nr_free pages limit where we no longer fit in the kmalloc()'d buffer > where this simply won't work? ------------------------------------------------------------------------------------------------------ No, if the buffer is not big enough to save 'nr_free' pages, get_unused_pages() will return '-ENOSPC', and the following code will try to allocate a 2x times size buffer for retrying, until the proper size buffer is allocated. The current order will not be skipped unless the buffer allocation failed. > Because if you don't get this right, it could mean that there are system that > simply *fail* here. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 2D1F76B0069 for ; Mon, 5 Dec 2016 23:47:34 -0500 (EST) Received: by mail-pf0-f199.google.com with SMTP id a8so533791496pfg.0 for ; Mon, 05 Dec 2016 20:47:34 -0800 (PST) Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTPS id t1si17629552plb.188.2016.12.05.20.47.33 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 05 Dec 2016 20:47:33 -0800 (PST) From: "Li, Liang Z" Subject: RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info Date: Tue, 6 Dec 2016 04:47:27 +0000 Message-ID: References: <1480495397-23225-1-git-send-email-liang.z.li@intel.com> <1480495397-23225-6-git-send-email-liang.z.li@intel.com> <438dd41a-fdf1-2a77-ef9c-8c103f492b2f@intel.com> <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com> In-Reply-To: <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com> Content-Language: en-US Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: "Hansen, Dave" , "kvm@vger.kernel.org" Cc: "virtualization@lists.linux-foundation.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "virtio-dev@lists.oasis-open.org" , "qemu-devel@nongnu.org" , "quintela@redhat.com" , "dgilbert@redhat.com" , "mst@redhat.com" , "jasowang@redhat.com" , "kirill.shutemov@linux.intel.com" , "akpm@linux-foundation.org" , "mhocko@suse.com" , "pbonzini@redhat.com" , Mel Gorman , Cornelia Huck , Amit Shah > >>> + mutex_lock(&vb->balloon_lock); > >>> + > >>> + for (order =3D MAX_ORDER - 1; order >=3D 0; order--) { > >> > >> I scratched my head for a bit on this one. Why are you walking over > >> orders, > >> *then* zones. I *think* you're doing it because you can efficiently > >> fill the bitmaps at a given order for all zones, then move to a new > >> bitmap. But, it would be interesting to document this. > > > > Yes, use the order is somewhat strange, but it's helpful to keep the AP= I simple. > > Do you think it's acceptable? >=20 > Yeah, it's fine. Just comment it, please. >=20 Good! > >>> + if (ret =3D=3D -ENOSPC) { > >>> + void *new_resp_data; > >>> + > >>> + new_resp_data =3D kmalloc(2 * vb->resp_buf_size, > >>> + GFP_KERNEL); > >>> + if (new_resp_data) { > >>> + kfree(vb->resp_data); > >>> + vb->resp_data =3D new_resp_data; > >>> + vb->resp_buf_size *=3D 2; > >> > >> What happens to the data in ->resp_data at this point? Doesn't this > >> just throw it away? > > > > Yes, so we should make sure the data in resp_data is not inuse. >=20 > But doesn't it have valid data that we just collected and haven't told th= e > hypervisor about yet? Aren't we throwing away good data that cost us > something to collect? Indeed. Some filled data may exist for the previous zone. Should we change the API to=20 'int get_unused_pages(unsigned long *unused_pages, unsigned long size, int order, unsigned long *pos, struct zone *zone)' ? then we can use the 'zone' to record the zone to retry and not discard the filled data. > >> ... > >>> +struct page_info_item { > >>> + __le64 start_pfn : 52; /* start pfn for the bitmap */ > >>> + __le64 page_shift : 6; /* page shift width, in bytes */ >=20 > What does a page_shift "in bytes" mean? :) Obviously, you know. :o I will try to make it clear. >=20 > >>> + __le64 bmap_len : 6; /* bitmap length, in bytes */ }; > >> > >> Is 'bmap_len' too short? a 64-byte buffer is a bit tiny. Right? > > > > Currently, we just use the 8 bytes and 0 bytes bitmap, should we suppor= t > more than 64 bytes? >=20 > It just means that with this format, you end up wasting at least ~1/8th o= f the > space with metadata. That's a bit unfortunate, but I guess it's not fata= l. >=20 > I'd definitely call it out in the patch description and make sure other f= olks take > a look at it. OK. >=20 > There's a somewhat easy fix, but that would make the qemu implementation > more complicated: You could just have bmap_len=3D=3D0x3f imply that there= 's > another field that contains an extended bitmap length for when you need l= ong > bitmaps. >=20 > But, as you note, there's no need for it, so it's a matter of trading the= extra > complexity versus the desire to not habing to change the ABI again for lo= nger > (hopefully). >=20 Your suggestion still works without changing the current code, just reserve ' bmap_len=3D=3D0x3f' for future extension, and it's not used by the curre= nt code. > >>> +static int mark_unused_pages(struct zone *zone, > >>> + unsigned long *unused_pages, unsigned long size, > >>> + int order, unsigned long *pos) > >>> +{ > >>> + unsigned long pfn, flags; > >>> + unsigned int t; > >>> + struct list_head *curr; > >>> + struct page_info_item *info; > >>> + > >>> + if (zone_is_empty(zone)) > >>> + return 0; > >>> + > >>> + spin_lock_irqsave(&zone->lock, flags); > >>> + > >>> + if (*pos + zone->free_area[order].nr_free > size) > >>> + return -ENOSPC; > >> > >> Urg, so this won't partially fill? So, what the nr_free pages limit > >> where we no longer fit in the kmalloc()'d buffer where this simply won= 't > work? > > > > Yes. My initial implementation is partially fill, it's better for the = worst case. > > I thought the above code is more efficient for most case ... > > Do you think partially fill the bitmap is better? >=20 > Could you please answer the question I asked? >=20 For your question: ---------------------------------------------------------------------------= ---------------------------- >So, what the nr_free pages limit where we no longer fit in the kmalloc()'d= buffer > where this simply won't work? ---------------------------------------------------------------------------= --------------------------- No, if the buffer is not big enough to save 'nr_free' pages, get_unused_pa= ges() will return '-ENOSPC', and the following code will try to allocate a 2x times size buff= er for retrying, until the proper size buffer is allocated. The current order will not be sk= ipped unless the buffer allocation failed. > Because if you don't get this right, it could mean that there are system = that > simply *fail* here. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:60506) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cE7fJ-0004fo-4C for qemu-devel@nongnu.org; Mon, 05 Dec 2016 23:47:42 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1cE7fG-0007jD-2B for qemu-devel@nongnu.org; Mon, 05 Dec 2016 23:47:41 -0500 Received: from mga03.intel.com ([134.134.136.65]:39567) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1cE7fF-0007iy-Mu for qemu-devel@nongnu.org; Mon, 05 Dec 2016 23:47:38 -0500 From: "Li, Liang Z" Date: Tue, 6 Dec 2016 04:47:27 +0000 Message-ID: References: <1480495397-23225-1-git-send-email-liang.z.li@intel.com> <1480495397-23225-6-git-send-email-liang.z.li@intel.com> <438dd41a-fdf1-2a77-ef9c-8c103f492b2f@intel.com> <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com> In-Reply-To: <70ece7a5-348b-2eb9-c40a-f21b08df042c@intel.com> Content-Language: en-US Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [Qemu-devel] [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Hansen, Dave" , "kvm@vger.kernel.org" Cc: "virtualization@lists.linux-foundation.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "virtio-dev@lists.oasis-open.org" , "qemu-devel@nongnu.org" , "quintela@redhat.com" , "dgilbert@redhat.com" , "mst@redhat.com" , "jasowang@redhat.com" , "kirill.shutemov@linux.intel.com" , "akpm@linux-foundation.org" , "mhocko@suse.com" , "pbonzini@redhat.com" , Mel Gorman , Cornelia Huck , Amit Shah > >>> + mutex_lock(&vb->balloon_lock); > >>> + > >>> + for (order =3D MAX_ORDER - 1; order >=3D 0; order--) { > >> > >> I scratched my head for a bit on this one. Why are you walking over > >> orders, > >> *then* zones. I *think* you're doing it because you can efficiently > >> fill the bitmaps at a given order for all zones, then move to a new > >> bitmap. But, it would be interesting to document this. > > > > Yes, use the order is somewhat strange, but it's helpful to keep the AP= I simple. > > Do you think it's acceptable? >=20 > Yeah, it's fine. Just comment it, please. >=20 Good! > >>> + if (ret =3D=3D -ENOSPC) { > >>> + void *new_resp_data; > >>> + > >>> + new_resp_data =3D kmalloc(2 * vb->resp_buf_size, > >>> + GFP_KERNEL); > >>> + if (new_resp_data) { > >>> + kfree(vb->resp_data); > >>> + vb->resp_data =3D new_resp_data; > >>> + vb->resp_buf_size *=3D 2; > >> > >> What happens to the data in ->resp_data at this point? Doesn't this > >> just throw it away? > > > > Yes, so we should make sure the data in resp_data is not inuse. >=20 > But doesn't it have valid data that we just collected and haven't told th= e > hypervisor about yet? Aren't we throwing away good data that cost us > something to collect? Indeed. Some filled data may exist for the previous zone. Should we change the API to=20 'int get_unused_pages(unsigned long *unused_pages, unsigned long size, int order, unsigned long *pos, struct zone *zone)' ? then we can use the 'zone' to record the zone to retry and not discard the filled data. > >> ... > >>> +struct page_info_item { > >>> + __le64 start_pfn : 52; /* start pfn for the bitmap */ > >>> + __le64 page_shift : 6; /* page shift width, in bytes */ >=20 > What does a page_shift "in bytes" mean? :) Obviously, you know. :o I will try to make it clear. >=20 > >>> + __le64 bmap_len : 6; /* bitmap length, in bytes */ }; > >> > >> Is 'bmap_len' too short? a 64-byte buffer is a bit tiny. Right? > > > > Currently, we just use the 8 bytes and 0 bytes bitmap, should we suppor= t > more than 64 bytes? >=20 > It just means that with this format, you end up wasting at least ~1/8th o= f the > space with metadata. That's a bit unfortunate, but I guess it's not fata= l. >=20 > I'd definitely call it out in the patch description and make sure other f= olks take > a look at it. OK. >=20 > There's a somewhat easy fix, but that would make the qemu implementation > more complicated: You could just have bmap_len=3D=3D0x3f imply that there= 's > another field that contains an extended bitmap length for when you need l= ong > bitmaps. >=20 > But, as you note, there's no need for it, so it's a matter of trading the= extra > complexity versus the desire to not habing to change the ABI again for lo= nger > (hopefully). >=20 Your suggestion still works without changing the current code, just reserve ' bmap_len=3D=3D0x3f' for future extension, and it's not used by the curre= nt code. > >>> +static int mark_unused_pages(struct zone *zone, > >>> + unsigned long *unused_pages, unsigned long size, > >>> + int order, unsigned long *pos) > >>> +{ > >>> + unsigned long pfn, flags; > >>> + unsigned int t; > >>> + struct list_head *curr; > >>> + struct page_info_item *info; > >>> + > >>> + if (zone_is_empty(zone)) > >>> + return 0; > >>> + > >>> + spin_lock_irqsave(&zone->lock, flags); > >>> + > >>> + if (*pos + zone->free_area[order].nr_free > size) > >>> + return -ENOSPC; > >> > >> Urg, so this won't partially fill? So, what the nr_free pages limit > >> where we no longer fit in the kmalloc()'d buffer where this simply won= 't > work? > > > > Yes. My initial implementation is partially fill, it's better for the = worst case. > > I thought the above code is more efficient for most case ... > > Do you think partially fill the bitmap is better? >=20 > Could you please answer the question I asked? >=20 For your question: ---------------------------------------------------------------------------= ---------------------------- >So, what the nr_free pages limit where we no longer fit in the kmalloc()'d= buffer > where this simply won't work? ---------------------------------------------------------------------------= --------------------------- No, if the buffer is not big enough to save 'nr_free' pages, get_unused_pa= ges() will return '-ENOSPC', and the following code will try to allocate a 2x times size buff= er for retrying, until the proper size buffer is allocated. The current order will not be sk= ipped unless the buffer allocation failed. > Because if you don't get this right, it could mean that there are system = that > simply *fail* here.