* [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
@ 2012-11-06  1:31 ` Jiang Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-06  1:31 UTC (permalink / raw)
  To: Andrew Morton, Maciej Rutecki
  Cc: Jiang Liu, Jianguo Wu, Chris Clayton, Rafael J. Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	linux-mm, linux-kernel

Changeset 7f1290f2f2 tries to fix an issue in the calculation of
zone->present_pages, but it causes a regression on 32-bit systems with
HIGHMEM. With that changeset, reset_zone_present_pages() resets every
zone->present_pages to zero, and fixup_zone_present_pages() is called to
recalculate zone->present_pages when the bootmem allocator frees core
memory pages into the buddy allocator. Because highmem pages are never
freed by the bootmem allocator, all highmem zones' present_pages end up
as zero.

Actually there's no need to recalculate present_pages for highmem zones,
because the bootmem allocator never allocates pages from them. So fix
the regression by skipping highmem zones in reset_zone_present_pages()
and fixup_zone_present_pages().

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Reported-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

---

Hi Maciej,
	Thanks for reporting and bisecting. We have analyzed the regression
and worked out a patch for it. Could you please help verify whether it
fixes the regression?
	Thanks!
	Gerry

---
 mm/page_alloc.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b74de6..2311f15 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
 	for_each_node_state(nid, N_HIGH_MEMORY) {
 		for (i = 0; i < MAX_NR_ZONES; i++) {
 			z = NODE_DATA(nid)->node_zones + i;
-			z->present_pages = 0;
+			if (!is_highmem(z))
+				z->present_pages = 0;
 		}
 	}
 }
@@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
 
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		z = NODE_DATA(nid)->node_zones + i;
+		if (is_highmem(z))
+			continue;
+
 		zone_start_pfn = z->zone_start_pfn;
 		zone_end_pfn = zone_start_pfn + z->spanned_pages;
-
-		/* if the two regions intersect */
 		if (!(zone_start_pfn >= end_pfn	|| zone_end_pfn <= start_pfn))
 			z->present_pages += min(end_pfn, zone_end_pfn) -
 					    max(start_pfn, zone_start_pfn);
-- 
1.7.1
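
For reference, the accounting in the fixup_zone_present_pages() hunk above
boils down to an interval intersection between the freed pfn range and the
zone's pfn range. A minimal, self-contained sketch of that calculation
(plain C for illustration, not the kernel code itself):

/*
 * Pages a freed range [start_pfn, end_pfn) contributes to a zone spanning
 * [zone_start_pfn, zone_end_pfn): only the overlap counts.
 */
static unsigned long range_pages_in_zone(unsigned long zone_start_pfn,
					 unsigned long zone_end_pfn,
					 unsigned long start_pfn,
					 unsigned long end_pfn)
{
	/* disjoint ranges contribute nothing */
	if (zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn)
		return 0;

	/* otherwise clamp the freed range to the zone boundaries */
	return (end_pfn < zone_end_pfn ? end_pfn : zone_end_pfn) -
	       (start_pfn > zone_start_pfn ? start_pfn : zone_start_pfn);
}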



* Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
  2012-11-06  1:31 ` Jiang Liu
@ 2012-11-06 10:23   ` Chris Clayton
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Clayton @ 2012-11-06 10:23 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Maciej Rutecki, Jianguo Wu, Rafael J. Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	linux-mm, linux-kernel



On 11/06/12 01:31, Jiang Liu wrote:
> Changeset 7f1290f2f2 tries to fix a issue when calculating
> zone->present_pages, but it causes a regression to 32bit systems with
> HIGHMEM. With that changeset, function reset_zone_present_pages()
> resets all zone->present_pages to zero, and fixup_zone_present_pages()
> is called to recalculate zone->present_pages when boot allocator frees
> core memory pages into buddy allocator. Because highmem pages are not
> freed by bootmem allocator, all highmem zones' present_pages becomes
> zero.
>
> Actually there's no need to recalculate present_pages for highmem zone
> because bootmem allocator never allocates pages from them. So fix the
> regression by skipping highmem in function reset_zone_present_pages()
> and fixup_zone_present_pages().
>
> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Reported-by: Maciej Rutecki <maciej.rutecki@gmail.com>
> Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com>
> Cc: Chris Clayton <chris2553@googlemail.com>
> Cc: Rafael J. Wysocki <rjw@sisk.pl>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
>
> ---
>
> Hi Maciej,
> 	Thanks for reporting and bisecting. We have analyzed the regression
> and worked out a patch for it. Could you please help to verify whether it
> fix the regression?
> 	Thanks!
> 	Gerry
>

Thanks Gerry.

I've applied this patch to 3.7.0-rc4 and can confirm that it fixes the 
problem I had with my laptop failing to resume after a suspend to disk.

Tested-by: Chris Clayton <chris2553@googlemail.com>

> ---
>   mm/page_alloc.c |    8 +++++---
>   1 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b74de6..2311f15 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>   	for_each_node_state(nid, N_HIGH_MEMORY) {
>   		for (i = 0; i < MAX_NR_ZONES; i++) {
>   			z = NODE_DATA(nid)->node_zones + i;
> -			z->present_pages = 0;
> +			if (!is_highmem(z))
> +				z->present_pages = 0;
>   		}
>   	}
>   }
> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>
>   	for (i = 0; i < MAX_NR_ZONES; i++) {
>   		z = NODE_DATA(nid)->node_zones + i;
> +		if (is_highmem(z))
> +			continue;
> +
>   		zone_start_pfn = z->zone_start_pfn;
>   		zone_end_pfn = zone_start_pfn + z->spanned_pages;
> -
> -		/* if the two regions intersect */
>   		if (!(zone_start_pfn >= end_pfn	|| zone_end_pfn <= start_pfn))
>   			z->present_pages += min(end_pfn, zone_end_pfn) -
>   					    max(start_pfn, zone_start_pfn);
>

* Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
  2012-11-06  1:31 ` Jiang Liu
@ 2012-11-06 20:43   ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-06 20:43 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Maciej Rutecki, Jianguo Wu, Chris Clayton, Rafael J. Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	linux-mm, linux-kernel

On Tue, 6 Nov 2012 09:31:57 +0800
Jiang Liu <jiang.liu@huawei.com> wrote:

> Changeset 7f1290f2f2 tries to fix a issue when calculating
> zone->present_pages, but it causes a regression to 32bit systems with
> HIGHMEM. With that changeset, function reset_zone_present_pages()
> resets all zone->present_pages to zero, and fixup_zone_present_pages()
> is called to recalculate zone->present_pages when boot allocator frees
> core memory pages into buddy allocator. Because highmem pages are not
> freed by bootmem allocator, all highmem zones' present_pages becomes
> zero.
> 
> Actually there's no need to recalculate present_pages for highmem zone
> because bootmem allocator never allocates pages from them. So fix the
> regression by skipping highmem in function reset_zone_present_pages()
> and fixup_zone_present_pages().
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>  	for_each_node_state(nid, N_HIGH_MEMORY) {
>  		for (i = 0; i < MAX_NR_ZONES; i++) {
>  			z = NODE_DATA(nid)->node_zones + i;
> -			z->present_pages = 0;
> +			if (!is_highmem(z))
> +				z->present_pages = 0;
>  		}
>  	}
>  }
> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>  
>  	for (i = 0; i < MAX_NR_ZONES; i++) {
>  		z = NODE_DATA(nid)->node_zones + i;
> +		if (is_highmem(z))
> +			continue;
> +
>  		zone_start_pfn = z->zone_start_pfn;
>  		zone_end_pfn = zone_start_pfn + z->spanned_pages;
> -
> -		/* if the two regions intersect */
>  		if (!(zone_start_pfn >= end_pfn	|| zone_end_pfn <= start_pfn))
>  			z->present_pages += min(end_pfn, zone_end_pfn) -
>  					    max(start_pfn, zone_start_pfn);

This ...  isn't very nice.  It embeds within
reset_zone_present_pages() and fixup_zone_present_pages() knowledge
about their callers' state.  Or, more specifically, it is embedding
knowledge about the overall state of the system when these functions
are called.

I mean, a function called "reset_zone_present_pages" should reset
->present_pages!

The fact that fixup_zone_present_pages() has multiple call sites makes
this all even more risky.  And what are the interactions between this
and memory hotplug?

Can we find a cleaner fix?

Please tell us more about what's happening here.  Is it the case that
reset_zone_present_pages() is being called *after* highmem has been
populated?  If so, then fixup_zone_present_pages() should work
correctly for highmem?  Or is it the case that highmem hasn't yet been
setup?  IOW, what is the sequence of operations here?

Is the problem that we're *missing* a call to
fixup_zone_present_pages(), perhaps?  If we call
fixup_zone_present_pages() after highmem has been populated,
fixup_zone_present_pages() should correctly fill in the highmem zone's
->present_pages?



* Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
  2012-11-06 20:43   ` Andrew Morton
@ 2012-11-14 14:52     ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-14 14:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiang Liu, Maciej Rutecki, Jianguo Wu, Chris Clayton,
	Rafael J. Wysocki, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
	Michal Hocko, linux-mm, linux-kernel, Daniel Vetter

On 11/07/2012 04:43 AM, Andrew Morton wrote:
> On Tue, 6 Nov 2012 09:31:57 +0800
> Jiang Liu <jiang.liu@huawei.com> wrote:
> 
>> Changeset 7f1290f2f2 tries to fix a issue when calculating
>> zone->present_pages, but it causes a regression to 32bit systems with
>> HIGHMEM. With that changeset, function reset_zone_present_pages()
>> resets all zone->present_pages to zero, and fixup_zone_present_pages()
>> is called to recalculate zone->present_pages when boot allocator frees
>> core memory pages into buddy allocator. Because highmem pages are not
>> freed by bootmem allocator, all highmem zones' present_pages becomes
>> zero.
>>
>> Actually there's no need to recalculate present_pages for highmem zone
>> because bootmem allocator never allocates pages from them. So fix the
>> regression by skipping highmem in function reset_zone_present_pages()
>> and fixup_zone_present_pages().
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>>  	for_each_node_state(nid, N_HIGH_MEMORY) {
>>  		for (i = 0; i < MAX_NR_ZONES; i++) {
>>  			z = NODE_DATA(nid)->node_zones + i;
>> -			z->present_pages = 0;
>> +			if (!is_highmem(z))
>> +				z->present_pages = 0;
>>  		}
>>  	}
>>  }
>> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>>  
>>  	for (i = 0; i < MAX_NR_ZONES; i++) {
>>  		z = NODE_DATA(nid)->node_zones + i;
>> +		if (is_highmem(z))
>> +			continue;
>> +
>>  		zone_start_pfn = z->zone_start_pfn;
>>  		zone_end_pfn = zone_start_pfn + z->spanned_pages;
>> -
>> -		/* if the two regions intersect */
>>  		if (!(zone_start_pfn >= end_pfn	|| zone_end_pfn <= start_pfn))
>>  			z->present_pages += min(end_pfn, zone_end_pfn) -
>>  					    max(start_pfn, zone_start_pfn);
> 
> This ...  isn't very nice.  It is embeds within
> reset_zone_present_pages() and fixup_zone_present_pages() knowledge
> about their caller's state.  Or, more specifically, it is emebedding
> knowledge about the overall state of the system when these functions
> are called.
> 
> I mean, a function called "reset_zone_present_pages" should reset
> ->present_pages!
> 
> The fact that fixup_zone_present_page() has multiple call sites makes
> this all even more risky.  And what are the interactions between this
> and memory hotplug?
> 
> Can we find a cleaner fix?
> 
> Please tell us more about what's happening here.  Is it the case that
> reset_zone_present_pages() is being called *after* highmem has been
> populated?  If so, then fixup_zone_present_pages() should work
> correctly for highmem?  Or is it the case that highmem hasn't yet been
> setup?  IOW, what is the sequence of operations here?
> 
> Is the problem that we're *missing* a call to
> fixup_zone_present_pages(), perhaps?  If we call
> fixup_zone_present_pages() after highmem has been populated,
> fixup_zone_present_pages() should correctly fill in the highmem zone's
> ->present_pages?
Hi Andrew,
	Sorry for the late response :(
	I have done more investigation based on your suggestions. Currently
fixup_zone_present_pages() is only called for memory freed by the bootmem
allocator, so HIGHMEM pages are missed. We could also call
fixup_zone_present_pages() for HIGHMEM pages, but that would require
changing arch-specific code for x86, powerpc, sparc, microblaze, arm, mips,
um, tile etc., which seems like quite a bit of churn.
	And sadly enough, I found the quick fix is still incomplete. The original
patch has another issue: reset_zone_present_pages() is only called for IA64,
so it will cause trouble for the other arches which make use of bootmem.c.
	So I felt a little guilty and tried to find a cleaner solution without
touching arch-specific code, but things are more complex than I expected
and I'm still working on that.
	How about completely reverting changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125,
and I will post another version once I find a cleaner way?
	Thanks!
	Gerry
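
To make the arch-change concern above concrete, each architecture's highmem
init path would need roughly one extra call after it releases its highmem
pages. A hypothetical sketch (only the fixup_zone_present_pages() call is
real; the surrounding function and helper name are made up for illustration):

static void __init example_arch_free_highmem(int nid,
					     unsigned long highstart_pfn,
					     unsigned long highend_pfn)
{
	/* existing per-arch code frees highmem pages to the buddy allocator */
	arch_free_highmem_range(nid, highstart_pfn, highend_pfn); /* placeholder */

	/* the extra accounting call every arch would then need */
	fixup_zone_present_pages(nid, highstart_pfn, highend_pfn);
}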



* Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
  2012-11-14 14:52     ` Jiang Liu
@ 2012-11-15  9:22       ` Wen Congyang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wen Congyang @ 2012-11-15  9:22 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Jiang Liu, Maciej Rutecki, Jianguo Wu,
	Chris Clayton, Rafael J. Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel,
	Daniel Vetter

Hi, Liu Jiang

At 11/14/2012 10:52 PM, Jiang Liu Wrote:
> On 11/07/2012 04:43 AM, Andrew Morton wrote:
>> On Tue, 6 Nov 2012 09:31:57 +0800
>> Jiang Liu <jiang.liu@huawei.com> wrote:
>>
>>> Changeset 7f1290f2f2 tries to fix a issue when calculating
>>> zone->present_pages, but it causes a regression to 32bit systems with
>>> HIGHMEM. With that changeset, function reset_zone_present_pages()
>>> resets all zone->present_pages to zero, and fixup_zone_present_pages()
>>> is called to recalculate zone->present_pages when boot allocator frees
>>> core memory pages into buddy allocator. Because highmem pages are not
>>> freed by bootmem allocator, all highmem zones' present_pages becomes
>>> zero.
>>>
>>> Actually there's no need to recalculate present_pages for highmem zone
>>> because bootmem allocator never allocates pages from them. So fix the
>>> regression by skipping highmem in function reset_zone_present_pages()
>>> and fixup_zone_present_pages().
>>>
>>> ...
>>>
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>>>  	for_each_node_state(nid, N_HIGH_MEMORY) {
>>>  		for (i = 0; i < MAX_NR_ZONES; i++) {
>>>  			z = NODE_DATA(nid)->node_zones + i;
>>> -			z->present_pages = 0;
>>> +			if (!is_highmem(z))
>>> +				z->present_pages = 0;
>>>  		}
>>>  	}
>>>  }
>>> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>>>  
>>>  	for (i = 0; i < MAX_NR_ZONES; i++) {
>>>  		z = NODE_DATA(nid)->node_zones + i;
>>> +		if (is_highmem(z))
>>> +			continue;
>>> +
>>>  		zone_start_pfn = z->zone_start_pfn;
>>>  		zone_end_pfn = zone_start_pfn + z->spanned_pages;
>>> -
>>> -		/* if the two regions intersect */
>>>  		if (!(zone_start_pfn >= end_pfn	|| zone_end_pfn <= start_pfn))
>>>  			z->present_pages += min(end_pfn, zone_end_pfn) -
>>>  					    max(start_pfn, zone_start_pfn);
>>
>> This ...  isn't very nice.  It is embeds within
>> reset_zone_present_pages() and fixup_zone_present_pages() knowledge
>> about their caller's state.  Or, more specifically, it is emebedding
>> knowledge about the overall state of the system when these functions
>> are called.
>>
>> I mean, a function called "reset_zone_present_pages" should reset
>> ->present_pages!
>>
>> The fact that fixup_zone_present_page() has multiple call sites makes
>> this all even more risky.  And what are the interactions between this
>> and memory hotplug?
>>
>> Can we find a cleaner fix?
>>
>> Please tell us more about what's happening here.  Is it the case that
>> reset_zone_present_pages() is being called *after* highmem has been
>> populated?  If so, then fixup_zone_present_pages() should work
>> correctly for highmem?  Or is it the case that highmem hasn't yet been
>> setup?  IOW, what is the sequence of operations here?
>>
>> Is the problem that we're *missing* a call to
>> fixup_zone_present_pages(), perhaps?  If we call
>> fixup_zone_present_pages() after highmem has been populated,
>> fixup_zone_present_pages() should correctly fill in the highmem zone's
>> ->present_pages?
> Hi Andrew,
> 	Sorry for the late response:(
> 	I have done more investigations according to your suggestions. Currently
> we have only called fixup_zone_present_pages() for memory freed by bootmem
> allocator and missed HIGHMEM pages. We could also call fixup_zone_present_pages()
> for HIGHMEM pages, but that will need to change arch specific code for x86, powerpc,
> sparc, microblaze, arm, mips, um and tile etc. Seems a little overhead.
> 	And sadly enough, I found the quick fix is still incomplete. The original
> patch still have another issue that, reset_zone_present_pages() is only called
> for IA64, so it will cause trouble for other arches which make use of "bootmem.c".
> 	Then I feel a little guilty and tried to find a cleaner solution without
> touching arch specific code. But things are more complex than my expectation and
> I'm still working on that.
> 	So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
> and I will post another version once I found a cleaner way?

I think fixup_zone_present_pages() is very useful for memory hotplug.

We calculate zone->present_pages in free_area_init_core(), but its value is
wrong. That is why we fix it up in fixup_zone_present_pages().

What about this:
1. init zone->present_pages to the present pages in this zone (including bootmem)
2. don't reset zone->present_pages for HIGHMEM zones

We don't allocate bootmem from HIGHMEM, so its present_pages is initialized
in step 1 and there is no need to fix it up in step 2.

Is it OK?

If it is OK, I will resend the patch for step 1 (the patch is from laijs).

Thanks
Wen Congyang
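
A minimal sketch of the two-step proposal above (the helper names and the
value passed in step 1 are assumptions for illustration, not actual kernel
code):

/* step 1: initialise present_pages to the full count, bootmem included */
static void __init sketch_init_zone(struct zone *zone,
				    unsigned long present_incl_bootmem)
{
	zone->present_pages = present_incl_bootmem;
}

/* step 2: never zero highmem zones when resetting */
static void sketch_reset_zone(struct zone *zone)
{
	if (!is_highmem(zone))
		zone->present_pages = 0;
}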

> 	Thanks!
> 	Gerry
> 
> 


* Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
  2012-11-15  9:22       ` Wen Congyang
@ 2012-11-15 11:28         ` Bob Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Bob Liu @ 2012-11-15 11:28 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Jiang Liu, Andrew Morton, Jiang Liu, Maciej Rutecki, Jianguo Wu,
	Chris Clayton, Rafael J. Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel,
	Daniel Vetter

On Thu, Nov 15, 2012 at 5:22 PM, Wen Congyang <wency@cn.fujitsu.com> wrote:
> Hi, Liu Jiang
>
> At 11/14/2012 10:52 PM, Jiang Liu Wrote:
>> On 11/07/2012 04:43 AM, Andrew Morton wrote:
>>> On Tue, 6 Nov 2012 09:31:57 +0800
>>> Jiang Liu <jiang.liu@huawei.com> wrote:
>>>
>>>> Changeset 7f1290f2f2 tries to fix a issue when calculating
>>>> zone->present_pages, but it causes a regression to 32bit systems with
>>>> HIGHMEM. With that changeset, function reset_zone_present_pages()
>>>> resets all zone->present_pages to zero, and fixup_zone_present_pages()
>>>> is called to recalculate zone->present_pages when boot allocator frees
>>>> core memory pages into buddy allocator. Because highmem pages are not
>>>> freed by bootmem allocator, all highmem zones' present_pages becomes
>>>> zero.
>>>>
>>>> Actually there's no need to recalculate present_pages for highmem zone
>>>> because bootmem allocator never allocates pages from them. So fix the
>>>> regression by skipping highmem in function reset_zone_present_pages()
>>>> and fixup_zone_present_pages().
>>>>
>>>> ...
>>>>
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>>>>     for_each_node_state(nid, N_HIGH_MEMORY) {
>>>>             for (i = 0; i < MAX_NR_ZONES; i++) {
>>>>                     z = NODE_DATA(nid)->node_zones + i;
>>>> -                   z->present_pages = 0;
>>>> +                   if (!is_highmem(z))
>>>> +                           z->present_pages = 0;
>>>>             }
>>>>     }
>>>>  }
>>>> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>>>>
>>>>     for (i = 0; i < MAX_NR_ZONES; i++) {
>>>>             z = NODE_DATA(nid)->node_zones + i;
>>>> +           if (is_highmem(z))
>>>> +                   continue;
>>>> +
>>>>             zone_start_pfn = z->zone_start_pfn;
>>>>             zone_end_pfn = zone_start_pfn + z->spanned_pages;
>>>> -
>>>> -           /* if the two regions intersect */
>>>>             if (!(zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn))
>>>>                     z->present_pages += min(end_pfn, zone_end_pfn) -
>>>>                                         max(start_pfn, zone_start_pfn);
>>>
>>> This ...  isn't very nice.  It is embeds within
>>> reset_zone_present_pages() and fixup_zone_present_pages() knowledge
>>> about their caller's state.  Or, more specifically, it is emebedding
>>> knowledge about the overall state of the system when these functions
>>> are called.
>>>
>>> I mean, a function called "reset_zone_present_pages" should reset
>>> ->present_pages!
>>>
>>> The fact that fixup_zone_present_page() has multiple call sites makes
>>> this all even more risky.  And what are the interactions between this
>>> and memory hotplug?
>>>
>>> Can we find a cleaner fix?
>>>
>>> Please tell us more about what's happening here.  Is it the case that
>>> reset_zone_present_pages() is being called *after* highmem has been
>>> populated?  If so, then fixup_zone_present_pages() should work
>>> correctly for highmem?  Or is it the case that highmem hasn't yet been
>>> setup?  IOW, what is the sequence of operations here?
>>>
>>> Is the problem that we're *missing* a call to
>>> fixup_zone_present_pages(), perhaps?  If we call
>>> fixup_zone_present_pages() after highmem has been populated,
>>> fixup_zone_present_pages() should correctly fill in the highmem zone's
>>> ->present_pages?
>> Hi Andrew,
>>       Sorry for the late response:(
>>       I have done more investigations according to your suggestions. Currently
>> we have only called fixup_zone_present_pages() for memory freed by bootmem
>> allocator and missed HIGHMEM pages. We could also call fixup_zone_present_pages()
>> for HIGHMEM pages, but that will need to change arch specific code for x86, powerpc,
>> sparc, microblaze, arm, mips, um and tile etc. Seems a little overhead.
>>       And sadly enough, I found the quick fix is still incomplete. The original
>> patch still have another issue that, reset_zone_present_pages() is only called
>> for IA64, so it will cause trouble for other arches which make use of "bootmem.c".
>>       Then I feel a little guilty and tried to find a cleaner solution without
>> touching arch specific code. But things are more complex than my expectation and
>> I'm still working on that.
>>       So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
>> and I will post another version once I found a cleaner way?
>
> I think fixup_zone_present_pages() are very useful for memory hotplug.
>

I might be missing something, but if memory hotplug is the only user that
depends on fixup_zone_present_pages(), why not revert changeset
7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125 and add a check to offline_pages()
like:

if (zone->present_pages >= offlined_pages)
        zone->present_pages -= offlined_pages;
else
        zone->present_pages = 0;

It's simpler and minimizes the effect on other parts of the kernel.
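
The clamp matters because present_pages is an unsigned long: if the revert
leaves it under-accounted, an unchecked subtraction in the offline path would
wrap around to a huge value instead of going negative. A tiny stand-alone
illustration of that wraparound (ordinary user-space C, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long present_pages = 100;	/* under-accounted zone */
	unsigned long offlined_pages = 4096;	/* pages being offlined */

	present_pages -= offlined_pages;	/* wraps around, never negative */
	printf("present_pages after unchecked subtraction: %lu\n",
	       present_pages);
	return 0;
}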

> We calculate zone->present_pages in free_area_init_core(), but its value is wrong.
> So it is why we fix it in fixup_zone_present_pages().
>
> What about this:
> 1. init zone->present_pages to the present pages in this zone(include bootmem)
> 2. don't reset zone->present_pages for HIGHMEM pages
>
> We don't allocate bootmem from HIGHMEM. So its present pages is inited in step1
> and there is no need to fix it in step2.
>
> Is it OK?
>
> If it is OK, I will resend the patch for step1(the patch is from laijs).
>

-- 
Regards,
--Bob

* Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
  2012-11-15 11:28         ` Bob Liu
@ 2012-11-15 14:23           ` Wen Congyang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wen Congyang @ 2012-11-15 14:23 UTC (permalink / raw)
  To: Bob Liu
  Cc: Wen Congyang, Jiang Liu, Andrew Morton, Jiang Liu,
	Maciej Rutecki, Jianguo Wu, Chris Clayton, Rafael J. Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	linux-mm, linux-kernel, Daniel Vetter

At 2012/11/15 19:28, Bob Liu Wrote:
> On Thu, Nov 15, 2012 at 5:22 PM, Wen Congyang<wency@cn.fujitsu.com>  wrote:
>> Hi, Liu Jiang
>>
>> At 11/14/2012 10:52 PM, Jiang Liu Wrote:
>>> On 11/07/2012 04:43 AM, Andrew Morton wrote:
>>>> On Tue, 6 Nov 2012 09:31:57 +0800
>>>> Jiang Liu<jiang.liu@huawei.com>  wrote:
>>>>
>>>>> Changeset 7f1290f2f2 tries to fix a issue when calculating
>>>>> zone->present_pages, but it causes a regression to 32bit systems with
>>>>> HIGHMEM. With that changeset, function reset_zone_present_pages()
>>>>> resets all zone->present_pages to zero, and fixup_zone_present_pages()
>>>>> is called to recalculate zone->present_pages when boot allocator frees
>>>>> core memory pages into buddy allocator. Because highmem pages are not
>>>>> freed by bootmem allocator, all highmem zones' present_pages becomes
>>>>> zero.
>>>>>
>>>>> Actually there's no need to recalculate present_pages for highmem zone
>>>>> because bootmem allocator never allocates pages from them. So fix the
>>>>> regression by skipping highmem in function reset_zone_present_pages()
>>>>> and fixup_zone_present_pages().
>>>>>
>>>>> ...
>>>>>
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>>>>>      for_each_node_state(nid, N_HIGH_MEMORY) {
>>>>>              for (i = 0; i<  MAX_NR_ZONES; i++) {
>>>>>                      z = NODE_DATA(nid)->node_zones + i;
>>>>> -                   z->present_pages = 0;
>>>>> +                   if (!is_highmem(z))
>>>>> +                           z->present_pages = 0;
>>>>>              }
>>>>>      }
>>>>>   }
>>>>> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>>>>>
>>>>>      for (i = 0; i<  MAX_NR_ZONES; i++) {
>>>>>              z = NODE_DATA(nid)->node_zones + i;
>>>>> +           if (is_highmem(z))
>>>>> +                   continue;
>>>>> +
>>>>>              zone_start_pfn = z->zone_start_pfn;
>>>>>              zone_end_pfn = zone_start_pfn + z->spanned_pages;
>>>>> -
>>>>> -           /* if the two regions intersect */
>>>>>              if (!(zone_start_pfn>= end_pfn || zone_end_pfn<= start_pfn))
>>>>>                      z->present_pages += min(end_pfn, zone_end_pfn) -
>>>>>                                          max(start_pfn, zone_start_pfn);
>>>>
>>>> This ...  isn't very nice.  It is embeds within
>>>> reset_zone_present_pages() and fixup_zone_present_pages() knowledge
>>>> about their caller's state.  Or, more specifically, it is emebedding
>>>> knowledge about the overall state of the system when these functions
>>>> are called.
>>>>
>>>> I mean, a function called "reset_zone_present_pages" should reset
>>>> ->present_pages!
>>>>
>>>> The fact that fixup_zone_present_page() has multiple call sites makes
>>>> this all even more risky.  And what are the interactions between this
>>>> and memory hotplug?
>>>>
>>>> Can we find a cleaner fix?
>>>>
>>>> Please tell us more about what's happening here.  Is it the case that
>>>> reset_zone_present_pages() is being called *after* highmem has been
>>>> populated?  If so, then fixup_zone_present_pages() should work
>>>> correctly for highmem?  Or is it the case that highmem hasn't yet been
>>>> setup?  IOW, what is the sequence of operations here?
>>>>
>>>> Is the problem that we're *missing* a call to
>>>> fixup_zone_present_pages(), perhaps?  If we call
>>>> fixup_zone_present_pages() after highmem has been populated,
>>>> fixup_zone_present_pages() should correctly fill in the highmem zone's
>>>> ->present_pages?
>>> Hi Andrew,
>>>        Sorry for the late response:(
>>>        I have done more investigations according to your suggestions. Currently
>>> we have only called fixup_zone_present_pages() for memory freed by bootmem
>>> allocator and missed HIGHMEM pages. We could also call fixup_zone_present_pages()
>>> for HIGHMEM pages, but that will need to change arch specific code for x86, powerpc,
>>> sparc, microblaze, arm, mips, um and tile etc. Seems a little overhead.
>>>        And sadly enough, I found the quick fix is still incomplete. The original
>>> patch still have another issue that, reset_zone_present_pages() is only called
>>> for IA64, so it will cause trouble for other arches which make use of "bootmem.c".
>>>        Then I feel a little guilty and tried to find a cleaner solution without
>>> touching arch specific code. But things are more complex than my expectation and
>>> I'm still working on that.
>>>        So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
>>> and I will post another version once I found a cleaner way?
>>
>> I think fixup_zone_present_pages() are very useful for memory hotplug.
>>
>
> I might miss something, but if memory hotplug is the only user depends on
> fixup_zone_present_pages().

IIRC, the watermarks depend on zone->present_pages. But I haven't hit any
problem even when zone->present_pages is wrong.

> Why not reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
> And add checking to offline_pages() like:
> if (zone->present_pages >= offlined_pages)
>          zone->present_pages -= offlined_pages;
> else
>          zone->present_pages = 0;
>
> It's more simple and can minimize the effect to other parts of kernel.

Hmm, zone->present_pages may be 0 even when there is memory in this zone that is
onlined and in use. If zone->present_pages becomes 0, we will free the pcp
list for this zone, which will cause unexpected errors.
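
A rough sketch of the clamped decrement above and of the side effect it can
trigger (illustration only, not a tested patch; the helper names follow the
memory-hotplug code as far as I remember):

	/* in offline_pages(): avoid underflowing the counter */
	if (zone->present_pages >= offlined_pages)
		zone->present_pages -= offlined_pages;
	else
		zone->present_pages = 0;

	/*
	 * populated_zone() is just "zone->present_pages != 0", so if the
	 * counter wrongly reaches zero while the zone still has pages in
	 * the buddy system, the zone looks empty and its per-cpu pagesets
	 * can be torn down even though the memory is still in use.
	 */
	if (!populated_zone(zone))
		zone_pcp_reset(zone);	/* would run by mistake */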

>
>> We calculate zone->present_pages in free_area_init_core(), but its value is wrong.
>> So it is why we fix it in fixup_zone_present_pages().
>>
>> What about this:
>> 1. init zone->present_pages to the present pages in this zone(include bootmem)
>> 2. don't reset zone->present_pages for HIGHMEM pages
>>
>> We don't allocate bootmem from HIGHMEM. So its present pages is inited in step1
>> and there is no need to fix it in step2.
>>
>> Is it OK?
>>
>> If it is OK, I will resend the patch for step1(the patch is from laijs).
>>
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
  2012-11-15  9:22       ` Wen Congyang
@ 2012-11-15 15:40         ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-15 15:40 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Andrew Morton, Jiang Liu, Maciej Rutecki, Jianguo Wu,
	Chris Clayton, Rafael J. Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel,
	Daniel Vetter

On 11/15/2012 05:22 PM, Wen Congyang wrote:
> Hi, Liu Jiang
> 
> At 11/14/2012 10:52 PM, Jiang Liu Wrote:
>> On 11/07/2012 04:43 AM, Andrew Morton wrote:
>>> On Tue, 6 Nov 2012 09:31:57 +0800
>>> Jiang Liu <jiang.liu@huawei.com> wrote:
>>>
>>>> Changeset 7f1290f2f2 tries to fix a issue when calculating
>>>> zone->present_pages, but it causes a regression to 32bit systems with
>>>> HIGHMEM. With that changeset, function reset_zone_present_pages()
>>>> resets all zone->present_pages to zero, and fixup_zone_present_pages()
>>>> is called to recalculate zone->present_pages when boot allocator frees
>>>> core memory pages into buddy allocator. Because highmem pages are not
>>>> freed by bootmem allocator, all highmem zones' present_pages becomes
>>>> zero.
>>>>
>>>> Actually there's no need to recalculate present_pages for highmem zone
>>>> because bootmem allocator never allocates pages from them. So fix the
>>>> regression by skipping highmem in function reset_zone_present_pages()
>>>> and fixup_zone_present_pages().
>>>>
>>>> ...
>>>>
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>>>>  	for_each_node_state(nid, N_HIGH_MEMORY) {
>>>>  		for (i = 0; i < MAX_NR_ZONES; i++) {
>>>>  			z = NODE_DATA(nid)->node_zones + i;
>>>> -			z->present_pages = 0;
>>>> +			if (!is_highmem(z))
>>>> +				z->present_pages = 0;
>>>>  		}
>>>>  	}
>>>>  }
>>>> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>>>>  
>>>>  	for (i = 0; i < MAX_NR_ZONES; i++) {
>>>>  		z = NODE_DATA(nid)->node_zones + i;
>>>> +		if (is_highmem(z))
>>>> +			continue;
>>>> +
>>>>  		zone_start_pfn = z->zone_start_pfn;
>>>>  		zone_end_pfn = zone_start_pfn + z->spanned_pages;
>>>> -
>>>> -		/* if the two regions intersect */
>>>>  		if (!(zone_start_pfn >= end_pfn	|| zone_end_pfn <= start_pfn))
>>>>  			z->present_pages += min(end_pfn, zone_end_pfn) -
>>>>  					    max(start_pfn, zone_start_pfn);
>>>
>>> This ...  isn't very nice.  It is embeds within
>>> reset_zone_present_pages() and fixup_zone_present_pages() knowledge
>>> about their caller's state.  Or, more specifically, it is emebedding
>>> knowledge about the overall state of the system when these functions
>>> are called.
>>>
>>> I mean, a function called "reset_zone_present_pages" should reset
>>> ->present_pages!
>>>
>>> The fact that fixup_zone_present_page() has multiple call sites makes
>>> this all even more risky.  And what are the interactions between this
>>> and memory hotplug?
>>>
>>> Can we find a cleaner fix?
>>>
>>> Please tell us more about what's happening here.  Is it the case that
>>> reset_zone_present_pages() is being called *after* highmem has been
>>> populated?  If so, then fixup_zone_present_pages() should work
>>> correctly for highmem?  Or is it the case that highmem hasn't yet been
>>> setup?  IOW, what is the sequence of operations here?
>>>
>>> Is the problem that we're *missing* a call to
>>> fixup_zone_present_pages(), perhaps?  If we call
>>> fixup_zone_present_pages() after highmem has been populated,
>>> fixup_zone_present_pages() should correctly fill in the highmem zone's
>>> ->present_pages?
>> Hi Andrew,
>> 	Sorry for the late response:(
>> 	I have done more investigations according to your suggestions. Currently
>> we have only called fixup_zone_present_pages() for memory freed by bootmem
>> allocator and missed HIGHMEM pages. We could also call fixup_zone_present_pages()
>> for HIGHMEM pages, but that will need to change arch specific code for x86, powerpc,
>> sparc, microblaze, arm, mips, um and tile etc. Seems a little overhead.
>> 	And sadly enough, I found the quick fix is still incomplete. The original
>> patch still have another issue that, reset_zone_present_pages() is only called
>> for IA64, so it will cause trouble for other arches which make use of "bootmem.c".
>> 	Then I feel a little guilty and tried to find a cleaner solution without
>> touching arch specific code. But things are more complex than my expectation and
>> I'm still working on that.
>> 	So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
>> and I will post another version once I found a cleaner way?
> 
> I think fixup_zone_present_pages() are very useful for memory hotplug.
> 
> We calculate zone->present_pages in free_area_init_core(), but its value is wrong.
> So it is why we fix it in fixup_zone_present_pages().
> 
> What about this:
> 1. init zone->present_pages to the present pages in this zone(include bootmem)
> 2. don't reset zone->present_pages for HIGHMEM pages
> 
> We don't allocate bootmem from HIGHMEM. So its present pages is inited in step1
> and there is no need to fix it in step2.
Hi Congyang,

I feel that zone->present_pages has been abused. I guess it originally meant
"physical pages present in this zone", but now zone->present_pages is sometimes
used as "pages in this zone managed by the buddy system". So I'm trying to add
a new field "managed_pages" to struct zone, which accounts for pages managed by
the buddy system. That's why I thought the clean solution is a little complex:(

Why do we need "managed_pages"? With HIGHMEM enabled, there may be a big difference
between "present_pages" and "managed_pages" in ZONE_NORMAL, because ZONE_NORMAL also
hosts the struct page array for ZONE_HIGHMEM. That may disturb the page allocator or
the reclaim scanner.
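
As a minimal, self-contained illustration of the accounting I have in mind
(the exact field name and comments are still tentative, this is not the final
struct zone layout):

	struct zone_accounting {
		unsigned long spanned_pages;	/* pfns covered by the zone, holes included */
		unsigned long present_pages;	/* spanned_pages - absent_pages (holes) */
		unsigned long managed_pages;	/* present_pages minus pages reserved at
						 * boot (bootmem allocations, memmap, ...),
						 * i.e. pages handed to the buddy system */
	};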

What are your thoughts?

Thanks
Gerry

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
  2012-11-14 14:52     ` Jiang Liu
@ 2012-11-15 19:24       ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-15 19:24 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Jiang Liu, Maciej Rutecki, Jianguo Wu, Chris Clayton,
	Rafael J. Wysocki, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
	Michal Hocko, linux-mm, linux-kernel, Daniel Vetter

On Wed, 14 Nov 2012 22:52:03 +0800
Jiang Liu <liuj97@gmail.com> wrote:

> 	So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
> and I will post another version once I found a cleaner way?

We do need to get this regression fixed and I guess that a
straightforward revert is an acceptable way of doing that, for now.


I queued the below, with a plan to send it to Linus next week.


From: Andrew Morton <akpm@linux-foundation.org>
Subject: revert "mm: fix-up zone present pages"

Revert

commit 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
Author:     Jianguo Wu <wujianguo@huawei.com>
AuthorDate: Mon Oct 8 16:33:06 2012 -0700
Commit:     Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Tue Oct 9 16:22:54 2012 +0900

    mm: fix-up zone present pages


That patch tried to fix an issue when calculating zone->present_pages, but
it caused a regression on 32bit systems with HIGHMEM.  With that
changeset, reset_zone_present_pages() resets all zone->present_pages to
zero, and fixup_zone_present_pages() is called to recalculate
zone->present_pages when the boot allocator frees core memory pages into
the buddy allocator.  Because highmem pages are not freed by the bootmem
allocator, all highmem zones' present_pages become zero.

Various options for improving the situation are being discussed but for
now, let's return to the 3.6 code.
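
For reference, the 3.6 code being restored sets up present_pages once at boot
time, roughly like this (condensed sketch of free_area_init_core(), not the
verbatim source):

	for (j = 0; j < MAX_NR_ZONES; j++) {
		struct zone *zone = pgdat->node_zones + j;
		unsigned long size, realsize;

		size = zone_spanned_pages_in_node(nid, j, zones_size);
		realsize = size - zone_absent_pages_in_node(nid, j, zholes_size);
		/* the real code also subtracts the memmap and dma_reserve here */

		zone->spanned_pages = size;
		zone->present_pages = realsize;	/* set once at boot, never reset */
	}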

Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/ia64/mm/init.c |    1 -
 include/linux/mm.h  |    4 ----
 mm/bootmem.c        |   10 +---------
 mm/memory_hotplug.c |    7 -------
 mm/nobootmem.c      |    3 ---
 mm/page_alloc.c     |   34 ----------------------------------
 6 files changed, 1 insertion(+), 58 deletions(-)

diff -puN arch/ia64/mm/init.c~revert-1 arch/ia64/mm/init.c
--- a/arch/ia64/mm/init.c~revert-1
+++ a/arch/ia64/mm/init.c
@@ -637,7 +637,6 @@ mem_init (void)
 
 	high_memory = __va(max_low_pfn * PAGE_SIZE);
 
-	reset_zone_present_pages();
 	for_each_online_pgdat(pgdat)
 		if (pgdat->bdata->node_bootmem_map)
 			totalram_pages += free_all_bootmem_node(pgdat);
diff -puN include/linux/mm.h~revert-1 include/linux/mm.h
--- a/include/linux/mm.h~revert-1
+++ a/include/linux/mm.h
@@ -1684,9 +1684,5 @@ static inline unsigned int debug_guardpa
 static inline bool page_is_guard(struct page *page) { return false; }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
-extern void reset_zone_present_pages(void);
-extern void fixup_zone_present_pages(int nid, unsigned long start_pfn,
-				unsigned long end_pfn);
-
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff -puN mm/bootmem.c~revert-1 mm/bootmem.c
--- a/mm/bootmem.c~revert-1
+++ a/mm/bootmem.c
@@ -198,8 +198,6 @@ static unsigned long __init free_all_boo
 			int order = ilog2(BITS_PER_LONG);
 
 			__free_pages_bootmem(pfn_to_page(start), order);
-			fixup_zone_present_pages(page_to_nid(pfn_to_page(start)),
-					start, start + BITS_PER_LONG);
 			count += BITS_PER_LONG;
 			start += BITS_PER_LONG;
 		} else {
@@ -210,9 +208,6 @@ static unsigned long __init free_all_boo
 				if (vec & 1) {
 					page = pfn_to_page(start + off);
 					__free_pages_bootmem(page, 0);
-					fixup_zone_present_pages(
-						page_to_nid(page),
-						start + off, start + off + 1);
 					count++;
 				}
 				vec >>= 1;
@@ -226,11 +221,8 @@ static unsigned long __init free_all_boo
 	pages = bdata->node_low_pfn - bdata->node_min_pfn;
 	pages = bootmem_bootmap_pages(pages);
 	count += pages;
-	while (pages--) {
-		fixup_zone_present_pages(page_to_nid(page),
-				page_to_pfn(page), page_to_pfn(page) + 1);
+	while (pages--)
 		__free_pages_bootmem(page++, 0);
-	}
 
 	bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count);
 
diff -puN mm/memory_hotplug.c~revert-1 mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~revert-1
+++ a/mm/memory_hotplug.c
@@ -106,7 +106,6 @@ static void get_page_bootmem(unsigned lo
 void __ref put_page_bootmem(struct page *page)
 {
 	unsigned long type;
-	struct zone *zone;
 
 	type = (unsigned long) page->lru.next;
 	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
@@ -117,12 +116,6 @@ void __ref put_page_bootmem(struct page 
 		set_page_private(page, 0);
 		INIT_LIST_HEAD(&page->lru);
 		__free_pages_bootmem(page, 0);
-
-		zone = page_zone(page);
-		zone_span_writelock(zone);
-		zone->present_pages++;
-		zone_span_writeunlock(zone);
-		totalram_pages++;
 	}
 
 }
diff -puN mm/nobootmem.c~revert-1 mm/nobootmem.c
--- a/mm/nobootmem.c~revert-1
+++ a/mm/nobootmem.c
@@ -116,8 +116,6 @@ static unsigned long __init __free_memor
 		return 0;
 
 	__free_pages_memory(start_pfn, end_pfn);
-	fixup_zone_present_pages(pfn_to_nid(start >> PAGE_SHIFT),
-			start_pfn, end_pfn);
 
 	return end_pfn - start_pfn;
 }
@@ -128,7 +126,6 @@ unsigned long __init free_low_memory_cor
 	phys_addr_t start, end, size;
 	u64 i;
 
-	reset_zone_present_pages();
 	for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
 		count += __free_memory_core(start, end);
 
diff -puN mm/page_alloc.c~revert-1 mm/page_alloc.c
--- a/mm/page_alloc.c~revert-1
+++ a/mm/page_alloc.c
@@ -6098,37 +6098,3 @@ void dump_page(struct page *page)
 	dump_page_flags(page->flags);
 	mem_cgroup_print_bad_page(page);
 }
-
-/* reset zone->present_pages */
-void reset_zone_present_pages(void)
-{
-	struct zone *z;
-	int i, nid;
-
-	for_each_node_state(nid, N_HIGH_MEMORY) {
-		for (i = 0; i < MAX_NR_ZONES; i++) {
-			z = NODE_DATA(nid)->node_zones + i;
-			z->present_pages = 0;
-		}
-	}
-}
-
-/* calculate zone's present pages in buddy system */
-void fixup_zone_present_pages(int nid, unsigned long start_pfn,
-				unsigned long end_pfn)
-{
-	struct zone *z;
-	unsigned long zone_start_pfn, zone_end_pfn;
-	int i;
-
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		z = NODE_DATA(nid)->node_zones + i;
-		zone_start_pfn = z->zone_start_pfn;
-		zone_end_pfn = zone_start_pfn + z->spanned_pages;
-
-		/* if the two regions intersect */
-		if (!(zone_start_pfn >= end_pfn	|| zone_end_pfn <= start_pfn))
-			z->present_pages += min(end_pfn, zone_end_pfn) -
-					    max(start_pfn, zone_start_pfn);
-	}
-}
_


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
  2012-11-15 19:24       ` Andrew Morton
@ 2012-11-15 21:17         ` Chris Clayton
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Clayton @ 2012-11-15 21:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiang Liu, Jiang Liu, Maciej Rutecki, Jianguo Wu,
	Rafael J. Wysocki, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
	Michal Hocko, linux-mm, linux-kernel, Daniel Vetter



On 11/15/12 19:24, Andrew Morton wrote:
> On Wed, 14 Nov 2012 22:52:03 +0800
> Jiang Liu <liuj97@gmail.com> wrote:
>
>> 	So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
>> and I will post another version once I found a cleaner way?
>
> We do need to get this regression fixed and I guess that a
> straightforward revert is an acceptable way of doing that, for now.
>
>
> I queued the below, with a plan to send it to Linus next week.
>

I've applied this patch to v3.7-rc5-28-g79e979e and can confirm that it 
fixes the problem I had with my laptop failing to resume (by either 
freezing or rebooting) after a suspend to disk.

Tested-by: Chris Clayton <chris2553@googlemail.com>

>
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: revert "mm: fix-up zone present pages"
>
> Revert
>
> commit 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
> Author:     Jianguo Wu <wujianguo@huawei.com>
> AuthorDate: Mon Oct 8 16:33:06 2012 -0700
> Commit:     Linus Torvalds <torvalds@linux-foundation.org>
> CommitDate: Tue Oct 9 16:22:54 2012 +0900
>
>      mm: fix-up zone present pages
>
>
> That patch tried to fix a issue when calculating zone->present_pages, but
> it caused a regression on 32bit systems with HIGHMEM.  With that
> changeset, reset_zone_present_pages() resets all zone->present_pages to
> zero, and fixup_zone_present_pages() is called to recalculate
> zone->present_pages when the boot allocator frees core memory pages into
> buddy allocator.  Because highmem pages are not freed by bootmem
> allocator, all highmem zones' present_pages becomes zero.
>
> Various options for improving the situation are being discussed but for
> now, let's return to the 3.6 code.
>
> Cc: Jianguo Wu <wujianguo@huawei.com>
> Cc: Jiang Liu <jiang.liu@huawei.com>
> Cc: Petr Tesarik <ptesarik@suse.cz>
> Cc: "Luck, Tony" <tony.luck@intel.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Yinghai Lu <yinghai@kernel.org>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>   arch/ia64/mm/init.c |    1 -
>   include/linux/mm.h  |    4 ----
>   mm/bootmem.c        |   10 +---------
>   mm/memory_hotplug.c |    7 -------
>   mm/nobootmem.c      |    3 ---
>   mm/page_alloc.c     |   34 ----------------------------------
>   6 files changed, 1 insertion(+), 58 deletions(-)
>
> diff -puN arch/ia64/mm/init.c~revert-1 arch/ia64/mm/init.c
> --- a/arch/ia64/mm/init.c~revert-1
> +++ a/arch/ia64/mm/init.c
> @@ -637,7 +637,6 @@ mem_init (void)
>
>   	high_memory = __va(max_low_pfn * PAGE_SIZE);
>
> -	reset_zone_present_pages();
>   	for_each_online_pgdat(pgdat)
>   		if (pgdat->bdata->node_bootmem_map)
>   			totalram_pages += free_all_bootmem_node(pgdat);
> diff -puN include/linux/mm.h~revert-1 include/linux/mm.h
> --- a/include/linux/mm.h~revert-1
> +++ a/include/linux/mm.h
> @@ -1684,9 +1684,5 @@ static inline unsigned int debug_guardpa
>   static inline bool page_is_guard(struct page *page) { return false; }
>   #endif /* CONFIG_DEBUG_PAGEALLOC */
>
> -extern void reset_zone_present_pages(void);
> -extern void fixup_zone_present_pages(int nid, unsigned long start_pfn,
> -				unsigned long end_pfn);
> -
>   #endif /* __KERNEL__ */
>   #endif /* _LINUX_MM_H */
> diff -puN mm/bootmem.c~revert-1 mm/bootmem.c
> --- a/mm/bootmem.c~revert-1
> +++ a/mm/bootmem.c
> @@ -198,8 +198,6 @@ static unsigned long __init free_all_boo
>   			int order = ilog2(BITS_PER_LONG);
>
>   			__free_pages_bootmem(pfn_to_page(start), order);
> -			fixup_zone_present_pages(page_to_nid(pfn_to_page(start)),
> -					start, start + BITS_PER_LONG);
>   			count += BITS_PER_LONG;
>   			start += BITS_PER_LONG;
>   		} else {
> @@ -210,9 +208,6 @@ static unsigned long __init free_all_boo
>   				if (vec & 1) {
>   					page = pfn_to_page(start + off);
>   					__free_pages_bootmem(page, 0);
> -					fixup_zone_present_pages(
> -						page_to_nid(page),
> -						start + off, start + off + 1);
>   					count++;
>   				}
>   				vec >>= 1;
> @@ -226,11 +221,8 @@ static unsigned long __init free_all_boo
>   	pages = bdata->node_low_pfn - bdata->node_min_pfn;
>   	pages = bootmem_bootmap_pages(pages);
>   	count += pages;
> -	while (pages--) {
> -		fixup_zone_present_pages(page_to_nid(page),
> -				page_to_pfn(page), page_to_pfn(page) + 1);
> +	while (pages--)
>   		__free_pages_bootmem(page++, 0);
> -	}
>
>   	bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count);
>
> diff -puN mm/memory_hotplug.c~revert-1 mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c~revert-1
> +++ a/mm/memory_hotplug.c
> @@ -106,7 +106,6 @@ static void get_page_bootmem(unsigned lo
>   void __ref put_page_bootmem(struct page *page)
>   {
>   	unsigned long type;
> -	struct zone *zone;
>
>   	type = (unsigned long) page->lru.next;
>   	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
> @@ -117,12 +116,6 @@ void __ref put_page_bootmem(struct page
>   		set_page_private(page, 0);
>   		INIT_LIST_HEAD(&page->lru);
>   		__free_pages_bootmem(page, 0);
> -
> -		zone = page_zone(page);
> -		zone_span_writelock(zone);
> -		zone->present_pages++;
> -		zone_span_writeunlock(zone);
> -		totalram_pages++;
>   	}
>
>   }
> diff -puN mm/nobootmem.c~revert-1 mm/nobootmem.c
> --- a/mm/nobootmem.c~revert-1
> +++ a/mm/nobootmem.c
> @@ -116,8 +116,6 @@ static unsigned long __init __free_memor
>   		return 0;
>
>   	__free_pages_memory(start_pfn, end_pfn);
> -	fixup_zone_present_pages(pfn_to_nid(start >> PAGE_SHIFT),
> -			start_pfn, end_pfn);
>
>   	return end_pfn - start_pfn;
>   }
> @@ -128,7 +126,6 @@ unsigned long __init free_low_memory_cor
>   	phys_addr_t start, end, size;
>   	u64 i;
>
> -	reset_zone_present_pages();
>   	for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
>   		count += __free_memory_core(start, end);
>
> diff -puN mm/page_alloc.c~revert-1 mm/page_alloc.c
> --- a/mm/page_alloc.c~revert-1
> +++ a/mm/page_alloc.c
> @@ -6098,37 +6098,3 @@ void dump_page(struct page *page)
>   	dump_page_flags(page->flags);
>   	mem_cgroup_print_bad_page(page);
>   }
> -
> -/* reset zone->present_pages */
> -void reset_zone_present_pages(void)
> -{
> -	struct zone *z;
> -	int i, nid;
> -
> -	for_each_node_state(nid, N_HIGH_MEMORY) {
> -		for (i = 0; i < MAX_NR_ZONES; i++) {
> -			z = NODE_DATA(nid)->node_zones + i;
> -			z->present_pages = 0;
> -		}
> -	}
> -}
> -
> -/* calculate zone's present pages in buddy system */
> -void fixup_zone_present_pages(int nid, unsigned long start_pfn,
> -				unsigned long end_pfn)
> -{
> -	struct zone *z;
> -	unsigned long zone_start_pfn, zone_end_pfn;
> -	int i;
> -
> -	for (i = 0; i < MAX_NR_ZONES; i++) {
> -		z = NODE_DATA(nid)->node_zones + i;
> -		zone_start_pfn = z->zone_start_pfn;
> -		zone_end_pfn = zone_start_pfn + z->spanned_pages;
> -
> -		/* if the two regions intersect */
> -		if (!(zone_start_pfn >= end_pfn	|| zone_end_pfn <= start_pfn))
> -			z->present_pages += min(end_pfn, zone_end_pfn) -
> -					    max(start_pfn, zone_start_pfn);
> -	}
> -}
> _
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
  2012-11-15 19:24       ` Andrew Morton
@ 2012-11-15 21:27         ` David Rientjes
  -1 siblings, 0 replies; 98+ messages in thread
From: David Rientjes @ 2012-11-15 21:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiang Liu, Jiang Liu, Maciej Rutecki, Jianguo Wu, Chris Clayton,
	Rafael J. Wysocki, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
	Michal Hocko, linux-mm, linux-kernel, Daniel Vetter

On Thu, 15 Nov 2012, Andrew Morton wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: revert "mm: fix-up zone present pages"
> 
> Revert
> 
> commit 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
> Author:     Jianguo Wu <wujianguo@huawei.com>
> AuthorDate: Mon Oct 8 16:33:06 2012 -0700
> Commit:     Linus Torvalds <torvalds@linux-foundation.org>
> CommitDate: Tue Oct 9 16:22:54 2012 +0900
> 
>     mm: fix-up zone present pages
> 
> 
> That patch tried to fix a issue when calculating zone->present_pages, but
> it caused a regression on 32bit systems with HIGHMEM.  With that
> changeset, reset_zone_present_pages() resets all zone->present_pages to
> zero, and fixup_zone_present_pages() is called to recalculate
> zone->present_pages when the boot allocator frees core memory pages into
> buddy allocator.  Because highmem pages are not freed by bootmem
> allocator, all highmem zones' present_pages becomes zero.
> 
> Various options for improving the situation are being discussed but for
> now, let's return to the 3.6 code.
> 
> Cc: Jianguo Wu <wujianguo@huawei.com>
> Cc: Jiang Liu <jiang.liu@huawei.com>
> Cc: Petr Tesarik <ptesarik@suse.cz>
> Cc: "Luck, Tony" <tony.luck@intel.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Yinghai Lu <yinghai@kernel.org>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d
  2012-11-15 15:40         ` Jiang Liu
@ 2012-11-15 21:41           ` David Rientjes
  -1 siblings, 0 replies; 98+ messages in thread
From: David Rientjes @ 2012-11-15 21:41 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, Andrew Morton, Jiang Liu, Maciej Rutecki,
	Jianguo Wu, Chris Clayton, Rafael J. Wysocki, Mel Gorman,
	Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm,
	linux-kernel, Daniel Vetter

On Thu, 15 Nov 2012, Jiang Liu wrote:

> I feel that zone->present_pages has been abused. I guess it means "physical pages 
> present in this zone" originally, but now sometimes zone->present_pages is used as
> "pages in this zone managed by the buddy system".

Its definition is all pages spanned by the zone except those that are
reserved and therefore unavailable for the kernel to allocate from, and the
implementation of bootmem requires that its memory be considered "reserved"
until freed.  It's used throughout the kernel to determine the amount of
memory in that zone that is allocatable from the page allocator, since the
reclaim heuristics and watermarks depend on this memory being allocatable.
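
To put that definition in rough terms (an illustrative paraphrase, not a
literal kernel expression):

	present_pages ~= spanned_pages - reserved_pages
	/* i.e. what the page allocator can hand out, with bootmem's
	   memory counted as "reserved" until it is freed into the
	   buddy system */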

> So I'm trying to add a new
> field "managed_pages" into zone, which accounts for pages managed by buddy system.
> That's why I thought the clean solution is a little complex:(
> 

You need to update the pgdat's node_present_pages to be consistent with 
all of its zones' present_pages.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
  2012-11-15 19:24       ` Andrew Morton
@ 2012-11-18 16:07         ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-18 16:07 UTC (permalink / raw)
  To: Andrew Morton, Wen Congyang, David Rientjes
  Cc: Jiang Liu, Maciej Rutecki, Chris Clayton, Rafael J . Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	Jianguo Wu, linux-mm, linux-kernel

Commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
resolve an issue caused by inaccurate zone->present_pages, but that
fix is incomplete and causes regressions with HIGHMEM. It has since
been reverted by commit 5576646, 'revert "mm: fix-up zone present pages"'.

This is a follow-up patchset for the issue above. It introduces a
new field named "managed_pages" to struct zone, which counts the pages
in the zone that are managed by the buddy system. zone->present_pages
is then used to count the pages existing in the zone, which is
	spanned_pages - absent_pages.

That way, zone->present_pages is kept consistent with
pgdat->node_present_pages, which is the sum of the zones' present_pages.
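
To make the intended relationships explicit (an illustrative sketch of the
accounting this series aims for, using descriptive placeholder names rather
than real fields):

	zone->present_pages = spanned_pages - absent_pages;
	zone->managed_pages = present_pages - memmap_pages - dma_reserve
			      - pages_still_held_by_bootmem;
	pgdat->node_present_pages = sum of zone->present_pages over the
				    node's zones;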

This patchset has only been tested on x86_64 with nobootmem.c, so we need
help testing it on machines that:
1) use bootmem.c
2) have highmem

This patchset applies to "f4a75d2e Linux 3.7-rc6" from
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Any comments and help are welcome!

Jiang Liu (5):
  mm: introduce new field "managed_pages" to struct zone
  mm: replace zone->present_pages with zone->managed_pages if
    appreciated
  mm: set zone->present_pages to number of existing pages in the zone
  mm: provide more accurate estimation of pages occupied by memmap
  mm: increase totalram_pages when free pages allocated by bootmem
    allocator

 include/linux/mmzone.h |    1 +
 mm/bootmem.c           |   14 ++++++++
 mm/memory_hotplug.c    |    6 ++++
 mm/mempolicy.c         |    2 +-
 mm/nobootmem.c         |   15 ++++++++
 mm/page_alloc.c        |   89 +++++++++++++++++++++++++++++++-----------------
 mm/vmscan.c            |   16 ++++-----
 mm/vmstat.c            |    8 +++--
 8 files changed, 108 insertions(+), 43 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 98+ messages in thread

* [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone
  2012-11-18 16:07         ` Jiang Liu
@ 2012-11-18 16:07           ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-18 16:07 UTC (permalink / raw)
  To: Andrew Morton, Wen Congyang, David Rientjes
  Cc: Jiang Liu, Maciej Rutecki, Chris Clayton, Rafael J . Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	Jianguo Wu, linux-mm, linux-kernel

Currently a zone's present_pages is calculated as below, which is
inaccurate and may cause trouble for memory hotplug.
	spanned_pages - absent_pages - memmap_pages - dma_reserve.

While fixing bugs caused by the inaccurate zone->present_pages, we found
that zone->present_pages has been abused. The field zone->present_pages
may have different meanings in different contexts:
1) pages existing in a zone.
2) pages managed by the buddy system.

For more discussions about the issue, please refer to:
http://lkml.org/lkml/2012/11/5/866
https://patchwork.kernel.org/patch/1346751/

This patchset introduces a new field named "managed_pages" to
struct zone, which counts "pages managed by the buddy system", and
reverts zone->present_pages to counting "physical pages existing in a
zone", which also keeps it consistent with pgdat->node_present_pages.

We set an initial value for zone->managed_pages in function
free_area_init_core(), and it will be adjusted later if the initial
value turns out to be inaccurate.

For DMA/normal zones, the initial value is set to:
	(spanned_pages - absent_pages - memmap_pages - dma_reserve)
Later zone->managed_pages will be adjusted to the accurate value when
the bootmem allocator frees all free pages to the buddy system in
function free_all_bootmem_node() and free_all_bootmem().

The bootmem allocator doesn't touch highmem pages, so highmem zones'
managed_pages is set to the accurate value "spanned_pages - absent_pages"
in function free_area_init_core() and won't be updated anymore.
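
A minimal sketch of that flow (pseudo-code condensed from the hunks below):

	/* zone init time, free_area_init_core() */
	zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;

	/* right before bootmem releases core memory, free_all_bootmem*() */
	for each lowmem zone of the node:
		zone->managed_pages = 0;

	/* as each page enters the buddy system, __free_pages_bootmem() */
	page_zone(page)->managed_pages += 1 << order;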

This patch also adds a new field "managed_pages" to /proc/zoneinfo
and sysrq showmem.
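
With this applied, each zone stanza in /proc/zoneinfo would be expected to
gain a "managed" line, e.g. (the numbers here are only illustrative):

	Node 0, zone   Normal
	  pages free     21505
	        min      11254
	        low      14067
	        high     16881
	        scanned  0
	        spanned  222944
	        present  221788
	        managed  217962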

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 include/linux/mmzone.h |    1 +
 mm/bootmem.c           |   14 ++++++++++++++
 mm/memory_hotplug.c    |    5 +++++
 mm/nobootmem.c         |   15 +++++++++++++++
 mm/page_alloc.c        |   37 +++++++++++++++++++++++--------------
 mm/vmstat.c            |    6 ++++--
 6 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a23923b..c61fb6d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -480,6 +480,7 @@ struct zone {
 	 */
 	unsigned long		spanned_pages;	/* total size, including holes */
 	unsigned long		present_pages;	/* amount of memory (excluding holes) */
+	unsigned long		managed_pages;	/* pages managed by the Buddy */
 
 	/*
 	 * rarely used fields:
diff --git a/mm/bootmem.c b/mm/bootmem.c
index f468185..a813e5b 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -229,6 +229,15 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
 	return count;
 }
 
+static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
+{
+	struct zone *z;
+
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		if (!is_highmem(z))
+			z->managed_pages = 0;
+}
+
 /**
  * free_all_bootmem_node - release a node's free pages to the buddy allocator
  * @pgdat: node to be released
@@ -238,6 +247,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
 unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
 {
 	register_page_bootmem_info_node(pgdat);
+	reset_node_lowmem_managed_pages(pgdat);
 	return free_all_bootmem_core(pgdat->bdata);
 }
 
@@ -250,6 +260,10 @@ unsigned long __init free_all_bootmem(void)
 {
 	unsigned long total_pages = 0;
 	bootmem_data_t *bdata;
+	struct pglist_data *pgdat;
+
+	for_each_online_pgdat(pgdat)
+		reset_node_lowmem_managed_pages(pgdat);
 
 	list_for_each_entry(bdata, &bdata_list, list)
 		total_pages += free_all_bootmem_core(bdata);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4eeaca..acd3993 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info,  struct page *page,
 void __ref put_page_bootmem(struct page *page)
 {
 	unsigned long type;
+	static DEFINE_MUTEX(ppb_lock);
 
 	type = (unsigned long) page->lru.next;
 	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
@@ -115,7 +116,9 @@ void __ref put_page_bootmem(struct page *page)
 		ClearPagePrivate(page);
 		set_page_private(page, 0);
 		INIT_LIST_HEAD(&page->lru);
+		mutex_lock(&ppb_lock);
 		__free_pages_bootmem(page, 0);
+		mutex_unlock(&ppb_lock);
 	}
 
 }
@@ -514,6 +517,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
 		return ret;
 	}
 
+	zone->managed_pages += onlined_pages;
 	zone->present_pages += onlined_pages;
 	zone->zone_pgdat->node_present_pages += onlined_pages;
 	if (onlined_pages) {
@@ -961,6 +965,7 @@ repeat:
 	/* reset pagetype flags and makes migrate type to be MOVABLE */
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	/* removal success */
+	zone->managed_pages -= offlined_pages;
 	zone->present_pages -= offlined_pages;
 	zone->zone_pgdat->node_present_pages -= offlined_pages;
 	totalram_pages -= offlined_pages;
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bd82f6b..9c55c18 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -137,6 +137,15 @@ unsigned long __init free_low_memory_core_early(int nodeid)
 	return count;
 }
 
+static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
+{
+	struct zone *z;
+
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		if (!is_highmem(z))
+			z->managed_pages = 0;
+}
+
 /**
  * free_all_bootmem_node - release a node's free pages to the buddy allocator
  * @pgdat: node to be released
@@ -146,6 +155,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
 unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
 {
 	register_page_bootmem_info_node(pgdat);
+	reset_node_lowmem_managed_pages(pgdat);
 
 	/* free_low_memory_core_early(MAX_NUMNODES) will be called later */
 	return 0;
@@ -158,6 +168,11 @@ unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
  */
 unsigned long __init free_all_bootmem(void)
 {
+	struct pglist_data *pgdat;
+
+	for_each_online_pgdat(pgdat)
+		reset_node_lowmem_managed_pages(pgdat);
+
 	/*
 	 * We need to use MAX_NUMNODES instead of NODE_DATA(0)->node_id
 	 *  because in some case like Node0 doesn't have RAM installed
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7bb35ac..a41ee64 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -745,6 +745,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
 		set_page_count(p, 0);
 	}
 
+	page_zone(page)->managed_pages += 1 << order;
 	set_page_refcounted(page);
 	__free_pages(page, order);
 }
@@ -2950,6 +2951,7 @@ void show_free_areas(unsigned int filter)
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
 			" present:%lukB"
+			" managed:%lukB"
 			" mlocked:%lukB"
 			" dirty:%lukB"
 			" writeback:%lukB"
@@ -2979,6 +2981,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_ISOLATED_ANON)),
 			K(zone_page_state(zone, NR_ISOLATED_FILE)),
 			K(zone->present_pages),
+			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
 			K(zone_page_state(zone, NR_FILE_DIRTY)),
 			K(zone_page_state(zone, NR_WRITEBACK)),
@@ -4455,48 +4458,54 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
-		unsigned long size, realsize, memmap_pages;
+		unsigned long size, realsize, freesize, memmap_pages;
 
 		size = zone_spanned_pages_in_node(nid, j, zones_size);
-		realsize = size - zone_absent_pages_in_node(nid, j,
+		realsize = freesize = size - zone_absent_pages_in_node(nid, j,
 								zholes_size);
 
 		/*
-		 * Adjust realsize so that it accounts for how much memory
+		 * Adjust freesize so that it accounts for how much memory
 		 * is used by this zone for memmap. This affects the watermark
 		 * and per-cpu initialisations
 		 */
 		memmap_pages =
 			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
-		if (realsize >= memmap_pages) {
-			realsize -= memmap_pages;
+		if (freesize >= memmap_pages) {
+			freesize -= memmap_pages;
 			if (memmap_pages)
 				printk(KERN_DEBUG
 				       "  %s zone: %lu pages used for memmap\n",
 				       zone_names[j], memmap_pages);
 		} else
 			printk(KERN_WARNING
-				"  %s zone: %lu pages exceeds realsize %lu\n",
-				zone_names[j], memmap_pages, realsize);
+				"  %s zone: %lu pages exceeds freesize %lu\n",
+				zone_names[j], memmap_pages, freesize);
 
 		/* Account for reserved pages */
-		if (j == 0 && realsize > dma_reserve) {
-			realsize -= dma_reserve;
+		if (j == 0 && freesize > dma_reserve) {
+			freesize -= dma_reserve;
 			printk(KERN_DEBUG "  %s zone: %lu pages reserved\n",
 					zone_names[0], dma_reserve);
 		}
 
 		if (!is_highmem_idx(j))
-			nr_kernel_pages += realsize;
-		nr_all_pages += realsize;
+			nr_kernel_pages += freesize;
+		nr_all_pages += freesize;
 
 		zone->spanned_pages = size;
-		zone->present_pages = realsize;
+		zone->present_pages = freesize;
+		/*
+		 * Set an approximate value for lowmem here, it will be adjusted
+		 * when the bootmem allocator frees pages into the buddy system.
+		 * And all highmem pages will be managed by the buddy system.
+		 */
+		zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
 #ifdef CONFIG_NUMA
 		zone->node = nid;
-		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
+		zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
 						/ 100;
-		zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
+		zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
 		spin_lock_init(&zone->lock);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..e47d31c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -992,14 +992,16 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        high     %lu"
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
-		   "\n        present  %lu",
+		   "\n        present  %lu"
+		   "\n        managed  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
 		   zone->pages_scanned,
 		   zone->spanned_pages,
-		   zone->present_pages);
+		   zone->present_pages,
+		   zone->managed_pages);
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFT PATCH v1 2/5] mm: replace zone->present_pages with zone->managed_pages if appreciated
  2012-11-18 16:07         ` Jiang Liu
@ 2012-11-18 16:07           ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-18 16:07 UTC (permalink / raw)
  To: Andrew Morton, Wen Congyang, David Rientjes
  Cc: Jiang Liu, Maciej Rutecki, Chris Clayton, Rafael J . Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	Jianguo Wu, linux-mm, linux-kernel

Now that we have zone->managed_pages for "pages managed by the buddy system
in the zone", replace zone->present_pages with zone->managed_pages wherever
what the caller really wants is the number of pages managed by the buddy
system.
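
As one concrete instance of the conversions below (a sketch using the
existing names in __setup_per_zone_wmarks()), the per-zone share of
min_free_kbytes becomes proportional to buddy-managed pages rather than
to present pages:

	tmp = (u64)pages_min * zone->managed_pages;
	do_div(tmp, lowmem_pages);	/* lowmem_pages is now also summed
					   from managed_pages */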

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 mm/mempolicy.c  |    2 +-
 mm/page_alloc.c |   32 ++++++++++++++++----------------
 mm/vmscan.c     |   16 ++++++++--------
 mm/vmstat.c     |    2 +-
 4 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..8367070 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -147,7 +147,7 @@ static int is_valid_nodemask(const nodemask_t *nodemask)
 
 		for (k = 0; k <= policy_zone; k++) {
 			z = &NODE_DATA(nd)->node_zones[k];
-			if (z->present_pages > 0)
+			if (z->managed_pages > 0)
 				return 1;
 		}
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a41ee64..fe1cf48 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2791,7 +2791,7 @@ static unsigned int nr_free_zone_pages(int offset)
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
 
 	for_each_zone_zonelist(zone, z, zonelist, offset) {
-		unsigned long size = zone->present_pages;
+		unsigned long size = zone->managed_pages;
 		unsigned long high = high_wmark_pages(zone);
 		if (size > high)
 			sum += size - high;
@@ -2844,7 +2844,7 @@ void si_meminfo_node(struct sysinfo *val, int nid)
 	val->totalram = pgdat->node_present_pages;
 	val->freeram = node_page_state(nid, NR_FREE_PAGES);
 #ifdef CONFIG_HIGHMEM
-	val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].present_pages;
+	val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].managed_pages;
 	val->freehigh = zone_page_state(&pgdat->node_zones[ZONE_HIGHMEM],
 			NR_FREE_PAGES);
 #else
@@ -3883,7 +3883,7 @@ static int __meminit zone_batchsize(struct zone *zone)
 	 *
 	 * OK, so we don't know how big the cache is.  So guess.
 	 */
-	batch = zone->present_pages / 1024;
+	batch = zone->managed_pages / 1024;
 	if (batch * PAGE_SIZE > 512 * 1024)
 		batch = (512 * 1024) / PAGE_SIZE;
 	batch /= 4;		/* We effectively *= 4 below */
@@ -3967,7 +3967,7 @@ static void __meminit setup_zone_pageset(struct zone *zone)
 
 		if (percpu_pagelist_fraction)
 			setup_pagelist_highmark(pcp,
-				(zone->present_pages /
+				(zone->managed_pages /
 					percpu_pagelist_fraction));
 	}
 }
@@ -5077,8 +5077,8 @@ static void calculate_totalreserve_pages(void)
 			/* we treat the high watermark as reserved pages. */
 			max += high_wmark_pages(zone);
 
-			if (max > zone->present_pages)
-				max = zone->present_pages;
+			if (max > zone->managed_pages)
+				max = zone->managed_pages;
 			reserve_pages += max;
 			/*
 			 * Lowmem reserves are not available to
@@ -5110,7 +5110,7 @@ static void setup_per_zone_lowmem_reserve(void)
 	for_each_online_pgdat(pgdat) {
 		for (j = 0; j < MAX_NR_ZONES; j++) {
 			struct zone *zone = pgdat->node_zones + j;
-			unsigned long present_pages = zone->present_pages;
+			unsigned long managed_pages = zone->managed_pages;
 
 			zone->lowmem_reserve[j] = 0;
 
@@ -5124,9 +5124,9 @@ static void setup_per_zone_lowmem_reserve(void)
 					sysctl_lowmem_reserve_ratio[idx] = 1;
 
 				lower_zone = pgdat->node_zones + idx;
-				lower_zone->lowmem_reserve[j] = present_pages /
+				lower_zone->lowmem_reserve[j] = managed_pages /
 					sysctl_lowmem_reserve_ratio[idx];
-				present_pages += lower_zone->present_pages;
+				managed_pages += lower_zone->managed_pages;
 			}
 		}
 	}
@@ -5145,14 +5145,14 @@ static void __setup_per_zone_wmarks(void)
 	/* Calculate total number of !ZONE_HIGHMEM pages */
 	for_each_zone(zone) {
 		if (!is_highmem(zone))
-			lowmem_pages += zone->present_pages;
+			lowmem_pages += zone->managed_pages;
 	}
 
 	for_each_zone(zone) {
 		u64 tmp;
 
 		spin_lock_irqsave(&zone->lock, flags);
-		tmp = (u64)pages_min * zone->present_pages;
+		tmp = (u64)pages_min * zone->managed_pages;
 		do_div(tmp, lowmem_pages);
 		if (is_highmem(zone)) {
 			/*
@@ -5166,7 +5166,7 @@ static void __setup_per_zone_wmarks(void)
 			 */
 			int min_pages;
 
-			min_pages = zone->present_pages / 1024;
+			min_pages = zone->managed_pages / 1024;
 			if (min_pages < SWAP_CLUSTER_MAX)
 				min_pages = SWAP_CLUSTER_MAX;
 			if (min_pages > 128)
@@ -5235,7 +5235,7 @@ static void __meminit calculate_zone_inactive_ratio(struct zone *zone)
 	unsigned int gb, ratio;
 
 	/* Zone size in gigabytes */
-	gb = zone->present_pages >> (30 - PAGE_SHIFT);
+	gb = zone->managed_pages >> (30 - PAGE_SHIFT);
 	if (gb)
 		ratio = int_sqrt(10 * gb);
 	else
@@ -5321,7 +5321,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
 		return rc;
 
 	for_each_zone(zone)
-		zone->min_unmapped_pages = (zone->present_pages *
+		zone->min_unmapped_pages = (zone->managed_pages *
 				sysctl_min_unmapped_ratio) / 100;
 	return 0;
 }
@@ -5337,7 +5337,7 @@ int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
 		return rc;
 
 	for_each_zone(zone)
-		zone->min_slab_pages = (zone->present_pages *
+		zone->min_slab_pages = (zone->managed_pages *
 				sysctl_min_slab_ratio) / 100;
 	return 0;
 }
@@ -5379,7 +5379,7 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
 	for_each_populated_zone(zone) {
 		for_each_possible_cpu(cpu) {
 			unsigned long  high;
-			high = zone->present_pages / percpu_pagelist_fraction;
+			high = zone->managed_pages / percpu_pagelist_fraction;
 			setup_pagelist_highmark(
 				per_cpu_ptr(zone->pageset, cpu), high);
 		}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..7240e89 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1935,7 +1935,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 	 * a reasonable chance of completing and allocating the page
 	 */
 	balance_gap = min(low_wmark_pages(zone),
-		(zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+		(zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
 			KSWAPD_ZONE_BALANCE_GAP_RATIO);
 	watermark = high_wmark_pages(zone) + balance_gap + (2UL << sc->order);
 	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
@@ -2416,14 +2416,14 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 						int classzone_idx)
 {
-	unsigned long present_pages = 0;
+	unsigned long managed_pages = 0;
 	int i;
 
 	for (i = 0; i <= classzone_idx; i++)
-		present_pages += pgdat->node_zones[i].present_pages;
+		managed_pages += pgdat->node_zones[i].managed_pages;
 
 	/* A special case here: if zone has no page, we think it's balanced */
-	return balanced_pages >= (present_pages >> 2);
+	return balanced_pages >= (managed_pages >> 2);
 }
 
 /*
@@ -2471,7 +2471,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 		 * is to sleep
 		 */
 		if (zone->all_unreclaimable) {
-			balanced += zone->present_pages;
+			balanced += zone->managed_pages;
 			continue;
 		}
 
@@ -2479,7 +2479,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 							i, 0))
 			all_zones_ok = false;
 		else
-			balanced += zone->present_pages;
+			balanced += zone->managed_pages;
 	}
 
 	/*
@@ -2645,7 +2645,7 @@ loop_again:
 			 * of the zone, whichever is smaller.
 			 */
 			balance_gap = min(low_wmark_pages(zone),
-				(zone->present_pages +
+				(zone->managed_pages +
 					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
 				KSWAPD_ZONE_BALANCE_GAP_RATIO);
 			/*
@@ -2712,7 +2712,7 @@ loop_again:
 				 */
 				zone_clear_flag(zone, ZONE_CONGESTED);
 				if (i <= *classzone_idx)
-					balanced += zone->present_pages;
+					balanced += zone->managed_pages;
 			}
 
 		}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e47d31c..b2925d1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -142,7 +142,7 @@ int calculate_normal_threshold(struct zone *zone)
 	 * 125		1024		10	16-32 GB	9
 	 */
 
-	mem = zone->present_pages >> (27 - PAGE_SHIFT);
+	mem = zone->managed_pages >> (27 - PAGE_SHIFT);
 
 	threshold = 2 * fls(num_online_cpus()) * (1 + fls(mem));
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFT PATCH v1 3/5] mm: set zone->present_pages to number of existing pages in the zone
  2012-11-18 16:07         ` Jiang Liu
@ 2012-11-18 16:07           ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-18 16:07 UTC (permalink / raw)
  To: Andrew Morton, Wen Congyang, David Rientjes
  Cc: Jiang Liu, Maciej Rutecki, Chris Clayton, Rafael J . Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	Jianguo Wu, linux-mm, linux-kernel

Now that all users wanting "number of pages managed by the buddy system"
have been converted to use zone->managed_pages, set zone->present_pages
to what it really should be:
	present_pages = spanned_pages - absent_pages;

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fe1cf48..5b327d7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4494,7 +4494,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		nr_all_pages += freesize;
 
 		zone->spanned_pages = size;
-		zone->present_pages = freesize;
+		zone->present_pages = realsize;
 		/*
 		 * Set an approximate value for lowmem here, it will be adjusted
 		 * when the bootmem allocator frees pages into the buddy system.
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-18 16:07         ` Jiang Liu
@ 2012-11-18 16:07           ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-18 16:07 UTC (permalink / raw)
  To: Andrew Morton, Wen Congyang, David Rientjes
  Cc: Jiang Liu, Maciej Rutecki, Chris Clayton, Rafael J . Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	Jianguo Wu, linux-mm, linux-kernel

If SPARSEMEM is enabled, the kernel won't build page structures for
non-existing pages (holes) within a zone, so provide a more accurate
estimate of the pages occupied by the memmap when there are big holes
within the zone.

Also, the memmap for highmem zones is allocated from lowmem, so charge
nr_kernel_pages for it.
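
A worked example of the estimate (with assumed numbers: 4 KiB pages and a
32-byte struct page, neither taken from this patch): a zone spanning
1,048,576 pfns of which only 262,144 are present would previously be
charged

	PAGE_ALIGN(1048576 * 32) >> PAGE_SHIFT = 8192 pages

for its memmap, while with SPARSEMEM and this patch (since spanned_pages >
present_pages + present_pages/16) it is charged only

	PAGE_ALIGN(262144 * 32) >> PAGE_SHIFT = 2048 pages.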

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 mm/page_alloc.c |   22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b327d7..eb25679 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4435,6 +4435,22 @@ void __init set_pageblock_order(void)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
+static unsigned long calc_memmap_size(unsigned long spanned_pages,
+				      unsigned long present_pages)
+{
+	unsigned long pages = spanned_pages;
+
+	/*
+	 * Provide a more accurate estimation if there are big holes within
+	 * the zone and SPARSEMEM is in use.
+	 */
+	if (spanned_pages > present_pages + (present_pages >> 4) &&
+	    IS_ENABLED(CONFIG_SPARSEMEM))
+		pages = present_pages;
+
+	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
+}
+
 /*
  * Set up the zone data structures:
  *   - mark all pages reserved
@@ -4469,8 +4485,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		 * is used by this zone for memmap. This affects the watermark
 		 * and per-cpu initialisations
 		 */
-		memmap_pages =
-			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
+		memmap_pages = calc_memmap_size(size, realsize);
 		if (freesize >= memmap_pages) {
 			freesize -= memmap_pages;
 			if (memmap_pages)
@@ -4491,6 +4506,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		if (!is_highmem_idx(j))
 			nr_kernel_pages += freesize;
+		/* Charge for highmem memmap if there are enough kernel pages */
+		else if (nr_kernel_pages > memmap_pages * 2)
+			nr_kernel_pages -= memmap_pages;
 		nr_all_pages += freesize;
 
 		zone->spanned_pages = size;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFT PATCH v1 5/5] mm: increase totalram_pages when free pages allocated by bootmem allocator
  2012-11-18 16:07         ` Jiang Liu
@ 2012-11-18 16:07           ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-18 16:07 UTC (permalink / raw)
  To: Andrew Morton, Wen Congyang, David Rientjes
  Cc: Jiang Liu, Maciej Rutecki, Chris Clayton, Rafael J . Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	Jianguo Wu, linux-mm, linux-kernel

Function put_page_bootmem() is used to free pages allocated by the bootmem
allocator into the buddy system, so it should increase totalram_pages when
freeing pages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 mm/memory_hotplug.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index acd3993..6b07e0c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -119,6 +119,7 @@ void __ref put_page_bootmem(struct page *page)
 		mutex_lock(&ppb_lock);
 		__free_pages_bootmem(page, 0);
 		mutex_unlock(&ppb_lock);
+		totalram_pages++;
 	}
 
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
  2012-11-18 16:07         ` Jiang Liu
@ 2012-11-18 20:36           ` Chris Clayton
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Clayton @ 2012-11-18 20:36 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

Hi Gerry.

On 11/18/12 16:07, Jiang Liu wrote:
> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
> resolve an issue caused by inaccurate zone->present_pages, but that
> fix is incomplete and causes regressions with HIGHMEM. And it has been
> reverted by commit
> 5576646 revert "mm: fix-up zone present pages"
>
> This is a follow-up patchset for the issue above. It introduces a
> new field named "managed_pages" to struct zone, which counts pages
> managed by the buddy system from the zone. And zone->present_pages
> is used to count pages existing in the zone, which is
> 	spanned_pages - absent_pages.
>
> In that way, zone->present_pages will be kept consistent with
> pgdat->node_present_pages, which is the sum of zone->present_pages.
>
> This patchset has only been tested on x86_64 with nobootmem.c. So we need
> help to test this patchset on machines:
> 1) use bootmem.c
> 2) have highmem
>
> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>

I've applied the five patches to Linus' 3.7.0-rc6 and can confirm that 
the kernel allows my system to resume from a suspend to disc. Although 
my laptop is 64 bit, I run a 32 bit kernel with HIGHMEM (I have 8GB RAM):

[chris:~/kernel/tmp/linux-3.7-rc6-resume]$ grep -E HIGHMEM\|X86_32 .config
CONFIG_X86_32=y
CONFIG_X86_32_SMP=y
CONFIG_X86_32_LAZY_GS=y
# CONFIG_X86_32_IRIS is not set
# CONFIG_NOHIGHMEM is not set
# CONFIG_HIGHMEM4G is not set
CONFIG_HIGHMEM64G=y
CONFIG_HIGHMEM=y

I can also say that a quick browse of the output of dmesg shows nothing 
out of the ordinary. I have insufficient knowledge to comment on the 
patches, but I will run the kernel over the next few days and report 
back later in the week.

Chris

> Any comments and helps are welcomed!
>
> Jiang Liu (5):
>    mm: introduce new field "managed_pages" to struct zone
>    mm: replace zone->present_pages with zone->managed_pages if
>      appreciated
>    mm: set zone->present_pages to number of existing pages in the zone
>    mm: provide more accurate estimation of pages occupied by memmap
>    mm: increase totalram_pages when free pages allocated by bootmem
>      allocator
>
>   include/linux/mmzone.h |    1 +
>   mm/bootmem.c           |   14 ++++++++
>   mm/memory_hotplug.c    |    6 ++++
>   mm/mempolicy.c         |    2 +-
>   mm/nobootmem.c         |   15 ++++++++
>   mm/page_alloc.c        |   89 +++++++++++++++++++++++++++++++-----------------
>   mm/vmscan.c            |   16 ++++-----
>   mm/vmstat.c            |    8 +++--
>   8 files changed, 108 insertions(+), 43 deletions(-)
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
  2012-11-18 16:07         ` Jiang Liu
@ 2012-11-19 21:36           ` Maciej Rutecki
  -1 siblings, 0 replies; 98+ messages in thread
From: Maciej Rutecki @ 2012-11-19 21:36 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Wen Congyang, David Rientjes, Jiang Liu,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Sunday, 18 November 2012 at 17:07:25 Jiang Liu wrote:
> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
> resolve an issue caused by inaccurate zone->present_pages, but that
> fix is incomplete and causes regressions with HIGHMEM. And it has been
> reverted by commit
> 5576646 revert "mm: fix-up zone present pages"
> 
> This is a follow-up patchset for the issue above. It introduces a
> new field named "managed_pages" to struct zone, which counts pages
> managed by the buddy system from the zone. And zone->present_pages
> is used to count pages existing in the zone, which is
> 	spanned_pages - absent_pages.
> 
> In that way, zone->present_pages will be kept consistent with
> pgdat->node_present_pages, which is the sum of zone->present_pages.
> 
> This patchset has only been tested on x86_64 with nobootmem.c. So we need
> help to test this patchset on machines:
> 1) use bootmem.c
> 2) have highmem
> 
> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> 
> Any comments and helps are welcomed!
> 
> Jiang Liu (5):
>   mm: introduce new field "managed_pages" to struct zone
>   mm: replace zone->present_pages with zone->managed_pages if
>     appreciated
>   mm: set zone->present_pages to number of existing pages in the zone
>   mm: provide more accurate estimation of pages occupied by memmap
>   mm: increase totalram_pages when free pages allocated by bootmem
>     allocator
> 
>  include/linux/mmzone.h |    1 +
>  mm/bootmem.c           |   14 ++++++++
>  mm/memory_hotplug.c    |    6 ++++
>  mm/mempolicy.c         |    2 +-
>  mm/nobootmem.c         |   15 ++++++++
>  mm/page_alloc.c        |   89 +++++++++++++++++++++++++++++++-----------------
>  mm/vmscan.c            |   16 ++++-----
>  mm/vmstat.c            |    8 +++--
>  8 files changed, 108 insertions(+), 43 deletions(-)
Tested on 32-bit Linux with HIGHMEM. Seems to be OK.

Regards
-- 
Maciej Rutecki
http://www.mrutecki.pl

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone
  2012-11-18 16:07           ` Jiang Liu
@ 2012-11-19 23:38             ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-19 23:38 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Mon, 19 Nov 2012 00:07:26 +0800
Jiang Liu <liuj97@gmail.com> wrote:

> Currently a zone's present_pages is calculated as below, which is
> inaccurate and may cause trouble to memory hotplug.
> 	spanned_pages - absent_pages - memmap_pages - dma_reserve.
> 
> During fixing bugs caused by inaccurate zone->present_pages, we found
> zone->present_pages has been abused. The field zone->present_pages
> may have different meanings in different contexts:
> 1) pages existing in a zone.
> 2) pages managed by the buddy system.
> 
> For more discussions about the issue, please refer to:
> http://lkml.org/lkml/2012/11/5/866
> https://patchwork.kernel.org/patch/1346751/
> 
> This patchset tries to introduce a new field named "managed_pages" to
> struct zone, which counts "pages managed by the buddy system". And
> revert zone->present_pages to count "physical pages existing in a zone",
> which also keeps it consistent with pgdat->node_present_pages.
> 
> We will set an initial value for zone->managed_pages in function
> free_area_init_core(), and it will be adjusted later if the initial value is
> inaccurate.
> 
> For DMA/normal zones, the initial value is set to:
> 	(spanned_pages - absent_pages - memmap_pages - dma_reserve)
> Later zone->managed_pages will be adjusted to the accurate value when
> the bootmem allocator frees all free pages to the buddy system in
> function free_all_bootmem_node() and free_all_bootmem().
> 
> The bootmem allocator doesn't touch highmem pages, so highmem zones'
> managed_pages is set to the accurate value "spanned_pages - absent_pages"
> in function free_area_init_core() and won't be updated anymore.
> 
> This patch also adds a new field "managed_pages" to /proc/zoneinfo
> and sysrq showmem.

hoo boy, what a mess we made.  I'd like to merge these patches and get
them into -next for some testing, but -next has stopped for a couple of
weeks.  Oh well, let's see what can be done.

> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -480,6 +480,7 @@ struct zone {
>  	 */
>  	unsigned long		spanned_pages;	/* total size, including holes */
>  	unsigned long		present_pages;	/* amount of memory (excluding holes) */
> +	unsigned long		managed_pages;	/* pages managed by the Buddy */

Can you please add a nice big comment over these three fields which
fully describes what they do and the relationship between them? 
Basically that stuff that's in the changelog.

Also, the existing comment tells us that spanned_pages and
present_pages are protected by span_seqlock but has not been updated to
describe the locking (if any) for managed_pages.

>  	/*
>  	 * rarely used fields:
> diff --git a/mm/bootmem.c b/mm/bootmem.c
> index f468185..a813e5b 100644
> --- a/mm/bootmem.c
> +++ b/mm/bootmem.c
> @@ -229,6 +229,15 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
>  	return count;
>  }
>  
> +static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
> +{
> +	struct zone *z;
> +
> +	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> +		if (!is_highmem(z))

Needs a comment explaining why we skip the highmem zone, please.

> +			z->managed_pages = 0;
> +}
> +
>
> ...
>
> @@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info,  struct page *page,
>  void __ref put_page_bootmem(struct page *page)
>  {
>  	unsigned long type;
> +	static DEFINE_MUTEX(ppb_lock);
>  
>  	type = (unsigned long) page->lru.next;
>  	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
> @@ -115,7 +116,9 @@ void __ref put_page_bootmem(struct page *page)
>  		ClearPagePrivate(page);
>  		set_page_private(page, 0);
>  		INIT_LIST_HEAD(&page->lru);
> +		mutex_lock(&ppb_lock);
>  		__free_pages_bootmem(page, 0);
> +		mutex_unlock(&ppb_lock);

The mutex is odd.  Nothing in the changelog, no code comment. 
__free_pages_bootmem() is called from a lot of places but only this one
has locking.  I'm madly guessing that the lock is here to handle two or
more concurrent memory hotpluggings, but I shouldn't need to guess!!

>  	}
>  
>  }
>
> ...
>


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-18 16:07           ` Jiang Liu
@ 2012-11-19 23:42             ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-19 23:42 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Mon, 19 Nov 2012 00:07:29 +0800
Jiang Liu <liuj97@gmail.com> wrote:

> If SPARSEMEM is enabled, it won't build page structures for
> non-existing pages (holes) within a zone, so provide a more accurate
> estimation of pages occupied by memmap if there are big holes within
> the zone.
> 
> And pages for highmem zones' memmap will be allocated from lowmem,
> so charge nr_kernel_pages for that.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4435,6 +4435,22 @@ void __init set_pageblock_order(void)
>  
>  #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>  
> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> +				      unsigned long present_pages)
> +{
> +	unsigned long pages = spanned_pages;
> +
> +	/*
> +	 * Provide a more accurate estimation if there are big holes within
> +	 * the zone and SPARSEMEM is in use.
> +	 */
> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
> +	    IS_ENABLED(CONFIG_SPARSEMEM))
> +		pages = present_pages;
> +
> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> +}

Please explain the ">> 4" heuristic more completely - preferably in both
the changelog and code comments.  Why can't we calculate this
requirement exactly?  That might require a second pass, but that's OK for
code like this?
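For readers following the thread, here is a small userspace sketch of what the
heuristic computes today; struct fake_page, the page size and the zone numbers
below are made-up stand-ins, and the CONFIG_SPARSEMEM half of the condition is
assumed to be true:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

struct fake_page { char pad[64]; };	/* stand-in for sizeof(struct page) */

static unsigned long calc_memmap_size(unsigned long spanned_pages,
				      unsigned long present_pages)
{
	unsigned long pages = spanned_pages;

	/* "big holes" means spanned exceeds present by more than 1/16 */
	if (spanned_pages > present_pages + (present_pages >> 4))
		pages = present_pages;

	return PAGE_ALIGN(pages * sizeof(struct fake_page)) >> PAGE_SHIFT;
}

int main(void)
{
	/* 1M pages spanned, 768K present: 25% holes is more than 1/16, so
	 * the estimate is based on present_pages and this prints 12288. */
	printf("%lu pages of memmap\n",
	       calc_memmap_size(1UL << 20, 768 * 1024));
	return 0;
}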


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
  2012-11-18 16:07         ` Jiang Liu
@ 2012-11-20  2:13           ` Jaegeuk Hanse
  -1 siblings, 0 replies; 98+ messages in thread
From: Jaegeuk Hanse @ 2012-11-20  2:13 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Chris Clayton, Rafael J . Wysocki, Mel Gorman,
	Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu,
	linux-mm, linux-kernel

On 11/19/2012 12:07 AM, Jiang Liu wrote:
> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
> resolve an issue caused by inaccurate zone->present_pages, but that
> fix is incomplete and causes regressions with HIGHMEM. And it has been
> reverted by commit
> 5576646 revert "mm: fix-up zone present pages"
>
> This is a follow-up patchset for the issue above. It introduces a
> new field named "managed_pages" to struct zone, which counts pages
> managed by the buddy system from the zone. And zone->present_pages
> is used to count pages existing in the zone, which is
> 	spanned_pages - absent_pages.
>
> In that way, zone->present_pages will be kept consistent with
> pgdat->node_present_pages, which is the sum of zone->present_pages.
>
> This patchset has only been tested on x86_64 with nobootmem.c. So we need
> help to test this patchset on machines:
> 1) use bootmem.c

Does only x86_32 use bootmem.c instead of nobootmem.c? How can I confirm that?
> 2) have highmem
>
> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> Any comments and helps are welcomed!
>
> Jiang Liu (5):
>    mm: introduce new field "managed_pages" to struct zone
>    mm: replace zone->present_pages with zone->managed_pages if
>      appreciated
>    mm: set zone->present_pages to number of existing pages in the zone
>    mm: provide more accurate estimation of pages occupied by memmap
>    mm: increase totalram_pages when free pages allocated by bootmem
>      allocator
>
>   include/linux/mmzone.h |    1 +
>   mm/bootmem.c           |   14 ++++++++
>   mm/memory_hotplug.c    |    6 ++++
>   mm/mempolicy.c         |    2 +-
>   mm/nobootmem.c         |   15 ++++++++
>   mm/page_alloc.c        |   89 +++++++++++++++++++++++++++++++-----------------
>   mm/vmscan.c            |   16 ++++-----
>   mm/vmstat.c            |    8 +++--
>   8 files changed, 108 insertions(+), 43 deletions(-)
>


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-18 16:07           ` Jiang Liu
@ 2012-11-20  2:15             ` Jaegeuk Hanse
  -1 siblings, 0 replies; 98+ messages in thread
From: Jaegeuk Hanse @ 2012-11-20  2:15 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Chris Clayton, Rafael J . Wysocki, Mel Gorman,
	Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu,
	linux-mm, linux-kernel

On 11/19/2012 12:07 AM, Jiang Liu wrote:
> If SPARSEMEM is enabled, it won't build page structures for
> non-existing pages (holes) within a zone, so provide a more accurate
> estimation of pages occupied by memmap if there are big holes within
> the zone.
>
> And pages for highmem zones' memmap will be allocated from lowmem,
> so charge nr_kernel_pages for that.
>
> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
> ---
>   mm/page_alloc.c |   22 ++++++++++++++++++++--
>   1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b327d7..eb25679 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4435,6 +4435,22 @@ void __init set_pageblock_order(void)
>   
>   #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>   
> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> +				      unsigned long present_pages)
> +{
> +	unsigned long pages = spanned_pages;
> +
> +	/*
> +	 * Provide a more accurate estimation if there are big holes within
> +	 * the zone and SPARSEMEM is in use.
> +	 */
> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
> +	    IS_ENABLED(CONFIG_SPARSEMEM))
> +		pages = present_pages;
> +
> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> +}
> +
>   /*
>    * Set up the zone data structures:
>    *   - mark all pages reserved
> @@ -4469,8 +4485,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>   		 * is used by this zone for memmap. This affects the watermark
>   		 * and per-cpu initialisations
>   		 */
> -		memmap_pages =
> -			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
> +		memmap_pages = calc_memmap_size(size, realsize);
>   		if (freesize >= memmap_pages) {
>   			freesize -= memmap_pages;
>   			if (memmap_pages)
> @@ -4491,6 +4506,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>   
>   		if (!is_highmem_idx(j))
>   			nr_kernel_pages += freesize;
> +		/* Charge for highmem memmap if there are enough kernel pages */
> +		else if (nr_kernel_pages > memmap_pages * 2)
> +			nr_kernel_pages -= memmap_pages;

Since this is in the else branch, what happens if nr_kernel_pages is still
equal to 0 at that point during early init?
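To illustrate the concern, a tiny standalone sketch; only the comparison
mirrors the patch, the values are hypothetical:

#include <stdio.h>

static unsigned long charge_highmem_memmap(unsigned long nr_kernel_pages,
					   unsigned long memmap_pages)
{
	if (nr_kernel_pages > memmap_pages * 2)
		nr_kernel_pages -= memmap_pages;
	return nr_kernel_pages;
}

int main(void)
{
	/* nothing accumulated yet: the guard skips the charge, stays 0 */
	printf("%lu\n", charge_highmem_memmap(0, 4096));
	/* plenty of lowmem accounted: the charge is applied, prints 95904 */
	printf("%lu\n", charge_highmem_memmap(100000, 4096));
	return 0;
}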

>   		nr_all_pages += freesize;
>   
>   		zone->spanned_pages = size;


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
  2012-11-20  2:13           ` Jaegeuk Hanse
@ 2012-11-20  2:43             ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-20  2:43 UTC (permalink / raw)
  To: Jaegeuk Hanse
  Cc: Jiang Liu, Andrew Morton, Wen Congyang, David Rientjes,
	Maciej Rutecki, Chris Clayton, Rafael J . Wysocki, Mel Gorman,
	Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu,
	linux-mm, linux-kernel

On 2012-11-20 10:13, Jaegeuk Hanse wrote:
> On 11/19/2012 12:07 AM, Jiang Liu wrote:
>> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
>> resolve an issue caused by inaccurate zone->present_pages, but that
>> fix is incomplete and causes regressions with HIGHMEM. And it has been
>> reverted by commit
>> 5576646 revert "mm: fix-up zone present pages"
>>
>> This is a follow-up patchset for the issue above. It introduces a
>> new field named "managed_pages" to struct zone, which counts pages
>> managed by the buddy system from the zone. And zone->present_pages
>> is used to count pages existing in the zone, which is
>>     spanned_pages - absent_pages.
>>
>> In that way, zone->present_pages will be kept consistent with
>> pgdat->node_present_pages, which is the sum of zone->present_pages.
>>
>> This patchset has only been tested on x86_64 with nobootmem.c. So we need
>> help to test this patchset on machines:
>> 1) use bootmem.c
> 
> Does only x86_32 use bootmem.c instead of nobootmem.c? How can I confirm that?
Hi Jaegeuk,
	Thanks for reviewing this patch set.
	Currently x86/x86_64/Sparc have been converted to use nobootmem.c,
and other arches still use bootmem.c. So we need to test it on other arches,
such as ARM etc. Yesterday we tested this patchset on an Itanium platform,
so bootmem.c should work as expected too.
	Thanks!
	Gerry



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
  2012-11-20  2:43             ` Jiang Liu
@ 2012-11-20  3:20               ` Jaegeuk Hanse
  -1 siblings, 0 replies; 98+ messages in thread
From: Jaegeuk Hanse @ 2012-11-20  3:20 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Jiang Liu, Andrew Morton, Wen Congyang, David Rientjes,
	Maciej Rutecki, Chris Clayton, Rafael J . Wysocki, Mel Gorman,
	Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu,
	linux-mm, linux-kernel

On 11/20/2012 10:43 AM, Jiang Liu wrote:
> On 2012-11-20 10:13, Jaegeuk Hanse wrote:
>> On 11/19/2012 12:07 AM, Jiang Liu wrote:
>>> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
>>> resolve an issue caused by inaccurate zone->present_pages, but that
>>> fix is incomplete and causes regressions with HIGHMEM. And it has been
>>> reverted by commit
>>> 5576646 revert "mm: fix-up zone present pages"
>>>
>>> This is a follow-up patchset for the issue above. It introduces a
>>> new field named "managed_pages" to struct zone, which counts pages
>>> managed by the buddy system from the zone. And zone->present_pages
>>> is used to count pages existing in the zone, which is
>>>      spanned_pages - absent_pages.
>>>
>>> In that way, zone->present_pages will be kept consistent with
>>> pgdat->node_present_pages, which is the sum of zone->present_pages.
>>>
>>> This patchset has only been tested on x86_64 with nobootmem.c. So we need
>>> help to test this patchset on machines:
>>> 1) use bootmem.c
>> Does only x86_32 use bootmem.c instead of nobootmem.c? How can I confirm that?
> Hi Jaegeuk,
> 	Thanks for reviewing this patch set.
> 	Currently x86/x86_64/Sparc have been converted to use nobootmem.c,
> and other arches still use bootmem.c. So we need to test it on other arches,
> such as ARM etc. Yesterday we tested this patchset on an Itanium platform,
> so bootmem.c should work as expected too.

Hi Jiang,

Are there any code changes needed in x86/x86_64 to match the nobootmem.c logic? 
I mean, if I remove
config NO_BOOTMEM
        def_bool y
in arch/x86/Kconfig, can x86/x86_64 still take advantage of bootmem.c 
or not?

Regards,
Jaegeuk

> 	Thanks!
> 	Gerry
>
>


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
  2012-11-20  3:20               ` Jaegeuk Hanse
@ 2012-11-20  3:46                 ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-20  3:46 UTC (permalink / raw)
  To: Jaegeuk Hanse
  Cc: Jiang Liu, Andrew Morton, Wen Congyang, David Rientjes,
	Maciej Rutecki, Chris Clayton, Rafael J . Wysocki, Mel Gorman,
	Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu,
	linux-mm, linux-kernel

On 2012-11-20 11:20, Jaegeuk Hanse wrote:
> On 11/20/2012 10:43 AM, Jiang Liu wrote:
>> On 2012-11-20 10:13, Jaegeuk Hanse wrote:
>>> On 11/19/2012 12:07 AM, Jiang Liu wrote:
>> Hi Jaegeuk,
>>     Thanks for reviewing this patch set.
>>     Currently x86/x86_64/Sparc have been converted to use nobootmem.c,
>> and other arches still use bootmem.c. So we need to test it on other arches,
>> such as ARM etc. Yesterday we tested this patchset on an Itanium platform,
>> so bootmem.c should work as expected too.
> 
> Hi Jiang,
> 
> Are there any code changes needed in x86/x86_64 to match the nobootmem.c logic? I mean, if I remove
> config NO_BOOTMEM
>        def_bool y
> in arch/x86/Kconfig, can x86/x86_64 still take advantage of bootmem.c or not?
There were code changes in the x86/x86_64 arch directories to convert from bootmem.c
to nobootmem.c, so you can't simply comment out the NO_BOOTMEM Kconfig item.
There are also differences between the APIs of bootmem.c and nobootmem.c.
For example, free_low_memory_core_early() is only provided by nobootmem.c
and init_bootmem_node() is only provided by bootmem.c.
Thanks!


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone
  2012-11-19 23:38             ` Andrew Morton
@ 2012-11-20 14:56               ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-20 14:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On 11/20/2012 07:38 AM, Andrew Morton wrote:
> On Mon, 19 Nov 2012 00:07:26 +0800
> Jiang Liu <liuj97@gmail.com> wrote:
> 
>> Currently a zone's present_pages is calculated as below, which is
>> inaccurate and may cause trouble to memory hotplug.
>> 	spanned_pages - absent_pages - memmap_pages - dma_reserve.
>>
>> During fixing bugs caused by inaccurate zone->present_pages, we found
>> zone->present_pages has been abused. The field zone->present_pages
>> may have different meanings in different contexts:
>> 1) pages existing in a zone.
>> 2) pages managed by the buddy system.
>>
>> For more discussions about the issue, please refer to:
>> http://lkml.org/lkml/2012/11/5/866
>> https://patchwork.kernel.org/patch/1346751/
>>
>> This patchset tries to introduce a new field named "managed_pages" to
>> struct zone, which counts "pages managed by the buddy system". And
>> revert zone->present_pages to count "physical pages existing in a zone",
>> which also keeps it consistent with pgdat->node_present_pages.
>>
>> We will set an initial value for zone->managed_pages in function
>> free_area_init_core(), and it will be adjusted later if the initial value is
>> inaccurate.
>>
>> For DMA/normal zones, the initial value is set to:
>> 	(spanned_pages - absent_pages - memmap_pages - dma_reserve)
>> Later zone->managed_pages will be adjusted to the accurate value when
>> the bootmem allocator frees all free pages to the buddy system in
>> function free_all_bootmem_node() and free_all_bootmem().
>>
>> The bootmem allocator doesn't touch highmem pages, so highmem zones'
>> managed_pages is set to the accurate value "spanned_pages - absent_pages"
>> in function free_area_init_core() and won't be updated anymore.
>>
>> This patch also adds a new field "managed_pages" to /proc/zoneinfo
>> and sysrq showmem.
> 
> hoo boy, what a mess we made.  I'd like to merge these patches and get
> them into -next for some testing, but -next has stopped for a couple of
> weeks.  Oh well, let's see what can be done.
Hi Andrew,
	Really sorry for the delay. Within the last few weeks, I could only
find time after work hours or at weekends for programming :(

>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -480,6 +480,7 @@ struct zone {
>>  	 */
>>  	unsigned long		spanned_pages;	/* total size, including holes */
>>  	unsigned long		present_pages;	/* amount of memory (excluding holes) */
>> +	unsigned long		managed_pages;	/* pages managed by the Buddy */
> 
> Can you please add a nice big comment over these three fields which
> fully describes what they do and the relationship between them? 
> Basically that stuff that's in the changelog.
> 
> Also, the existing comment tells us that spanned_pages and
> present_pages are protected by span_seqlock but has not been updated to
> describe the locking (if any) for managed_pages.
How about this?

        /*
         * spanned_pages is the total pages spanned by the zone, including
         * holes, which is calculated as:
         *      spanned_pages = zone_end_pfn - zone_start_pfn;
         *
         * present_pages is physical pages existing within the zone, which
         * is calculated as:
         *      present_pages = spanned_pages - absent_pages(pages in holes);
         *
         * managed_pages is present pages managed by the buddy system, which
         * is calculated as (reserved_pages includes pages allocated by the
         * bootmem allocator):
         *      managed_pages = present_pages - reserved_pages;
         *
         * So present_pages may be used by memory hotplug or memory power
         * management logic to figure out unmanaged pages by checking
         * (present_pages - managed_pages). And managed_pages should be used
         * by page allocator and vm scanner to calculate all kinds of watermarks
         * and thresholds.
         *
         * Lock Rules:
         *
         * zone_start_pfn, spanned_pages are protected by span_seqlock.
         * It is a seqlock because it has to be read outside of zone->lock,
         * and it is done in the main allocator path.  But, it is written
         * quite infrequently.
         *
         * The span_seq lock is declared along with zone->lock because it is
         * frequently read in proximity to zone->lock.  It's good to
         * give them a chance of being in the same cacheline.
         *
         * Writing access to present_pages and managed_pages at runtime should
         * be protected by lock_memory_hotplug()/unlock_memory_hotplug().
         * Any reader who can't tolerate drift of present_pages and
         * managed_pages should hold memory hotplug lock to get a stable value.
         */
        unsigned long           spanned_pages;  
        unsigned long           present_pages;  
        unsigned long           managed_pages;  
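To make the relationship concrete, a tiny standalone example with made-up
numbers (purely illustrative, not taken from any real zone):

#include <stdio.h>

int main(void)
{
	/* hypothetical zone layout, in page frame numbers */
	unsigned long zone_start_pfn = 0x100000;
	unsigned long zone_end_pfn   = 0x140000;
	unsigned long absent_pages   = 0x8000;	/* pages in holes */
	unsigned long reserved_pages = 0x2000;	/* incl. bootmem allocations */

	unsigned long spanned_pages = zone_end_pfn - zone_start_pfn;
	unsigned long present_pages = spanned_pages - absent_pages;
	unsigned long managed_pages = present_pages - reserved_pages;

	/* prints: spanned=262144 present=229376 managed=221184 */
	printf("spanned=%lu present=%lu managed=%lu\n",
	       spanned_pages, present_pages, managed_pages);
	return 0;
}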


> 
>>  	/*
>>  	 * rarely used fields:
>> diff --git a/mm/bootmem.c b/mm/bootmem.c
>> index f468185..a813e5b 100644
>> --- a/mm/bootmem.c
>> +++ b/mm/bootmem.c
>> @@ -229,6 +229,15 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
>>  	return count;
>>  }
>>  
>> +static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
>> +{
>> +	struct zone *z;
>> +
>> +	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
>> +		if (!is_highmem(z))
> 
> Needs a comment explaining why we skip the highmem zone, please.
How about this?
        /*
         * In free_area_init_core(), highmem zone's managed_pages is set to
         * present_pages, and bootmem allocator doesn't allocate from highmem
         * zones. So there's no need to recalculate managed_pages because all
         * highmem pages will be managed by the buddy system. Here highmem
         * zone also includes highmem movable zone.
         */


>> +			z->managed_pages = 0;
>> +}
>> +
>>
>> ...
>>
>> @@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info,  struct page *page,
>>  void __ref put_page_bootmem(struct page *page)
>>  {
>>  	unsigned long type;
>> +	static DEFINE_MUTEX(ppb_lock);
>>  
>>  	type = (unsigned long) page->lru.next;
>>  	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
>> @@ -115,7 +116,9 @@ void __ref put_page_bootmem(struct page *page)
>>  		ClearPagePrivate(page);
>>  		set_page_private(page, 0);
>>  		INIT_LIST_HEAD(&page->lru);
>> +		mutex_lock(&ppb_lock);
>>  		__free_pages_bootmem(page, 0);
>> +		mutex_unlock(&ppb_lock);
> 
> The mutex is odd.  Nothing in the changelog, no code comment. 
> __free_pages_bootmem() is called from a lot of places but only this one
> has locking.  I'm madly guessing that the lock is here to handle two or
> more concurrent memory hotpluggings, but I shouldn't need to guess!!
Actually I'm a little hesitate whether we should add a lock here.

All callers of __free_pages_bootmem() other than put_page_bootmem() should
only be used at startup time. And currently the only caller of put_page_bootmem()
has already been protected by pgdat_resize_lock(pgdat, &flags). So there's
no real need for lock, just defensive.

I'm not sure which is the best solution here.
1) Add a comment to __free_pages_bootmem() stating that callers must
   serialize themselves.
2) Use a dedicated lock to serialize updates to zone->managed_pages (a rough
   sketch follows below); this needs modifications to page_alloc.c and
   memory_hotplug.c.
3) The solution above, serializing in put_page_bootmem().
What are your suggestions here?
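
Just to make option 2 concrete, a minimal sketch (the lock and helper names
below are made up, they are not existing kernel symbols):

static DEFINE_SPINLOCK(managed_pages_lock);

void adjust_zone_managed_pages(struct zone *zone, long delta)
{
	spin_lock(&managed_pages_lock);
	zone->managed_pages += delta;
	spin_unlock(&managed_pages_lock);
}

Then __free_pages_bootmem() and online_pages()/offline_pages() would update
zone->managed_pages only through this helper instead of touching the field
directly.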

Thanks
Gerry


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-19 23:42             ` Andrew Morton
@ 2012-11-20 15:18               ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-20 15:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On 11/20/2012 07:42 AM, Andrew Morton wrote:
> On Mon, 19 Nov 2012 00:07:29 +0800
> Jiang Liu <liuj97@gmail.com> wrote:
> 
>> If SPARSEMEM is enabled, it won't build page structures for
>> non-existing pages (holes) within a zone, so provide a more accurate
>> estimation of pages occupied by memmap if there are big holes within
>> the zone.
>>
>> And pages for highmem zones' memmap will be allocated from lowmem,
>> so charge nr_kernel_pages for that.
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4435,6 +4435,22 @@ void __init set_pageblock_order(void)
>>  
>>  #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>>  
>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
>> +				      unsigned long present_pages)
>> +{
>> +	unsigned long pages = spanned_pages;
>> +
>> +	/*
>> +	 * Provide a more accurate estimation if there are big holes within
>> +	 * the zone and SPARSEMEM is in use.
>> +	 */
>> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
>> +	    IS_ENABLED(CONFIG_SPARSEMEM))
>> +		pages = present_pages;
>> +
>> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
>> +}
> 
> Please explain the ">> 4" heuristic more completely - preferably in both
> the changelog and code comments.  Why can't we calculate this
> requirement exactly?  That might require a second pass, but that's OK for
> code like this?
Hi Andrew,
	A normal x86 platform always has some holes within the DMA ZONE,
so the ">> 4" heuristic is there to avoid applying this adjustment to the
DMA ZONE on x86 platforms.
	Because memmap_size is just an estimate, I feel it's OK to
remove the ">> 4" heuristic; that shouldn't affect much.
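
To make the threshold concrete, here is a back-of-the-envelope example with
made-up numbers (a 4096-page x86 DMA zone with about 150 pages of holes):

	condition: spanned_pages > present_pages + (present_pages >> 4)

	spanned   = 4096
	present   = 4096 - 150		-> 3946
	threshold = 3946 + (3946 >> 4)	-> 3946 + 246 = 4192

Since 4096 <= 4192, calc_memmap_size() keeps spanned_pages, so the small
DMA-zone holes don't shrink the memmap estimate; only zones whose holes
exceed roughly 1/16 of present_pages switch over to present_pages.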

Thanks
Gerry

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
  2012-11-19 21:36           ` Maciej Rutecki
@ 2012-11-20 16:03             ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-20 16:03 UTC (permalink / raw)
  To: maciej.rutecki
  Cc: Andrew Morton, Wen Congyang, David Rientjes, Jiang Liu,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

Thanks, Maciej!

On 11/20/2012 05:36 AM, Maciej Rutecki wrote:
> On Sunday, 18 November 2012 at 17:07:25 Jiang Liu wrote:
>> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
>> resolve an issue caused by inaccurate zone->present_pages, but that
>> fix is incomplete and causes regressions with HIGHMEM. And it has been
>> reverted by commit
>> 5576646 revert "mm: fix-up zone present pages"
>>
>> This is a follow-up patchset for the issue above. It introduces a
>> new field named "managed_pages" to struct zone, which counts pages
>> managed by the buddy system from the zone. And zone->present_pages
>> is used to count pages existing in the zone, which is
>> 	spanned_pages - absent_pages.
>>
>> That way, zone->present_pages will be kept consistent with
>> pgdat->node_present_pages, which is the sum of zone->present_pages.
>>
>> This patchset has only been tested on x86_64 with nobootmem.c. So we need
>> help testing this patchset on machines that:
>> 1) use bootmem.c
>> 2) have highmem
>>
>> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>
>> Any comments and help are welcome!
>>
>> Jiang Liu (5):
>>   mm: introduce new field "managed_pages" to struct zone
>>   mm: replace zone->present_pages with zone->managed_pages if
>>     appreciated
>>   mm: set zone->present_pages to number of existing pages in the zone
>>   mm: provide more accurate estimation of pages occupied by memmap
>>   mm: increase totalram_pages when free pages allocated by bootmem
>>     allocator
>>
>>  include/linux/mmzone.h |    1 +
>>  mm/bootmem.c           |   14 ++++++++
>>  mm/memory_hotplug.c    |    6 ++++
>>  mm/mempolicy.c         |    2 +-
>>  mm/nobootmem.c         |   15 ++++++++
>>  mm/page_alloc.c        |   89 +++++++++++++++++++++++++++++++-----------------
>>  mm/vmscan.c            |   16 ++++-----
>>  mm/vmstat.c            |    8 +++--
>>  8 files changed, 108 insertions(+), 43 deletions(-)
> Tested on 32-bit Linux with HIGHMEM. Seems to be OK.
> 
> Regards
> 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-20 15:18               ` Jiang Liu
@ 2012-11-20 19:19                 ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-20 19:19 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Tue, 20 Nov 2012 23:18:34 +0800
Jiang Liu <liuj97@gmail.com> wrote:

> >> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> >> +				      unsigned long present_pages)
> >> +{
> >> +	unsigned long pages = spanned_pages;
> >> +
> >> +	/*
> >> +	 * Provide a more accurate estimation if there are big holes within
> >> +	 * the zone and SPARSEMEM is in use.
> >> +	 */
> >> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
> >> +	    IS_ENABLED(CONFIG_SPARSEMEM))
> >> +		pages = present_pages;
> >> +
> >> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> >> +}
> > 
>>> Please explain the ">> 4" heuristic more completely - preferably in both
> > the changelog and code comments.  Why can't we calculate this
> > requirement exactly?  That might require a second pass, but that's OK for
> > code like this?
> Hi Andrew,
> 	A normal x86 platform always has some holes within the DMA ZONE,
> so the ">> 4" heuristic is there to avoid applying this adjustment to the
> DMA ZONE on x86 platforms.
> 	Because memmap_size is just an estimate, I feel it's OK to
> remove the ">> 4" heuristic; that shouldn't affect much.

Again: why can't we calculate this requirement exactly?  That might
require a second pass, but that's OK for code like this?


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone
  2012-11-20 14:56               ` Jiang Liu
@ 2012-11-20 19:31                 ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-20 19:31 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Tue, 20 Nov 2012 22:56:11 +0800
Jiang Liu <liuj97@gmail.com> wrote:

> On 11/20/2012 07:38 AM, Andrew Morton wrote:
> > On Mon, 19 Nov 2012 00:07:26 +0800
> > Jiang Liu <liuj97@gmail.com> wrote:
>
> ...
>
> > Also, the existing comment tells us that spanned_pages and
> > present_pages are protected by span_seqlock but has not been updated to
> > describe the locking (if any) for managed_pages.
> How about this?
>
> ...
>

Looks nice.
 
> >> +	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> >> +		if (!is_highmem(z))
> > 
> > Needs a comment explaining why we skip the highmem zone, please.
> How about this?
>
> ...
>

Ditto.

> >> @@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info,  struct page *page,
> >>  void __ref put_page_bootmem(struct page *page)
> >>  {
> >>  	unsigned long type;
> >> +	static DEFINE_MUTEX(ppb_lock);
> >>  
> >>  	type = (unsigned long) page->lru.next;
> >>  	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
> >> @@ -115,7 +116,9 @@ void __ref put_page_bootmem(struct page *page)
> >>  		ClearPagePrivate(page);
> >>  		set_page_private(page, 0);
> >>  		INIT_LIST_HEAD(&page->lru);
> >> +		mutex_lock(&ppb_lock);
> >>  		__free_pages_bootmem(page, 0);
> >> +		mutex_unlock(&ppb_lock);
> > 
> > The mutex is odd.  Nothing in the changelog, no code comment. 
> > __free_pages_bootmem() is called from a lot of places but only this one
> > has locking.  I'm madly guessing that the lock is here to handle two or
> > more concurrent memory hotpluggings, but I shouldn't need to guess!!
> Actually I'm a little hesitant about whether we should add a lock here.
> 
> All callers of __free_pages_bootmem() other than put_page_bootmem() are
> only used at startup time. And the only current caller of put_page_bootmem()
> is already protected by pgdat_resize_lock(pgdat, &flags). So there's
> no real need for the lock; it's just defensive.
> 
> I'm not sure which is the best solution here.
> 1) Add a comment to __free_pages_bootmem() stating that callers must
>    serialize themselves.
> 2) Use a dedicated lock to serialize updates to zone->managed_pages; this needs
>    modifications to page_alloc.c and memory_hotplug.c.
> 3) The solution above, serializing in put_page_bootmem().
> What are your suggestions here?

Firstly, let's be clear about what *data* we're protecting here.  I
think it's only ->managed_pages?

I agree that no locking is needed during the init-time code.

So afaict we only need to be concerned about concurrent updates to
->managed_pages via memory hotplug, and lock_memory_hotplug() is
sufficient there.  We don't need to be concerned about readers of
managed_pages because it is an unsigned long (a u64 on 32-bit machines
would be a problem).

All correct?  If so, the code is OK as-is and this can all be
described/formalised in code comments.  If one wants to be really
confident, we could do something along the lines of

void mod_zone_managed_pages(struct zone *zone, signed long delta)
{
	WARN_ON(system_state != SYSTEM_BOOTING &&
		!is_locked_memory_hotplug());
	zone->managed_pages += delta;
}

And yes, is_locked_memory_hotplug() is a dopey name. 
[un]lock_memory_hotplug() should have been called
memory_hotplug_[un]lock()!



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone
  2012-11-20 19:31                 ` Andrew Morton
@ 2012-11-21 14:36                   ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-21 14:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On 11/21/2012 03:31 AM, Andrew Morton wrote:

>>>> @@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info,  struct page *page,
>>>>  void __ref put_page_bootmem(struct page *page)
>>>>  {
>>>>  	unsigned long type;
>>>> +	static DEFINE_MUTEX(ppb_lock);
>>>>  
>>>>  	type = (unsigned long) page->lru.next;
>>>>  	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
>>>> @@ -115,7 +116,9 @@ void __ref put_page_bootmem(struct page *page)
>>>>  		ClearPagePrivate(page);
>>>>  		set_page_private(page, 0);
>>>>  		INIT_LIST_HEAD(&page->lru);
>>>> +		mutex_lock(&ppb_lock);
>>>>  		__free_pages_bootmem(page, 0);
>>>> +		mutex_unlock(&ppb_lock);
>>>
>>> The mutex is odd.  Nothing in the changelog, no code comment. 
>>> __free_pages_bootmem() is called from a lot of places but only this one
>>> has locking.  I'm madly guessing that the lock is here to handle two or
>>> more concurrent memory hotpluggings, but I shouldn't need to guess!!
>> Actually I'm a little hesitant about whether we should add a lock here.
>>
>> All callers of __free_pages_bootmem() other than put_page_bootmem() are
>> only used at startup time. And the only current caller of put_page_bootmem()
>> is already protected by pgdat_resize_lock(pgdat, &flags). So there's
>> no real need for the lock; it's just defensive.
>>
>> I'm not sure which is the best solution here.
>> 1) Add a comment to __free_pages_bootmem() stating that callers must
>>    serialize themselves.
>> 2) Use a dedicated lock to serialize updates to zone->managed_pages; this needs
>>    modifications to page_alloc.c and memory_hotplug.c.
>> 3) The solution above, serializing in put_page_bootmem().
>> What are your suggestions here?
> 
> Firstly, let's be clear about what *data* we're protecting here.  I
> think it's only ->managed_pages?
Yes, we are just trying to protect ->managed_pages.

> I agree that no locking is needed during the init-time code.
> 
> So afaict we only need to be concerned about concurrent updates to
> ->managed_pages via memory hotplug, and lock_memory_hotplug() is
> sufficient there.  We don't need to be concerned about readers of
> managed_pages because it is an unsigned long (a u64 on 32-bit machines
> would be a problem).
> 
> All correct?  If so, the code is OK as-is and this can all be
> described/formalised in code comments.  If one wants to be really
> confident, we could do something along the lines of
OK, I will add some comments to describe this.

> void mod_zone_managed_pages(struct zone *zone, signed long delta)
> {
> 	WARN_ON(system_state != SYSTEM_BOOTING &&
> 		!is_locked_memory_hotplug());
> 	zone->managed_pages += delta;
> }
This seems to add a little overhead because __free_pages_bootmem() is on the hot
path and will be called many times at boot time (a possible compromise is
sketched below).
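
Just a sketch of what I mean (is_locked_memory_hotplug() is the helper you
named above, which doesn't exist yet): the check could be compiled out of
production builds, so the boot-time hot path only pays for it when
CONFIG_DEBUG_VM is enabled:

void mod_zone_managed_pages(struct zone *zone, signed long delta)
{
#ifdef CONFIG_DEBUG_VM
	/* catch unlocked runtime updates in debug builds only */
	WARN_ON(system_state != SYSTEM_BOOTING &&
		!is_locked_memory_hotplug());
#endif
	zone->managed_pages += delta;
}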

Regards!
Gerry

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-20 19:19                 ` Andrew Morton
@ 2012-11-21 14:52                   ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-21 14:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On 11/21/2012 03:19 AM, Andrew Morton wrote:
> On Tue, 20 Nov 2012 23:18:34 +0800
> Jiang Liu <liuj97@gmail.com> wrote:
> 
>>>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
>>>> +				      unsigned long present_pages)
>>>> +{
>>>> +	unsigned long pages = spanned_pages;
>>>> +
>>>> +	/*
>>>> +	 * Provide a more accurate estimation if there are big holes within
>>>> +	 * the zone and SPARSEMEM is in use.
>>>> +	 */
>>>> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
>>>> +	    IS_ENABLED(CONFIG_SPARSEMEM))
>>>> +		pages = present_pages;
>>>> +
>>>> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
>>>> +}
>>>
>>> Please explain the ">> 4" heuristic more completely - preferably in both
>>> the changelog and code comments.  Why can't we calculate this
>>> requirement exactly?  That might require a second pass, but that's OK for
>>> code like this?
>> Hi Andrew,
>> 	A normal x86 platform always has some holes within the DMA ZONE,
>> so the ">> 4" heuristic is there to avoid applying this adjustment to the
>> DMA ZONE on x86 platforms.
>> 	Because memmap_size is just an estimate, I feel it's OK to
>> remove the ">> 4" heuristic; that shouldn't affect much.
> 
> Again: why can't we calculate this requirement exactly?  That might
> require a second pass, but that's OK for code like this?

Hi Andrew,
	If there are holes within a zone, it may cost us one or two extra pages
for each populated region within the zone due to alignment, because the memmap
for each populated region may not be naturally aligned on a page boundary.
	Originally the ">> 4" heuristic was meant to trade off these extra
memmap pages, especially for small zones like the DMA zone.
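
A rough, made-up illustration of that alignment cost (assuming a 4KB page
size and a 64-byte struct page):

	one populated region of 1000 pages:
	memmap bytes = 1000 * 64		-> 64000 bytes
	memmap pages = 64000 / 4096, rounded up	-> 16 pages

If the region's memmap does not start on a page boundary it can straddle one
extra page, so a zone with N populated regions may need up to N more memmap
pages than the simple estimate.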
	Thanks!



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [RFT PATCH v2 1/5] mm: introduce new field "managed_pages" to struct zone
  2012-11-20 19:31                 ` Andrew Morton
@ 2012-11-21 15:06                   ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-21 15:06 UTC (permalink / raw)
  To: Andrew Morton, Wen Congyang, David Rientjes
  Cc: Jiang Liu, Maciej Rutecki, Chris Clayton, Rafael J . Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	Jianguo Wu, linux-mm, linux-kernel

Currently a zone's present_pages is calculated as below, which is
inaccurate and may cause trouble to memory hotplug.
	spanned_pages - absent_pages - memmap_pages - dma_reserve.

During fixing bugs caused by inaccurate zone->present_pages, we found
zone->present_pages has been abused. The field zone->present_pages
may have different meanings in different contexts:
1) pages existing in a zone.
2) pages managed by the buddy system.

For more discussions about the issue, please refer to:
http://lkml.org/lkml/2012/11/5/866
https://patchwork.kernel.org/patch/1346751/

This patchset tries to introduce a new field named "managed_pages" to
struct zone, which counts "pages managed by the buddy system". And
revert zone->present_pages to count "physical pages existing in a zone",
which also keeps it consistent with pgdat->node_present_pages.

We will set an initial value for zone->managed_pages in function
free_area_init_core() and will adjust it later if the initial value is
inaccurate.

For DMA/normal zones, the initial value is set to:
	(spanned_pages - absent_pages - memmap_pages - dma_reserve)
Later zone->managed_pages will be adjusted to the accurate value when
the bootmem allocator frees all free pages to the buddy system in
function free_all_bootmem_node() and free_all_bootmem().

The bootmem allocator doesn't touch highmem pages, so highmem zones'
managed_pages is set to the accurate value "spanned_pages - absent_pages"
in function free_area_init_core() and won't be updated anymore.

This patch also adds a new field "managed_pages" to /proc/zoneinfo
and sysrq showmem.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 include/linux/mmzone.h |   41 ++++++++++++++++++++++++++++++++++-------
 mm/bootmem.c           |   21 +++++++++++++++++++++
 mm/memory_hotplug.c    |   10 ++++++++++
 mm/nobootmem.c         |   22 ++++++++++++++++++++++
 mm/page_alloc.c        |   44 ++++++++++++++++++++++++++++++--------------
 mm/vmstat.c            |    6 ++++--
 6 files changed, 121 insertions(+), 23 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a23923b..47f5672 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -469,17 +469,44 @@ struct zone {
 	unsigned long		zone_start_pfn;
 
 	/*
-	 * zone_start_pfn, spanned_pages and present_pages are all
-	 * protected by span_seqlock.  It is a seqlock because it has
-	 * to be read outside of zone->lock, and it is done in the main
-	 * allocator path.  But, it is written quite infrequently.
+	 * spanned_pages is the total pages spanned by the zone, including
+	 * holes, which is calculated as:
+	 * 	spanned_pages = zone_end_pfn - zone_start_pfn;
 	 *
-	 * The lock is declared along with zone->lock because it is
+	 * present_pages is physical pages existing within the zone, which
+	 * is calculated as:
+	 *	present_pages = spanned_pages - absent_pages(pages in holes);
+	 *
+	 * managed_pages is present pages managed by the buddy system, which
+	 * is calculated as (reserved_pages includes pages allocated by the
+	 * bootmem allocator):
+	 *	managed_pages = present_pages - reserved_pages;
+	 *
+	 * So present_pages may be used by memory hotplug or memory power
+	 * management logic to figure out unmanaged pages by checking
+	 * (present_pages - managed_pages). And managed_pages should be used
+	 * by page allocator and vm scanner to calculate all kinds of watermarks
+	 * and thresholds.
+	 *
+	 * Lock Rules:
+	 *
+	 * zone_start_pfn, spanned_pages are protected by span_seqlock.
+	 * It is a seqlock because it has to be read outside of zone->lock,
+	 * and it is done in the main allocator path.  But, it is written
+	 * quite infrequently.
+	 *
+	 * The span_seq lock is declared along with zone->lock because it is
 	 * frequently read in proximity to zone->lock.  It's good to
 	 * give them a chance of being in the same cacheline.
+	 *
+	 * Writing access to present_pages and managed_pages at runtime should
+	 * be protected by lock_memory_hotplug()/unlock_memory_hotplug().
+	 * Any reader who can't tolerate drift of present_pages and
+	 * managed_pages should hold memory hotplug lock to get a stable value.
 	 */
-	unsigned long		spanned_pages;	/* total size, including holes */
-	unsigned long		present_pages;	/* amount of memory (excluding holes) */
+	unsigned long		spanned_pages;
+	unsigned long		present_pages;
+	unsigned long		managed_pages;
 
 	/*
 	 * rarely used fields:
diff --git a/mm/bootmem.c b/mm/bootmem.c
index f468185..4117294 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -229,6 +229,22 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
 	return count;
 }
 
+static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
+{
+	struct zone *z;
+
+	/*
+	 * In free_area_init_core(), highmem zone's managed_pages is set to
+	 * present_pages, and bootmem allocator doesn't allocate from highmem
+	 * zones. So there's no need to recalculate managed_pages because all
+	 * highmem pages will be managed by the buddy system. Here highmem
+	 * zone also includes highmem movable zone.
+	 */
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		if (!is_highmem(z))
+			z->managed_pages = 0;
+}
+
 /**
  * free_all_bootmem_node - release a node's free pages to the buddy allocator
  * @pgdat: node to be released
@@ -238,6 +254,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
 unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
 {
 	register_page_bootmem_info_node(pgdat);
+	reset_node_lowmem_managed_pages(pgdat);
 	return free_all_bootmem_core(pgdat->bdata);
 }
 
@@ -250,6 +267,10 @@ unsigned long __init free_all_bootmem(void)
 {
 	unsigned long total_pages = 0;
 	bootmem_data_t *bdata;
+	struct pglist_data *pgdat;
+
+	for_each_online_pgdat(pgdat)
+		reset_node_lowmem_managed_pages(pgdat);
 
 	list_for_each_entry(bdata, &bdata_list, list)
 		total_pages += free_all_bootmem_core(bdata);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4eeaca..1a9e675 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info,  struct page *page,
 void __ref put_page_bootmem(struct page *page)
 {
 	unsigned long type;
+	static DEFINE_MUTEX(ppb_lock);
 
 	type = (unsigned long) page->lru.next;
 	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
@@ -115,7 +116,14 @@ void __ref put_page_bootmem(struct page *page)
 		ClearPagePrivate(page);
 		set_page_private(page, 0);
 		INIT_LIST_HEAD(&page->lru);
+
+		/*
+		 * Please refer to comment for __free_pages_bootmem()
+		 * for why we serialize here.
+		 */
+		mutex_lock(&ppb_lock);
 		__free_pages_bootmem(page, 0);
+		mutex_unlock(&ppb_lock);
 	}
 
 }
@@ -514,6 +522,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
 		return ret;
 	}
 
+	zone->managed_pages += onlined_pages;
 	zone->present_pages += onlined_pages;
 	zone->zone_pgdat->node_present_pages += onlined_pages;
 	if (onlined_pages) {
@@ -961,6 +970,7 @@ repeat:
 	/* reset pagetype flags and makes migrate type to be MOVABLE */
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	/* removal success */
+	zone->managed_pages -= offlined_pages;
 	zone->present_pages -= offlined_pages;
 	zone->zone_pgdat->node_present_pages -= offlined_pages;
 	totalram_pages -= offlined_pages;
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bd82f6b..b8294fc 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -137,6 +137,22 @@ unsigned long __init free_low_memory_core_early(int nodeid)
 	return count;
 }
 
+static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
+{
+	struct zone *z;
+
+	/*
+	 * In free_area_init_core(), highmem zone's managed_pages is set to
+	 * present_pages, and bootmem allocator doesn't allocate from highmem
+	 * zones. So there's no need to recalculate managed_pages because all
+	 * highmem pages will be managed by the buddy system. Here highmem
+	 * zone also includes highmem movable zone.
+	 */
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		if (!is_highmem(z))
+			z->managed_pages = 0;
+}
+
 /**
  * free_all_bootmem_node - release a node's free pages to the buddy allocator
  * @pgdat: node to be released
@@ -146,6 +162,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
 unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
 {
 	register_page_bootmem_info_node(pgdat);
+	reset_node_lowmem_managed_pages(pgdat);
 
 	/* free_low_memory_core_early(MAX_NUMNODES) will be called later */
 	return 0;
@@ -158,6 +175,11 @@ unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
  */
 unsigned long __init free_all_bootmem(void)
 {
+	struct pglist_data *pgdat;
+
+	for_each_online_pgdat(pgdat)
+		reset_node_lowmem_managed_pages(pgdat);
+
 	/*
 	 * We need to use MAX_NUMNODES instead of NODE_DATA(0)->node_id
 	 *  because in some case like Node0 doesn't have RAM installed
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7bb35ac..4418a04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -730,6 +730,13 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	local_irq_restore(flags);
 }
 
+/*
+ * Read access to zone->managed_pages is safe because it's unsigned long,
+ * but we still need to serialize writers. Currently all callers of
+ * __free_pages_bootmem() except put_page_bootmem() are only used
+ * at boot time. So for shorter boot time, we have shifted the burden to
+ * put_page_bootmem() to serialize writers.
+ */
 void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
 {
 	unsigned int nr_pages = 1 << order;
@@ -745,6 +752,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
 		set_page_count(p, 0);
 	}
 
+	page_zone(page)->managed_pages += 1 << order;
 	set_page_refcounted(page);
 	__free_pages(page, order);
 }
@@ -2950,6 +2958,7 @@ void show_free_areas(unsigned int filter)
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
 			" present:%lukB"
+			" managed:%lukB"
 			" mlocked:%lukB"
 			" dirty:%lukB"
 			" writeback:%lukB"
@@ -2979,6 +2988,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_ISOLATED_ANON)),
 			K(zone_page_state(zone, NR_ISOLATED_FILE)),
 			K(zone->present_pages),
+			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
 			K(zone_page_state(zone, NR_FILE_DIRTY)),
 			K(zone_page_state(zone, NR_WRITEBACK)),
@@ -4455,48 +4465,54 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
-		unsigned long size, realsize, memmap_pages;
+		unsigned long size, realsize, freesize, memmap_pages;
 
 		size = zone_spanned_pages_in_node(nid, j, zones_size);
-		realsize = size - zone_absent_pages_in_node(nid, j,
+		realsize = freesize = size - zone_absent_pages_in_node(nid, j,
 								zholes_size);
 
 		/*
-		 * Adjust realsize so that it accounts for how much memory
+		 * Adjust freesize so that it accounts for how much memory
 		 * is used by this zone for memmap. This affects the watermark
 		 * and per-cpu initialisations
 		 */
 		memmap_pages =
 			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
-		if (realsize >= memmap_pages) {
-			realsize -= memmap_pages;
+		if (freesize >= memmap_pages) {
+			freesize -= memmap_pages;
 			if (memmap_pages)
 				printk(KERN_DEBUG
 				       "  %s zone: %lu pages used for memmap\n",
 				       zone_names[j], memmap_pages);
 		} else
 			printk(KERN_WARNING
-				"  %s zone: %lu pages exceeds realsize %lu\n",
-				zone_names[j], memmap_pages, realsize);
+				"  %s zone: %lu pages exceeds freesize %lu\n",
+				zone_names[j], memmap_pages, freesize);
 
 		/* Account for reserved pages */
-		if (j == 0 && realsize > dma_reserve) {
-			realsize -= dma_reserve;
+		if (j == 0 && freesize > dma_reserve) {
+			freesize -= dma_reserve;
 			printk(KERN_DEBUG "  %s zone: %lu pages reserved\n",
 					zone_names[0], dma_reserve);
 		}
 
 		if (!is_highmem_idx(j))
-			nr_kernel_pages += realsize;
-		nr_all_pages += realsize;
+			nr_kernel_pages += freesize;
+		nr_all_pages += freesize;
 
 		zone->spanned_pages = size;
-		zone->present_pages = realsize;
+		zone->present_pages = freesize;
+		/*
+		 * Set an approximate value for lowmem here, it will be adjusted
+		 * when the bootmem allocator frees pages into the buddy system.
+		 * And all highmem pages will be managed by the buddy system.
+		 */
+		zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
 #ifdef CONFIG_NUMA
 		zone->node = nid;
-		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
+		zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
 						/ 100;
-		zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
+		zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
 		spin_lock_init(&zone->lock);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..e47d31c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -992,14 +992,16 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        high     %lu"
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
-		   "\n        present  %lu",
+		   "\n        present  %lu"
+		   "\n        managed  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
 		   zone->pages_scanned,
 		   zone->spanned_pages,
-		   zone->present_pages);
+		   zone->present_pages,
+		   zone->managed_pages);
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFT PATCH v2 1/5] mm: introduce new field "managed_pages" to struct zone
@ 2012-11-21 15:06                   ` Jiang Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-21 15:06 UTC (permalink / raw)
  To: Andrew Morton, Wen Congyang, David Rientjes
  Cc: Jiang Liu, Maciej Rutecki, Chris Clayton, Rafael J . Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	Jianguo Wu, linux-mm, linux-kernel

Currently a zone's present_pages is calculated as below, which is
inaccurate and may cause trouble to memory hotplug.
	spanned_pages - absent_pages - memmap_pages - dma_reserve.

During fixing bugs caused by inaccurate zone->present_pages, we found
zone->present_pages has been abused. The field zone->present_pages
may have different meanings in different contexts:
1) pages existing in a zone.
2) pages managed by the buddy system.

For more discussions about the issue, please refer to:
http://lkml.org/lkml/2012/11/5/866
https://patchwork.kernel.org/patch/1346751/

This patchset introduces a new field named "managed_pages" to
struct zone, which counts "pages managed by the buddy system", and
reverts zone->present_pages to counting "physical pages existing in a
zone", which also keeps it consistent with pgdat->node_present_pages.

We will set an initial value for zone->managed_pages in function
free_area_init_core() and will adjust it later if the initial value is
inaccurate.

For DMA/normal zones, the initial value is set to:
	(spanned_pages - absent_pages - memmap_pages - dma_reserve)
Later zone->managed_pages will be adjusted to the accurate value when
the bootmem allocator frees all free pages to the buddy system in
function free_all_bootmem_node() and free_all_bootmem().

The bootmem allocator doesn't touch highmem pages, so highmem zones'
managed_pages is set to the accurate value "spanned_pages - absent_pages"
in function free_area_init_core() and won't be updated anymore.

This patch also adds a new field "managed_pages" to /proc/zoneinfo
and sysrq showmem.
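
As a rough illustration of how the two counters are meant to be consumed
(the helper below is not part of the patchset, just a sketch of the
intended semantics):

	/* Hypothetical helper, for illustration only, not in this patchset. */
	static inline unsigned long zone_unmanaged_pages(struct zone *zone)
	{
		/* pages present in the zone but never handed to the buddy allocator */
		return zone->present_pages - zone->managed_pages;
	}

Memory hotplug or memory power management code could use such a
difference, while watermark and reclaim calculations should be based on
managed_pages.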

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 include/linux/mmzone.h |   41 ++++++++++++++++++++++++++++++++++-------
 mm/bootmem.c           |   21 +++++++++++++++++++++
 mm/memory_hotplug.c    |   10 ++++++++++
 mm/nobootmem.c         |   22 ++++++++++++++++++++++
 mm/page_alloc.c        |   44 ++++++++++++++++++++++++++++++--------------
 mm/vmstat.c            |    6 ++++--
 6 files changed, 121 insertions(+), 23 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a23923b..47f5672 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -469,17 +469,44 @@ struct zone {
 	unsigned long		zone_start_pfn;
 
 	/*
-	 * zone_start_pfn, spanned_pages and present_pages are all
-	 * protected by span_seqlock.  It is a seqlock because it has
-	 * to be read outside of zone->lock, and it is done in the main
-	 * allocator path.  But, it is written quite infrequently.
+	 * spanned_pages is the total pages spanned by the zone, including
+	 * holes, which is calculated as:
+	 * 	spanned_pages = zone_end_pfn - zone_start_pfn;
 	 *
-	 * The lock is declared along with zone->lock because it is
+	 * present_pages is physical pages existing within the zone, which
+	 * is calculated as:
+	 *	present_pages = spanned_pages - absent_pages(pages in holes);
+	 *
+	 * managed_pages is present pages managed by the buddy system, which
+	 * is calculated as (reserved_pages includes pages allocated by the
+	 * bootmem allocator):
+	 *	managed_pages = present_pages - reserved_pages;
+	 *
+	 * So present_pages may be used by memory hotplug or memory power
+	 * management logic to figure out unmanaged pages by checking
+	 * (present_pages - managed_pages). And managed_pages should be used
+	 * by page allocator and vm scanner to calculate all kinds of watermarks
+	 * and thresholds.
+	 *
+	 * Lock Rules:
+	 *
+	 * zone_start_pfn, spanned_pages are protected by span_seqlock.
+	 * It is a seqlock because it has to be read outside of zone->lock,
+	 * and it is done in the main allocator path.  But, it is written
+	 * quite infrequently.
+	 *
+	 * The span_seq lock is declared along with zone->lock because it is
 	 * frequently read in proximity to zone->lock.  It's good to
 	 * give them a chance of being in the same cacheline.
+	 *
+	 * Writing access to present_pages and managed_pages at runtime should
+	 * be protected by lock_memory_hotplug()/unlock_memory_hotplug().
+	 * Any reader who can't tolerate drift of present_pages and
+	 * managed_pages should hold memory hotplug lock to get a stable value.
 	 */
-	unsigned long		spanned_pages;	/* total size, including holes */
-	unsigned long		present_pages;	/* amount of memory (excluding holes) */
+	unsigned long		spanned_pages;
+	unsigned long		present_pages;
+	unsigned long		managed_pages;
 
 	/*
 	 * rarely used fields:
diff --git a/mm/bootmem.c b/mm/bootmem.c
index f468185..4117294 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -229,6 +229,22 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
 	return count;
 }
 
+static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
+{
+	struct zone *z;
+
+	/*
+	 * In free_area_init_core(), highmem zone's managed_pages is set to
+	 * present_pages, and bootmem allocator doesn't allocate from highmem
+	 * zones. So there's no need to recalculate managed_pages because all
+	 * highmem pages will be managed by the buddy system. Here highmem
+	 * zone also includes highmem movable zone.
+	 */
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		if (!is_highmem(z))
+			z->managed_pages = 0;
+}
+
 /**
  * free_all_bootmem_node - release a node's free pages to the buddy allocator
  * @pgdat: node to be released
@@ -238,6 +254,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
 unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
 {
 	register_page_bootmem_info_node(pgdat);
+	reset_node_lowmem_managed_pages(pgdat);
 	return free_all_bootmem_core(pgdat->bdata);
 }
 
@@ -250,6 +267,10 @@ unsigned long __init free_all_bootmem(void)
 {
 	unsigned long total_pages = 0;
 	bootmem_data_t *bdata;
+	struct pglist_data *pgdat;
+
+	for_each_online_pgdat(pgdat)
+		reset_node_lowmem_managed_pages(pgdat);
 
 	list_for_each_entry(bdata, &bdata_list, list)
 		total_pages += free_all_bootmem_core(bdata);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4eeaca..1a9e675 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info,  struct page *page,
 void __ref put_page_bootmem(struct page *page)
 {
 	unsigned long type;
+	static DEFINE_MUTEX(ppb_lock);
 
 	type = (unsigned long) page->lru.next;
 	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
@@ -115,7 +116,14 @@ void __ref put_page_bootmem(struct page *page)
 		ClearPagePrivate(page);
 		set_page_private(page, 0);
 		INIT_LIST_HEAD(&page->lru);
+
+		/*
+		 * Please refer to comment for __free_pages_bootmem()
+		 * for why we serialize here.
+		 */
+		mutex_lock(&ppb_lock);
 		__free_pages_bootmem(page, 0);
+		mutex_unlock(&ppb_lock);
 	}
 
 }
@@ -514,6 +522,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
 		return ret;
 	}
 
+	zone->managed_pages += onlined_pages;
 	zone->present_pages += onlined_pages;
 	zone->zone_pgdat->node_present_pages += onlined_pages;
 	if (onlined_pages) {
@@ -961,6 +970,7 @@ repeat:
 	/* reset pagetype flags and makes migrate type to be MOVABLE */
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	/* removal success */
+	zone->managed_pages -= offlined_pages;
 	zone->present_pages -= offlined_pages;
 	zone->zone_pgdat->node_present_pages -= offlined_pages;
 	totalram_pages -= offlined_pages;
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bd82f6b..b8294fc 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -137,6 +137,22 @@ unsigned long __init free_low_memory_core_early(int nodeid)
 	return count;
 }
 
+static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
+{
+	struct zone *z;
+
+	/*
+	 * In free_area_init_core(), highmem zone's managed_pages is set to
+	 * present_pages, and bootmem allocator doesn't allocate from highmem
+	 * zones. So there's no need to recalculate managed_pages because all
+	 * highmem pages will be managed by the buddy system. Here highmem
+	 * zone also includes highmem movable zone.
+	 */
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		if (!is_highmem(z))
+			z->managed_pages = 0;
+}
+
 /**
  * free_all_bootmem_node - release a node's free pages to the buddy allocator
  * @pgdat: node to be released
@@ -146,6 +162,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
 unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
 {
 	register_page_bootmem_info_node(pgdat);
+	reset_node_lowmem_managed_pages(pgdat);
 
 	/* free_low_memory_core_early(MAX_NUMNODES) will be called later */
 	return 0;
@@ -158,6 +175,11 @@ unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
  */
 unsigned long __init free_all_bootmem(void)
 {
+	struct pglist_data *pgdat;
+
+	for_each_online_pgdat(pgdat)
+		reset_node_lowmem_managed_pages(pgdat);
+
 	/*
 	 * We need to use MAX_NUMNODES instead of NODE_DATA(0)->node_id
 	 *  because in some case like Node0 doesn't have RAM installed
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7bb35ac..4418a04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -730,6 +730,13 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	local_irq_restore(flags);
 }
 
+/*
+ * Read access to zone->managed_pages is safe because it's unsigned long,
+ * but we still need to serialize writers. Currently all callers of
+ * __free_pages_bootmem() except put_page_bootmem() should only be used
+ * at boot time. So for a shorter boot time, we have shifted the burden to
+ * put_page_bootmem() to serialize writers.
+ */
 void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
 {
 	unsigned int nr_pages = 1 << order;
@@ -745,6 +752,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
 		set_page_count(p, 0);
 	}
 
+	page_zone(page)->managed_pages += 1 << order;
 	set_page_refcounted(page);
 	__free_pages(page, order);
 }
@@ -2950,6 +2958,7 @@ void show_free_areas(unsigned int filter)
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
 			" present:%lukB"
+			" managed:%lukB"
 			" mlocked:%lukB"
 			" dirty:%lukB"
 			" writeback:%lukB"
@@ -2979,6 +2988,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_ISOLATED_ANON)),
 			K(zone_page_state(zone, NR_ISOLATED_FILE)),
 			K(zone->present_pages),
+			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
 			K(zone_page_state(zone, NR_FILE_DIRTY)),
 			K(zone_page_state(zone, NR_WRITEBACK)),
@@ -4455,48 +4465,54 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
-		unsigned long size, realsize, memmap_pages;
+		unsigned long size, realsize, freesize, memmap_pages;
 
 		size = zone_spanned_pages_in_node(nid, j, zones_size);
-		realsize = size - zone_absent_pages_in_node(nid, j,
+		realsize = freesize = size - zone_absent_pages_in_node(nid, j,
 								zholes_size);
 
 		/*
-		 * Adjust realsize so that it accounts for how much memory
+		 * Adjust freesize so that it accounts for how much memory
 		 * is used by this zone for memmap. This affects the watermark
 		 * and per-cpu initialisations
 		 */
 		memmap_pages =
 			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
-		if (realsize >= memmap_pages) {
-			realsize -= memmap_pages;
+		if (freesize >= memmap_pages) {
+			freesize -= memmap_pages;
 			if (memmap_pages)
 				printk(KERN_DEBUG
 				       "  %s zone: %lu pages used for memmap\n",
 				       zone_names[j], memmap_pages);
 		} else
 			printk(KERN_WARNING
-				"  %s zone: %lu pages exceeds realsize %lu\n",
-				zone_names[j], memmap_pages, realsize);
+				"  %s zone: %lu pages exceeds freesize %lu\n",
+				zone_names[j], memmap_pages, freesize);
 
 		/* Account for reserved pages */
-		if (j == 0 && realsize > dma_reserve) {
-			realsize -= dma_reserve;
+		if (j == 0 && freesize > dma_reserve) {
+			freesize -= dma_reserve;
 			printk(KERN_DEBUG "  %s zone: %lu pages reserved\n",
 					zone_names[0], dma_reserve);
 		}
 
 		if (!is_highmem_idx(j))
-			nr_kernel_pages += realsize;
-		nr_all_pages += realsize;
+			nr_kernel_pages += freesize;
+		nr_all_pages += freesize;
 
 		zone->spanned_pages = size;
-		zone->present_pages = realsize;
+		zone->present_pages = freesize;
+		/*
+		 * Set an approximate value for lowmem here, it will be adjusted
+		 * when the bootmem allocator frees pages into the buddy system.
+		 * And all highmem pages will be managed by the buddy system.
+		 */
+		zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
 #ifdef CONFIG_NUMA
 		zone->node = nid;
-		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
+		zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
 						/ 100;
-		zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
+		zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
 		spin_lock_init(&zone->lock);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..e47d31c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -992,14 +992,16 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        high     %lu"
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
-		   "\n        present  %lu",
+		   "\n        present  %lu"
+		   "\n        managed  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
 		   zone->pages_scanned,
 		   zone->spanned_pages,
-		   zone->present_pages);
+		   zone->present_pages,
+		   zone->managed_pages);
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-20 19:19                 ` Andrew Morton
@ 2012-11-21 15:09                   ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-21 15:09 UTC (permalink / raw)
  To: Andrew Morton, Wen Congyang, David Rientjes
  Cc: Jiang Liu, Maciej Rutecki, Chris Clayton, Rafael J . Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	Jianguo Wu, linux-mm, linux-kernel

If SPARSEMEM is enabled, the kernel won't build page structures for
non-existing pages (holes) within a zone, so provide a more accurate
estimation of the pages occupied by the memmap if there are big holes
within the zone.

And pages for highmem zones' memmap will be allocated from lowmem, so
charge nr_kernel_pages for that.
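
For a rough sense of the effect (assuming 4 KiB pages and a 64-byte
struct page, both of which are configuration dependent), take a zone
with a large hole, e.g.:

	spanned_pages = 1,044,480, present_pages = 500,799
	present_pages + (present_pages >> 4) = 532,098 < spanned_pages
	with SPARSEMEM:  PAGE_ALIGN(500,799 * 64) >> PAGE_SHIFT   =  7,825 memmap pages
	spanned-based:   PAGE_ALIGN(1,044,480 * 64) >> PAGE_SHIFT = 16,320 memmap pages

so a spanned-based estimate would charge the zone for roughly twice the
memmap it actually needs.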

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 mm/page_alloc.c |   26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fc10071..9bbac97 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4442,6 +4442,26 @@ void __init set_pageblock_order(void)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
+static unsigned long calc_memmap_size(unsigned long spanned_pages,
+				      unsigned long present_pages)
+{
+	unsigned long pages = spanned_pages;
+
+	/*
+	 * Provide a more accurate estimation if there are holes within
+	 * the zone and SPARSEMEM is in use. If there are holes within the
+	 * zone, each populated memory region may cost us one or two extra
+	 * memmap pages due to alignment because memmap pages for each
+	 * populated regions may not be naturally aligned on a page boundary.
+	 * So the (present_pages >> 4) heuristic is a tradeoff for that.
+	 */
+	if (spanned_pages > present_pages + (present_pages >> 4) &&
+	    IS_ENABLED(CONFIG_SPARSEMEM))
+		pages = present_pages;
+
+	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
+}
+
 /*
  * Set up the zone data structures:
  *   - mark all pages reserved
@@ -4476,8 +4496,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		 * is used by this zone for memmap. This affects the watermark
 		 * and per-cpu initialisations
 		 */
-		memmap_pages =
-			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
+		memmap_pages = calc_memmap_size(size, realsize);
 		if (freesize >= memmap_pages) {
 			freesize -= memmap_pages;
 			if (memmap_pages)
@@ -4498,6 +4517,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		if (!is_highmem_idx(j))
 			nr_kernel_pages += freesize;
+		/* Charge for highmem memmap if there are enough kernel pages */
+		else if (nr_kernel_pages > memmap_pages * 2)
+			nr_kernel_pages -= memmap_pages;
 		nr_all_pages += freesize;
 
 		zone->spanned_pages = size;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
@ 2012-11-21 15:09                   ` Jiang Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-21 15:09 UTC (permalink / raw)
  To: Andrew Morton, Wen Congyang, David Rientjes
  Cc: Jiang Liu, Maciej Rutecki, Chris Clayton, Rafael J . Wysocki,
	Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko,
	Jianguo Wu, linux-mm, linux-kernel

If SPARSEMEM is enabled, the kernel won't build page structures for
non-existing pages (holes) within a zone, so provide a more accurate
estimation of the pages occupied by the memmap if there are big holes
within the zone.

And pages for highmem zones' memmap will be allocated from lowmem, so
charge nr_kernel_pages for that.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 mm/page_alloc.c |   26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fc10071..9bbac97 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4442,6 +4442,26 @@ void __init set_pageblock_order(void)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
+static unsigned long calc_memmap_size(unsigned long spanned_pages,
+				      unsigned long present_pages)
+{
+	unsigned long pages = spanned_pages;
+
+	/*
+	 * Provide a more accurate estimation if there are holes within
+	 * the zone and SPARSEMEM is in use. If there are holes within the
+	 * zone, each populated memory region may cost us one or two extra
+	 * memmap pages due to alignment because memmap pages for each
+	 * populated regions may not be naturally aligned on a page boundary.
+	 * So the (present_pages >> 4) heuristic is a tradeoff for that.
+	 */
+	if (spanned_pages > present_pages + (present_pages >> 4) &&
+	    IS_ENABLED(CONFIG_SPARSEMEM))
+		pages = present_pages;
+
+	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
+}
+
 /*
  * Set up the zone data structures:
  *   - mark all pages reserved
@@ -4476,8 +4496,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		 * is used by this zone for memmap. This affects the watermark
 		 * and per-cpu initialisations
 		 */
-		memmap_pages =
-			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
+		memmap_pages = calc_memmap_size(size, realsize);
 		if (freesize >= memmap_pages) {
 			freesize -= memmap_pages;
 			if (memmap_pages)
@@ -4498,6 +4517,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		if (!is_highmem_idx(j))
 			nr_kernel_pages += freesize;
+		/* Charge for highmem memmap if there are enough kernel pages */
+		else if (nr_kernel_pages > memmap_pages * 2)
+			nr_kernel_pages -= memmap_pages;
 		nr_all_pages += freesize;
 
 		zone->spanned_pages = size;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone
  2012-11-21 14:36                   ` Jiang Liu
@ 2012-11-21 19:31                     ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-21 19:31 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Wed, 21 Nov 2012 22:36:56 +0800
Jiang Liu <liuj97@gmail.com> wrote:

> > void mod_zone_managed_pages(struct zone *zone, signed long delta)
> > {
> > 	WARN_ON(system_state != SYSTEM_BOOTING &&
> > 		!is_locked_memory_hotplug());
> > 	zone->managed_pages += delta;
> > }
> This seems a little overhead because __free_pages_bootmem() is on the hot path
> and will be called many times at boot time.

Maybe, maybe not.  These things are measurable so let's not just guess.

But I'm not really recommending that we do this - there are all sorts
of things we *could* check and warn about, but we don't.  Potential
errors in this area don't seem terribly important.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone
@ 2012-11-21 19:31                     ` Andrew Morton
  0 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-21 19:31 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Wed, 21 Nov 2012 22:36:56 +0800
Jiang Liu <liuj97@gmail.com> wrote:

> > void mod_zone_managed_pages(struct zone *zone, signed long delta)
> > {
> > 	WARN_ON(system_state != SYSTEM_BOOTING &&
> > 		!is_locked_memory_hotplug());
> > 	zone->managed_pages += delta;
> > }
> This seems a little overhead because __free_pages_bootmem() is on the hot path
> and will be called many times at boot time.

Maybe, maybe not.  These things are measurable so let's not just guess.

But I'm not really recommending that we do this - there are all sorts
of things we *could* check and warn about, but we don't.  Potential
errors in this area don't seem terribly important.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-21 14:52                   ` Jiang Liu
@ 2012-11-21 19:35                     ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-21 19:35 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Wed, 21 Nov 2012 22:52:29 +0800
Jiang Liu <liuj97@gmail.com> wrote:

> On 11/21/2012 03:19 AM, Andrew Morton wrote:
> > On Tue, 20 Nov 2012 23:18:34 +0800
> > Jiang Liu <liuj97@gmail.com> wrote:
> > 
> >>>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> >>>> +				      unsigned long present_pages)
> >>>> +{
> >>>> +	unsigned long pages = spanned_pages;
> >>>> +
> >>>> +	/*
> >>>> +	 * Provide a more accurate estimation if there are big holes within
> >>>> +	 * the zone and SPARSEMEM is in use.
> >>>> +	 */
> >>>> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
> >>>> +	    IS_ENABLED(CONFIG_SPARSEMEM))
> >>>> +		pages = present_pages;
> >>>> +
> >>>> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> >>>> +}
> >>>
> >>> Please explain the ">> 4" heuristc more completely - preferably in both
> >>> the changelog and code comments.  Why can't we calculate this
> >>> requirement exactly?  That might require a second pass, but that's OK for
> >>> code like this?
> >> Hi Andrew,
> >> 	A normal x86 platform always have some holes within the DMA ZONE,
> >> so the ">> 4" heuristic is to avoid applying this adjustment to the DMA
> >> ZONE on x86 platforms. 
> >> 	Because the memmap_size is just an estimation, I feel it's OK to
> >> remove the ">> 4" heuristic, that shouldn't affect much.
> > 
> > Again: why can't we calculate this requirement exactly?  That might
> > require a second pass, but that's OK for code like this?
> 
> Hi Andrew,
> 	If there are holes within a zone, it may cost us one or two extra pages
> for each populated region within the zone due to alignment because memmap for 
> each populated regions may not naturally aligned on page boundary.

Right.  So with an additional pass across the zone and a bit of
arithmetic, we can calculate the exact space requirement for memmap?
No need for kludgy heuristics?

> Originally the ">> 4" heuristic is to trade off these extra memmap pages,
> especially for small zones linke DMA zone.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap
@ 2012-11-21 19:35                     ` Andrew Morton
  0 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-21 19:35 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Wed, 21 Nov 2012 22:52:29 +0800
Jiang Liu <liuj97@gmail.com> wrote:

> On 11/21/2012 03:19 AM, Andrew Morton wrote:
> > On Tue, 20 Nov 2012 23:18:34 +0800
> > Jiang Liu <liuj97@gmail.com> wrote:
> > 
> >>>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> >>>> +				      unsigned long present_pages)
> >>>> +{
> >>>> +	unsigned long pages = spanned_pages;
> >>>> +
> >>>> +	/*
> >>>> +	 * Provide a more accurate estimation if there are big holes within
> >>>> +	 * the zone and SPARSEMEM is in use.
> >>>> +	 */
> >>>> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
> >>>> +	    IS_ENABLED(CONFIG_SPARSEMEM))
> >>>> +		pages = present_pages;
> >>>> +
> >>>> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> >>>> +}
> >>>
> >>> Please explain the ">> 4" heuristc more completely - preferably in both
> >>> the changelog and code comments.  Why can't we calculate this
> >>> requirement exactly?  That might require a second pass, but that's OK for
> >>> code like this?
> >> Hi Andrew,
> >> 	A normal x86 platform always have some holes within the DMA ZONE,
> >> so the ">> 4" heuristic is to avoid applying this adjustment to the DMA
> >> ZONE on x86 platforms. 
> >> 	Because the memmap_size is just an estimation, I feel it's OK to
> >> remove the ">> 4" heuristic, that shouldn't affect much.
> > 
> > Again: why can't we calculate this requirement exactly?  That might
> > require a second pass, but that's OK for code like this?
> 
> Hi Andrew,
> 	If there are holes within a zone, it may cost us one or two extra pages
> for each populated region within the zone due to alignment because memmap for 
> each populated regions may not naturally aligned on page boundary.

Right.  So with an additional pass across the zone and a bit of
arithmetic, we can calculate the exact space requirement for memmap?
No need for kludgy heuristics?

> Originally the ">> 4" heuristic is to trade off these extra memmap pages,
> especially for small zones linke DMA zone.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
  2012-11-18 20:36           ` Chris Clayton
@ 2012-11-22  9:23             ` Chris Clayton
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Clayton @ 2012-11-22  9:23 UTC (permalink / raw)
  To: Chris Clayton
  Cc: Jiang Liu, Andrew Morton, Wen Congyang, David Rientjes,
	Jiang Liu, Maciej Rutecki, Rafael J . Wysocki, Mel Gorman,
	Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu,
	linux-mm, linux-kernel

>> This patchset has only been tested on x86_64 with nobootmem.c. So need
>> help to test this patchset on machines:
>> 1) use bootmem.c
>> 2) have highmem
>>
>> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>
>
> I've applied the five patches to Linus' 3.7.0-rc6 and can confirm that
> the kernel allows my system to resume from a suspend to disc. Although
> my laptop is 64 bit, I run a 32 bit kernel with HIGHMEM (I have 8GB RAM):
>
> [chris:~/kernel/tmp/linux-3.7-rc6-resume]$ grep -E HIGHMEM\|X86_32 .config
> CONFIG_X86_32=y
> CONFIG_X86_32_SMP=y
> CONFIG_X86_32_LAZY_GS=y
> # CONFIG_X86_32_IRIS is not set
> # CONFIG_NOHIGHMEM is not set
> # CONFIG_HIGHMEM4G is not set
> CONFIG_HIGHMEM64G=y
> CONFIG_HIGHMEM=y
>
> I can also say that a quick browse of the output of dmesg, shows nothing
> out of the ordinary. I have insufficient knowledge to comment on the
> patches, but I will run the kernel over the next few days and report
> back later in the week.
>

Well, I've been running the kernel since Sunday and have had no problems 
with my normal work mix of browsing the internet, video 
editing, listening to music and building software. I'm now running a 
kernel built with the new patches 1 and 4 from yesterday (plus the 
original 1, 2 and 5). All seems OK so far, including a couple of resumes 
from suspend to disk.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
@ 2012-11-22  9:23             ` Chris Clayton
  0 siblings, 0 replies; 98+ messages in thread
From: Chris Clayton @ 2012-11-22  9:23 UTC (permalink / raw)
  To: Chris Clayton
  Cc: Jiang Liu, Andrew Morton, Wen Congyang, David Rientjes,
	Jiang Liu, Maciej Rutecki, Rafael J . Wysocki, Mel Gorman,
	Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu,
	linux-mm, linux-kernel

>> This patchset has only been tested on x86_64 with nobootmem.c. So need
>> help to test this patchset on machines:
>> 1) use bootmem.c
>> 2) have highmem
>>
>> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>
>
> I've applied the five patches to Linus' 3.7.0-rc6 and can confirm that
> the kernel allows my system to resume from a suspend to disc. Although
> my laptop is 64 bit, I run a 32 bit kernel with HIGHMEM (I have 8GB RAM):
>
> [chris:~/kernel/tmp/linux-3.7-rc6-resume]$ grep -E HIGHMEM\|X86_32 .config
> CONFIG_X86_32=y
> CONFIG_X86_32_SMP=y
> CONFIG_X86_32_LAZY_GS=y
> # CONFIG_X86_32_IRIS is not set
> # CONFIG_NOHIGHMEM is not set
> # CONFIG_HIGHMEM4G is not set
> CONFIG_HIGHMEM64G=y
> CONFIG_HIGHMEM=y
>
> I can also say that a quick browse of the output of dmesg, shows nothing
> out of the ordinary. I have insufficient knowledge to comment on the
> patches, but I will run the kernel over the next few days and report
> back later in the week.
>

Well, I've been running the kernel since Sunday and have had no problems 
with my normal work mix of browsing the internet, video
editing, listening to music and building software. I'm now running a
kernel built with the new patches 1 and 4 from yesterday (plus the
original 1, 2 and 5). All seems OK so far, including a couple of resumes 
from suspend to disk.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-21 19:35                     ` Andrew Morton
@ 2012-11-22 16:17                       ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-22 16:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On 11/22/2012 03:35 AM, Andrew Morton wrote:
> On Wed, 21 Nov 2012 22:52:29 +0800
> Jiang Liu <liuj97@gmail.com> wrote:
> 
>> On 11/21/2012 03:19 AM, Andrew Morton wrote:
>>> On Tue, 20 Nov 2012 23:18:34 +0800
>>> Jiang Liu <liuj97@gmail.com> wrote:
>>>
>>>>>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
>>>>>> +				      unsigned long present_pages)
>>>>>> +{
>>>>>> +	unsigned long pages = spanned_pages;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Provide a more accurate estimation if there are big holes within
>>>>>> +	 * the zone and SPARSEMEM is in use.
>>>>>> +	 */
>>>>>> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
>>>>>> +	    IS_ENABLED(CONFIG_SPARSEMEM))
>>>>>> +		pages = present_pages;
>>>>>> +
>>>>>> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
>>>>>> +}
>>>>>
>>>>> Please explain the ">> 4" heuristc more completely - preferably in both
>>>>> the changelog and code comments.  Why can't we calculate this
>>>>> requirement exactly?  That might require a second pass, but that's OK for
>>>>> code like this?
>>>> Hi Andrew,
>>>> 	A normal x86 platform always have some holes within the DMA ZONE,
>>>> so the ">> 4" heuristic is to avoid applying this adjustment to the DMA
>>>> ZONE on x86 platforms. 
>>>> 	Because the memmap_size is just an estimation, I feel it's OK to
>>>> remove the ">> 4" heuristic, that shouldn't affect much.
>>>
>>> Again: why can't we calculate this requirement exactly?  That might
>>> require a second pass, but that's OK for code like this?
>>
>> Hi Andrew,
>> 	If there are holes within a zone, it may cost us one or two extra pages
>> for each populated region within the zone due to alignment because memmap for 
>> each populated regions may not naturally aligned on page boundary.
> 
> Right.  So with an additional pass across the zone and a bit of
> arithmetic, we can calculate the exact space requirement for memmap?
> No need for kludgy heuristics?
Hi Andrew:
	Happy Thanksgiving!

The way to calculate the exact space requirement for memmap seems a little
complex; it depends on:
CONFIG_SPARSEMEM_VMEMMAP
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
arch implementation of alloc_remap()

Actually the original motivation is to reduce the deviation on a platform
such as:
node 0: 0-2G,4G-255G (a 2G hole between 2-4G)
node 1: 256G - 511G
node 2: 512G - 767G
node 3: 768G - 1023G
node 0: 1024G - 1026G (memory recovered from the hole)

So I just tried to reduce the deviation rather than calculate the memmap size exactly.
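
For a rough sense of scale (assuming 4 KiB pages and a 64-byte struct
page, both configuration dependent): node 0 above spans 0 - 1026G,
about 269 million page frames, while only about 255G (about 67 million
page frames) are present. A memmap estimate based on spanned_pages
would charge node 0 roughly 16G for memmap versus roughly 4G actually
needed, a deviation of about 3 million pages of node 0 memory wrongly
accounted to memmap.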

Regards!
Gerry

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap
@ 2012-11-22 16:17                       ` Jiang Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-11-22 16:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On 11/22/2012 03:35 AM, Andrew Morton wrote:
> On Wed, 21 Nov 2012 22:52:29 +0800
> Jiang Liu <liuj97@gmail.com> wrote:
> 
>> On 11/21/2012 03:19 AM, Andrew Morton wrote:
>>> On Tue, 20 Nov 2012 23:18:34 +0800
>>> Jiang Liu <liuj97@gmail.com> wrote:
>>>
>>>>>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
>>>>>> +				      unsigned long present_pages)
>>>>>> +{
>>>>>> +	unsigned long pages = spanned_pages;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Provide a more accurate estimation if there are big holes within
>>>>>> +	 * the zone and SPARSEMEM is in use.
>>>>>> +	 */
>>>>>> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
>>>>>> +	    IS_ENABLED(CONFIG_SPARSEMEM))
>>>>>> +		pages = present_pages;
>>>>>> +
>>>>>> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
>>>>>> +}
>>>>>
>>>>> Please explain the ">> 4" heuristc more completely - preferably in both
>>>>> the changelog and code comments.  Why can't we calculate this
>>>>> requirement exactly?  That might require a second pass, but that's OK for
>>>>> code like this?
>>>> Hi Andrew,
>>>> 	A normal x86 platform always have some holes within the DMA ZONE,
>>>> so the ">> 4" heuristic is to avoid applying this adjustment to the DMA
>>>> ZONE on x86 platforms. 
>>>> 	Because the memmap_size is just an estimation, I feel it's OK to
>>>> remove the ">> 4" heuristic, that shouldn't affect much.
>>>
>>> Again: why can't we calculate this requirement exactly?  That might
>>> require a second pass, but that's OK for code like this?
>>
>> Hi Andrew,
>> 	If there are holes within a zone, it may cost us one or two extra pages
>> for each populated region within the zone due to alignment because memmap for 
>> each populated regions may not naturally aligned on page boundary.
> 
> Right.  So with an additional pass across the zone and a bit of
> arithmetic, we can calculate the exact space requirement for memmap?
> No need for kludgy heuristics?
Hi Andrew:
	Happy Thanksgiving!

The way to calculate the exact space requirement for memmap seems a little
complex; it depends on:
CONFIG_SPARSEMEM_VMEMMAP
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
arch implementation of alloc_remap()

Actually the original motivation is to reduce the deviation on a platform
such as:
node 0: 0-2G,4G-255G (a 2G hole between 2-4G)
node 1: 256G - 511G
node 2: 512G - 767G
node 3: 768G - 1023G
node 0: 1024G - 1026G (memory recovered from the hole)

So I just tried to reduce the deviation rather than calculate the memmap size exactly.

Regards!
Gerry


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
  2012-11-22  9:23             ` Chris Clayton
@ 2012-11-26  9:46               ` Chris Clayton
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Clayton @ 2012-11-26  9:46 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel



On 11/22/12 09:23, Chris Clayton wrote:
>>> This patchset has only been tested on x86_64 with nobootmem.c. So need
>>> help to test this patchset on machines:
>>> 1) use bootmem.c
>>> 2) have highmem
>>>
>>> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>>
>>
>> I've applied the five patches to Linus' 3.7.0-rc6 and can confirm that
>> the kernel allows my system to resume from a suspend to disc. Although
>> my laptop is 64 bit, I run a 32 bit kernel with HIGHMEM (I have 8GB RAM):
>>
>> [chris:~/kernel/tmp/linux-3.7-rc6-resume]$ grep -E HIGHMEM\|X86_32
>> .config
>> CONFIG_X86_32=y
>> CONFIG_X86_32_SMP=y
>> CONFIG_X86_32_LAZY_GS=y
>> # CONFIG_X86_32_IRIS is not set
>> # CONFIG_NOHIGHMEM is not set
>> # CONFIG_HIGHMEM4G is not set
>> CONFIG_HIGHMEM64G=y
>> CONFIG_HIGHMEM=y
>>
>> I can also say that a quick browse of the output of dmesg, shows nothing
>> out of the ordinary. I have insufficient knowledge to comment on the
>> patches, but I will run the kernel over the next few days and report
>> back later in the week.
>>
>
> Well, I've been running the kernel since Sunday and have had no problems
> with my normal work mix of browsing, browsing the internet, video
> editing, listening to music and building software. I'm now running a
> kernel that build with the new patches 1 and 4 from yesterday (plus the
> original 1, 2 and 5). All seems OK so far, including a couple of resumes
> from suspend to disk.
>
>

-rc6 with Gerry's patches has run fine, including numerous resumes from 
suspend to disk, which fails (freezing or rebooting) without the 
patches. I've now applied the patches to -rc7 (they apply with a few 
offsets, but look OK) and will run that for a day or two.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
@ 2012-11-26  9:46               ` Chris Clayton
  0 siblings, 0 replies; 98+ messages in thread
From: Chris Clayton @ 2012-11-26  9:46 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel



On 11/22/12 09:23, Chris Clayton wrote:
>>> This patchset has only been tested on x86_64 with nobootmem.c. So need
>>> help to test this patchset on machines:
>>> 1) use bootmem.c
>>> 2) have highmem
>>>
>>> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>>
>>
>> I've applied the five patches to Linus' 3.7.0-rc6 and can confirm that
>> the kernel allows my system to resume from a suspend to disc. Although
>> my laptop is 64 bit, I run a 32 bit kernel with HIGHMEM (I have 8GB RAM):
>>
>> [chris:~/kernel/tmp/linux-3.7-rc6-resume]$ grep -E HIGHMEM\|X86_32
>> .config
>> CONFIG_X86_32=y
>> CONFIG_X86_32_SMP=y
>> CONFIG_X86_32_LAZY_GS=y
>> # CONFIG_X86_32_IRIS is not set
>> # CONFIG_NOHIGHMEM is not set
>> # CONFIG_HIGHMEM4G is not set
>> CONFIG_HIGHMEM64G=y
>> CONFIG_HIGHMEM=y
>>
>> I can also say that a quick browse of the output of dmesg, shows nothing
>> out of the ordinary. I have insufficient knowledge to comment on the
>> patches, but I will run the kernel over the next few days and report
>> back later in the week.
>>
>
> Well, I've been running the kernel since Sunday and have had no problems
> with my normal work mix of browsing, browsing the internet, video
> editing, listening to music and building software. I'm now running a
> kernel that build with the new patches 1 and 4 from yesterday (plus the
> original 1, 2 and 5). All seems OK so far, including a couple of resumes
> from suspend to disk.
>
>

-rc6 with Gerry's patches has run fine, including numerous resumes from 
suspend to disk, which fails (freezing or rebooting) without the 
patches. I've now applied the patches to -rc7 (they apply with a few 
offsets, but look OK) and will run that for a day or two.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-21 15:09                   ` Jiang Liu
@ 2012-11-28 23:52                     ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-28 23:52 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Wed, 21 Nov 2012 23:09:46 +0800
Jiang Liu <liuj97@gmail.com> wrote:

> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap

How are people to test this?  "does it boot"?

> If SPARSEMEM is enabled, it won't build page structures for
> non-existing pages (holes) within a zone, so provide a more accurate
> estimation of pages occupied by memmap if there are bigger holes within
> the zone.
> 
> And pages for highmem zones' memmap will be allocated from lowmem, so
> charge nr_kernel_pages for that.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4442,6 +4442,26 @@ void __init set_pageblock_order(void)
>  
>  #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>  
> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> +				      unsigned long present_pages)
> +{
> +	unsigned long pages = spanned_pages;
> +
> +	/*
> +	 * Provide a more accurate estimation if there are holes within
> +	 * the zone and SPARSEMEM is in use. If there are holes within the
> +	 * zone, each populated memory region may cost us one or two extra
> +	 * memmap pages due to alignment because memmap pages for each
> +	 * populated regions may not naturally algined on page boundary.
> +	 * So the (present_pages >> 4) heuristic is a tradeoff for that.
> +	 */
> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
> +	    IS_ENABLED(CONFIG_SPARSEMEM))
> +		pages = present_pages;
> +
> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> +}
> +

I spose we should do this, although it makes no difference as the
compiler will inline calc_memmap_size() into its caller:

--- a/mm/page_alloc.c~mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap-fix
+++ a/mm/page_alloc.c
@@ -4526,8 +4526,8 @@ void __init set_pageblock_order(void)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
-static unsigned long calc_memmap_size(unsigned long spanned_pages,
-				      unsigned long present_pages)
+static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
+						   unsigned long present_pages)
 {
 	unsigned long pages = spanned_pages;
 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
@ 2012-11-28 23:52                     ` Andrew Morton
  0 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-11-28 23:52 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, David Rientjes, Jiang Liu, Maciej Rutecki,
	Chris Clayton, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Wed, 21 Nov 2012 23:09:46 +0800
Jiang Liu <liuj97@gmail.com> wrote:

> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap

How are people to test this?  "does it boot"?

> If SPARSEMEM is enabled, it won't build page structures for
> non-existing pages (holes) within a zone, so provide a more accurate
> estimation of pages occupied by memmap if there are bigger holes within
> the zone.
> 
> And pages for highmem zones' memmap will be allocated from lowmem, so
> charge nr_kernel_pages for that.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4442,6 +4442,26 @@ void __init set_pageblock_order(void)
>  
>  #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>  
> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> +				      unsigned long present_pages)
> +{
> +	unsigned long pages = spanned_pages;
> +
> +	/*
> +	 * Provide a more accurate estimation if there are holes within
> +	 * the zone and SPARSEMEM is in use. If there are holes within the
> +	 * zone, each populated memory region may cost us one or two extra
> +	 * memmap pages due to alignment because memmap pages for each
> +	 * populated regions may not naturally algined on page boundary.
> +	 * So the (present_pages >> 4) heuristic is a tradeoff for that.
> +	 */
> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
> +	    IS_ENABLED(CONFIG_SPARSEMEM))
> +		pages = present_pages;
> +
> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> +}
> +

I spose we should do this, although it makes no difference as the
compiler will inline calc_memmap_size() into its caller:

--- a/mm/page_alloc.c~mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap-fix
+++ a/mm/page_alloc.c
@@ -4526,8 +4526,8 @@ void __init set_pageblock_order(void)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
-static unsigned long calc_memmap_size(unsigned long spanned_pages,
-				      unsigned long present_pages)
+static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
+						   unsigned long present_pages)
 {
 	unsigned long pages = spanned_pages;
 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-28 23:52                     ` Andrew Morton
@ 2012-11-29  2:25                       ` Jianguo Wu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jianguo Wu @ 2012-11-29  2:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiang Liu, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Chris Clayton, Rafael J . Wysocki, Mel Gorman,
	Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm,
	linux-kernel

On 2012/11/29 7:52, Andrew Morton wrote:

> On Wed, 21 Nov 2012 23:09:46 +0800
> Jiang Liu <liuj97@gmail.com> wrote:
> 
>> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
> 
> How are people to test this?  "does it boot"?
> 

I have tested this on x86_64, and it does boot.

Node 0, zone      DMA
  pages free     3972
        min      1
        low      1
        high     1
        scanned  0
        spanned  4080
        present  3979
        managed  3972

Node 0, zone    DMA32
  pages free     448783
        min      172
        low      215
        high     258
        scanned  0
        spanned  1044480
        present  500799
        managed  444545

Node 0, zone   Normal
  pages free     2375547
        min      1394
        low      1742
        high     2091
        scanned  0
        spanned  3670016
        present  3670016
        managed  3585105
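
(For reference, the present/managed gaps above are 7, 56,254 and
84,911 pages for DMA, DMA32 and Normal respectively: the pages that
exist in each zone but were never released to the buddy allocator,
i.e. memmap, dma_reserve and other boot-time allocations.)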

Thanks,
Jianguo Wu

>> If SPARSEMEM is enabled, it won't build page structures for
>> non-existing pages (holes) within a zone, so provide a more accurate
>> estimation of pages occupied by memmap if there are bigger holes within
>> the zone.
>>
>> And pages for highmem zones' memmap will be allocated from lowmem, so
>> charge nr_kernel_pages for that.
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4442,6 +4442,26 @@ void __init set_pageblock_order(void)
>>  
>>  #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>>  
>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
>> +				      unsigned long present_pages)
>> +{
>> +	unsigned long pages = spanned_pages;
>> +
>> +	/*
>> +	 * Provide a more accurate estimation if there are holes within
>> +	 * the zone and SPARSEMEM is in use. If there are holes within the
>> +	 * zone, each populated memory region may cost us one or two extra
>> +	 * memmap pages due to alignment because memmap pages for each
>> +	 * populated regions may not naturally algined on page boundary.
>> +	 * So the (present_pages >> 4) heuristic is a tradeoff for that.
>> +	 */
>> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
>> +	    IS_ENABLED(CONFIG_SPARSEMEM))
>> +		pages = present_pages;
>> +
>> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
>> +}
>> +
> 
> I spose we should do this, although it makes no difference as the
> compiler will inline calc_memmap_size() into its caller:
> 
> --- a/mm/page_alloc.c~mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap-fix
> +++ a/mm/page_alloc.c
> @@ -4526,8 +4526,8 @@ void __init set_pageblock_order(void)
>  
>  #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>  
> -static unsigned long calc_memmap_size(unsigned long spanned_pages,
> -				      unsigned long present_pages)
> +static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
> +						   unsigned long present_pages)
>  {
>  	unsigned long pages = spanned_pages;
>  
> 
> 
> .
> 




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
@ 2012-11-29  2:25                       ` Jianguo Wu
  0 siblings, 0 replies; 98+ messages in thread
From: Jianguo Wu @ 2012-11-29  2:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiang Liu, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Chris Clayton, Rafael J . Wysocki, Mel Gorman,
	Minchan Kim, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm,
	linux-kernel

On 2012/11/29 7:52, Andrew Morton wrote:

> On Wed, 21 Nov 2012 23:09:46 +0800
> Jiang Liu <liuj97@gmail.com> wrote:
> 
>> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
> 
> How are people to test this?  "does it boot"?
> 

I have tested this on x86_64, and it does boot.

Node 0, zone      DMA
  pages free     3972
        min      1
        low      1
        high     1
        scanned  0
        spanned  4080
        present  3979
        managed  3972

Node 0, zone    DMA32
  pages free     448783
        min      172
        low      215
        high     258
        scanned  0
        spanned  1044480
        present  500799
        managed  444545

Node 0, zone   Normal
  pages free     2375547
        min      1394
        low      1742
        high     2091
        scanned  0
        spanned  3670016
        present  3670016
        managed  3585105

Thanks,
Jianguo Wu

>> If SPARSEMEM is enabled, it won't build page structures for
>> non-existing pages (holes) within a zone, so provide a more accurate
>> estimation of pages occupied by memmap if there are bigger holes within
>> the zone.
>>
>> And pages for highmem zones' memmap will be allocated from lowmem, so
>> charge nr_kernel_pages for that.
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4442,6 +4442,26 @@ void __init set_pageblock_order(void)
>>  
>>  #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>>  
>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
>> +				      unsigned long present_pages)
>> +{
>> +	unsigned long pages = spanned_pages;
>> +
>> +	/*
>> +	 * Provide a more accurate estimation if there are holes within
>> +	 * the zone and SPARSEMEM is in use. If there are holes within the
>> +	 * zone, each populated memory region may cost us one or two extra
>> +	 * memmap pages due to alignment because memmap pages for each
>> +	 * populated regions may not naturally algined on page boundary.
>> +	 * So the (present_pages >> 4) heuristic is a tradeoff for that.
>> +	 */
>> +	if (spanned_pages > present_pages + (present_pages >> 4) &&
>> +	    IS_ENABLED(CONFIG_SPARSEMEM))
>> +		pages = present_pages;
>> +
>> +	return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
>> +}
>> +
> 
> I spose we should do this, although it makes no difference as the
> compiler will inline calc_memmap_size() into its caller:
> 
> --- a/mm/page_alloc.c~mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap-fix
> +++ a/mm/page_alloc.c
> @@ -4526,8 +4526,8 @@ void __init set_pageblock_order(void)
>  
>  #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>  
> -static unsigned long calc_memmap_size(unsigned long spanned_pages,
> -				      unsigned long present_pages)
> +static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
> +						   unsigned long present_pages)
>  {
>  	unsigned long pages = spanned_pages;
>  
> 
> 
> .
> 




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-28 23:52                     ` Andrew Morton
@ 2012-11-29 10:52                       ` Chris Clayton
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Clayton @ 2012-11-29 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiang Liu, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On 11/28/12 23:52, Andrew Morton wrote:
> On Wed, 21 Nov 2012 23:09:46 +0800
> Jiang Liu <liuj97@gmail.com> wrote:
>
>> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
>
> How are people to test this?  "does it boot"?
>

I've been running kernels with Gerry's 5 patches applied for 11 days 
now. This is on a 64bit laptop but with a 32bit kernel + HIGHMEM. I 
joined the conversation because my laptop would not resume from suspend 
to disk - it either froze or rebooted. With the patches applied the 
laptop does successfully resume and has been stable.

Since Monday, I have been running a kernel with the patches (plus, 
from today, the patch you mailed yesterday) applied to 3.7rc7, without 
problems.

Thanks,
Chris

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-11-29 10:52                       ` Chris Clayton
@ 2012-12-02 19:55                         ` Chris Clayton
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Clayton @ 2012-12-02 19:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiang Liu, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel



On 11/29/12 10:52, Chris Clayton wrote:
> On 11/28/12 23:52, Andrew Morton wrote:
>> On Wed, 21 Nov 2012 23:09:46 +0800
>> Jiang Liu <liuj97@gmail.com> wrote:
>>
>>> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation
>>> of pages occupied by memmap
>>
>> How are people to test this?  "does it boot"?
>>
>
> I've been running kernels with Gerry's 5 patches applied for 11 days
> now. This is on a 64bit laptop but with a 32bit kernel + HIGHMEM. I
> joined the conversation because my laptop would not resume from suspend
> to disk - it either froze or rebooted. With the patches applied the
> laptop does successfully resume and has been stable.
>
> Since Monday, I have been running a kernel with the patches (plus, 
> from today, the patch you mailed yesterday) applied to 3.7rc7, without
> problems.
>

I've been running 3.7-rc7 with the patches listed below for a week now 
and it has been perfectly stable. In particular, my laptop will now 
successfully resume from suspend to disk, which always failed without 
the patches.

 From Jiang Liu:
1. [RFT PATCH v2 1/5] mm: introduce new field "managed_pages" to struct zone
2. [RFT PATCH v1 2/5] mm: replace zone->present_pages with 
zone->managed_pages if appreciated
3. [RFT PATCH v1 3/5] mm: set zone->present_pages to number of existing 
pages in the zone
4. [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages 
occupied by memmap
5. [RFT PATCH v1 5/5] mm: increase totalram_pages when free pages 
allocated by bootmem allocator

 From Andrew Morton:
6. mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap.patch

Tested-by: Chris Clayton <chris2553@googlemail.com>

> Thanks,
> Chris
>
>
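As a quick reading aid for the patch titles listed above, the relationship the
series sets up between the three per-zone counters can be sketched as below.
The comments paraphrase the patch descriptions rather than quoting the final
in-tree documentation, and the only real figures are the present/managed
numbers Jianguo Wu posted earlier in the thread:

#include <assert.h>
#include <stdio.h>

/*
 * Paraphrased meaning of the counters (see the patches themselves for
 * the authoritative definitions):
 *   spanned_pages - pages the zone spans, holes included
 *   present_pages - pages that physically exist within that span
 *   managed_pages - present pages actually handed to the buddy
 *                   allocator, i.e. excluding bootmem reservations
 * Expected invariant: spanned_pages >= present_pages >= managed_pages.
 */
int main(void)
{
	/* present/managed figures reported by Jianguo Wu in this thread */
	unsigned long present = 3670016, managed = 3585105;

	assert(present >= managed);
	printf("pages not handed to the buddy allocator: %lu (~%lu MiB at 4 KiB/page)\n",
	       present - managed, (present - managed) >> 8);
	return 0;
}

That roughly 331 MiB gap is the kind of memory (memmap, bootmem allocations
and other reservations) that the series wants watermark-style calculations to
stop treating as allocatable.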

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-12-02 19:55                         ` Chris Clayton
@ 2012-12-03  7:26                           ` Chris Clayton
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Clayton @ 2012-12-03  7:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiang Liu, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel



On 12/02/12 19:55, Chris Clayton wrote:
>
>
> On 11/29/12 10:52, Chris Clayton wrote:
>> On 11/28/12 23:52, Andrew Morton wrote:
>>> On Wed, 21 Nov 2012 23:09:46 +0800
>>> Jiang Liu <liuj97@gmail.com> wrote:
>>>
>>>> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation
>>>> of pages occupied by memmap
>>>
>>> How are people to test this?  "does it boot"?
>>>
>>
>> I've been running kernels with Gerry's 5 patches applied for 11 days
>> now. This is on a 64bit laptop but with a 32bit kernel + HIGHMEM. I
>> joined the conversation because my laptop would not resume from suspend
>> to disk - it either froze or rebooted. With the patches applied the
>> laptop does successfully resume and has been stable.
>>
>> Since Monday, I have been running a kernel with the patches (plus,
>> from today, the patch you mailed yesterday) applied to 3.7rc7, without
>> problems.
>>
>
> I've been running 3.7-rc7 with the patches listed below for a week now
> and it has been perfectly stable. In particular, my laptop will now
> successfully resume from suspend to disk, which always failed without
> the patches.
>

I should have said, of course, that it was -rc6 and earlier that would 
not boot without Jiang Liu's patches. I applied those patches to -rc6 
and my resume after suspend to disk problem was fixed. For a subsequent 
week I have been running with the patches applied to -rc7, with Andrew's 
patch also applied for the last 3 days. -rc7 was not subject to the 
resume problem because the patch which broke it had been reverted.
All this has been on a 64bit laptop, but running a 32bit kernel with 
HIGHMEM.

Apologies for yesterday's inaccuracy. I shouldn't send testing reports 
when I'm in a hurry.

>  From Jiang Liu:
> 1. [RFT PATCH v2 1/5] mm: introduce new field "managed_pages" to struct
> zone
> 2. [RFT PATCH v1 2/5] mm: replace zone->present_pages with
> zone->managed_pages if appreciated
> 3. [RFT PATCH v1 3/5] mm: set zone->present_pages to number of existing
> pages in the zone
> 4. [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages
> occupied by memmap
> 5. [RFT PATCH v1 5/5] mm: increase totalram_pages when free pages
> allocated by bootmem allocator
>
>  From Andrew Morton:
> 6. mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap.patch
>
> Tested-by: Chris Clayton <chris2553@googlemail.com>
>
>> Thanks,
>> Chris
>>
>>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-12-02 19:55                         ` Chris Clayton
@ 2012-12-03 23:17                           ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2012-12-03 23:17 UTC (permalink / raw)
  To: Chris Clayton
  Cc: Jiang Liu, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On Sun, 02 Dec 2012 19:55:09 +0000
Chris Clayton <chris2553@googlemail.com> wrote:

> 
> 
> On 11/29/12 10:52, Chris Clayton wrote:
> > On 11/28/12 23:52, Andrew Morton wrote:
> >> On Wed, 21 Nov 2012 23:09:46 +0800
> >> Jiang Liu <liuj97@gmail.com> wrote:
> >>
> >>> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation
> >>> of pages occupied by memmap
> >>
> >> How are people to test this?  "does it boot"?
> >>
> >
> > I've been running kernels with Gerry's 5 patches applied for 11 days
> > now. This is on a 64bit laptop but with a 32bit kernel + HIGHMEM. I
> > joined the conversation because my laptop would not resume from suspend
> > to disk - it either froze or rebooted. With the patches applied the
> > laptop does successfully resume and has been stable.
> >
> > Since Monday, I have been running a kernel with the patches (plus,
> > from today, the patch you mailed yesterday) applied to 3.7rc7, without
> > problems.
> >
> 
> I've been running 3.7-rc7 with the patches listed below for a week now 
> and it has been perfectly stable. In particular, my laptop will now 
> successfully resume from suspend to disk, which always failed without 
> the patches.
> 
>  From Jiang Liu:
> 1. [RFT PATCH v2 1/5] mm: introduce new field "managed_pages" to struct zone
> 2. [RFT PATCH v1 2/5] mm: replace zone->present_pages with 
> zone->managed_pages if appreciated
> 3. [RFT PATCH v1 3/5] mm: set zone->present_pages to number of existing 
> pages in the zone
> 4. [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages 
> occupied by memmap
> 5. [RFT PATCH v1 5/5] mm: increase totalram_pages when free pages 
> allocated by bootmem allocator
> 
>  From Andrew Morton:
> 6. mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap.patch
> 
> Tested-by: Chris Clayton <chris2553@googlemail.com>

Thanks.

I have only two of these five patches queued for 3.8:
mm-introduce-new-field-managed_pages-to-struct-zone.patch and
mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap.patch. 
I don't recall what happened with the other three.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-12-03 23:17                           ` Andrew Morton
@ 2012-12-04  1:21                             ` Jiang Liu
  -1 siblings, 0 replies; 98+ messages in thread
From: Jiang Liu @ 2012-12-04  1:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Clayton, Jiang Liu, Wen Congyang, David Rientjes,
	Maciej Rutecki, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel

On 2012-12-4 7:17, Andrew Morton wrote:
> On Sun, 02 Dec 2012 19:55:09 +0000
> Chris Clayton <chris2553@googlemail.com> wrote:
> 
>>
>>
>> On 11/29/12 10:52, Chris Clayton wrote:
>>> On 11/28/12 23:52, Andrew Morton wrote:
>>>> On Wed, 21 Nov 2012 23:09:46 +0800
>>>> Jiang Liu <liuj97@gmail.com> wrote:
>>>>
>>>>> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation
>>>>> of pages occupied by memmap
>>>>
>>>> How are people to test this?  "does it boot"?
>>>>
>>>
>>> I've been running kernels with Gerry's 5 patches applied for 11 days
>>> now. This is on a 64bit laptop but with a 32bit kernel + HIGHMEM. I
>>> joined the conversation because my laptop would not resume from suspend
>>> to disk - it either froze or rebooted. With the patches applied the
>>> laptop does successfully resume and has been stable.
>>>
>>> Since Monday, I have been running a kernel with the patches (plus,
>>> from today, the patch you mailed yesterday) applied to 3.7rc7, without
>>> problems.
>>>
>>
>> I've been running 3.7-rc7 with the patches listed below for a week now 
>> and it has been perfectly stable. In particular, my laptop will now 
>> successfully resume from suspend to disk, which always failed without 
>> the patches.
>>
>>  From Jiang Liu:
>> 1. [RFT PATCH v2 1/5] mm: introduce new field "managed_pages" to struct zone
>> 2. [RFT PATCH v1 2/5] mm: replace zone->present_pages with 
>> zone->managed_pages if appreciated
>> 3. [RFT PATCH v1 3/5] mm: set zone->present_pages to number of existing 
>> pages in the zone
>> 4. [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages 
>> occupied by memmap
>> 5. [RFT PATCH v1 5/5] mm: increase totalram_pages when free pages 
>> allocated by bootmem allocator
>>
>>  From Andrew Morton:
>> 6. mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap.patch
>>
>> Tested-by: Chris Clayton <chris2553@googlemail.com>
> 
> Thanks.
> 
> I have only two of these five patches queued for 3.8:
> mm-introduce-new-field-managed_pages-to-struct-zone.patch and
> mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap.patch. 
> I don't recall what happened with the other three.

Hi Andrew,
	I think the story is that:
1) I sent out all five patches for the first version.
2) I later updated and resent the first and fourth patches according to
   review comments, without touching the other three patches.
	So you may have picked up just the updated two without noticing
the other three from the initial patchset.
	Sorry for the trouble.
Regards!
Gerry


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
  2012-12-03 23:17                           ` Andrew Morton
@ 2012-12-04 10:05                             ` Chris Clayton
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Clayton @ 2012-12-04 10:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiang Liu, Wen Congyang, David Rientjes, Jiang Liu,
	Maciej Rutecki, Rafael J . Wysocki, Mel Gorman, Minchan Kim,
	KAMEZAWA Hiroyuki, Michal Hocko, Jianguo Wu, linux-mm,
	linux-kernel



On 12/03/12 23:17, Andrew Morton wrote:
> On Sun, 02 Dec 2012 19:55:09 +0000
> Chris Clayton <chris2553@googlemail.com> wrote:
>
>>
>>
>> On 11/29/12 10:52, Chris Clayton wrote:
>>> On 11/28/12 23:52, Andrew Morton wrote:
>>>> On Wed, 21 Nov 2012 23:09:46 +0800
>>>> Jiang Liu <liuj97@gmail.com> wrote:
>>>>
>>>>> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation
>>>>> of pages occupied by memmap
>>>>
>>>> How are people to test this?  "does it boot"?
>>>>
>>>
>>> I've been running kernels with Gerry's 5 patches applied for 11 days
>>> now. This is on a 64bit laptop but with a 32bit kernel + HIGHMEM. I
>>> joined the conversation because my laptop would not resume from suspend
>>> to disk - it either froze or rebooted. With the patches applied the
>>> laptop does successfully resume and has been stable.
>>>
>>> Since Monday, I have been running a kernel with the patches (plus,
>>> from today, the patch you mailed yesterday) applied to 3.7rc7, without
>>> problems.
>>>
>>
>> I've been running 3.7-rc7 with the patches listed below for a week now
>> and it has been perfectly stable. In particular, my laptop will now
>> successfully resume from suspend to disk, which always failed without
>> the patches.
>>
>>   From Jiang Liu:
>> 1. [RFT PATCH v2 1/5] mm: introduce new field "managed_pages" to struct zone
>> 2. [RFT PATCH v1 2/5] mm: replace zone->present_pages with
>> zone->managed_pages if appreciated
>> 3. [RFT PATCH v1 3/5] mm: set zone->present_pages to number of existing
>> pages in the zone
>> 4. [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages
>> occupied by memmap
>> 5. [RFT PATCH v1 5/5] mm: increase totalram_pages when free pages
>> allocated by bootmem allocator
>>
>>   From Andrew Morton:
>> 6. mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap.patch
>>
>> Tested-by: Chris Clayton <chris2553@googlemail.com>
>
> Thanks.
>
> I have only two of these five patches queued for 3.8:
> mm-introduce-new-field-managed_pages-to-struct-zone.patch and
> mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap.patch.
> I don't recall what happened with the other three.
>

Gerry posted version 1 of the five patches on 18 November. Version 2 of 
the first and fourth patches was posted on 21 November.

BTW, the title I gave above for Andrew's patch that I applied is wrong. It 
should be 
mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap-fix.

Chris



^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2012-12-04 10:05 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-06  1:31 [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d Jiang Liu
2012-11-06  1:31 ` Jiang Liu
2012-11-06 10:23 ` Chris Clayton
2012-11-06 10:23   ` Chris Clayton
2012-11-06 20:43 ` Andrew Morton
2012-11-06 20:43   ` Andrew Morton
2012-11-14 14:52   ` Jiang Liu
2012-11-14 14:52     ` Jiang Liu
2012-11-15  9:22     ` Wen Congyang
2012-11-15  9:22       ` Wen Congyang
2012-11-15 11:28       ` Bob Liu
2012-11-15 11:28         ` Bob Liu
2012-11-15 14:23         ` Wen Congyang
2012-11-15 14:23           ` Wen Congyang
2012-11-15 15:40       ` Jiang Liu
2012-11-15 15:40         ` Jiang Liu
2012-11-15 21:41         ` David Rientjes
2012-11-15 21:41           ` David Rientjes
2012-11-15 19:24     ` Andrew Morton
2012-11-15 19:24       ` Andrew Morton
2012-11-15 21:17       ` Chris Clayton
2012-11-15 21:17         ` Chris Clayton
2012-11-15 21:27       ` David Rientjes
2012-11-15 21:27         ` David Rientjes
2012-11-18 16:07       ` [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages Jiang Liu
2012-11-18 16:07         ` Jiang Liu
2012-11-18 16:07         ` [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone Jiang Liu
2012-11-18 16:07           ` Jiang Liu
2012-11-19 23:38           ` Andrew Morton
2012-11-19 23:38             ` Andrew Morton
2012-11-20 14:56             ` Jiang Liu
2012-11-20 14:56               ` Jiang Liu
2012-11-20 19:31               ` Andrew Morton
2012-11-20 19:31                 ` Andrew Morton
2012-11-21 14:36                 ` Jiang Liu
2012-11-21 14:36                   ` Jiang Liu
2012-11-21 19:31                   ` Andrew Morton
2012-11-21 19:31                     ` Andrew Morton
2012-11-21 15:06                 ` [RFT PATCH v2 " Jiang Liu
2012-11-21 15:06                   ` Jiang Liu
2012-11-18 16:07         ` [RFT PATCH v1 2/5] mm: replace zone->present_pages with zone->managed_pages if appreciated Jiang Liu
2012-11-18 16:07           ` Jiang Liu
2012-11-18 16:07         ` [RFT PATCH v1 3/5] mm: set zone->present_pages to number of existing pages in the zone Jiang Liu
2012-11-18 16:07           ` Jiang Liu
2012-11-18 16:07         ` [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap Jiang Liu
2012-11-18 16:07           ` Jiang Liu
2012-11-19 23:42           ` Andrew Morton
2012-11-19 23:42             ` Andrew Morton
2012-11-20 15:18             ` Jiang Liu
2012-11-20 15:18               ` Jiang Liu
2012-11-20 19:19               ` Andrew Morton
2012-11-20 19:19                 ` Andrew Morton
2012-11-21 14:52                 ` Jiang Liu
2012-11-21 14:52                   ` Jiang Liu
2012-11-21 19:35                   ` Andrew Morton
2012-11-21 19:35                     ` Andrew Morton
2012-11-22 16:17                     ` Jiang Liu
2012-11-22 16:17                       ` Jiang Liu
2012-11-21 15:09                 ` [RFT PATCH v2 " Jiang Liu
2012-11-21 15:09                   ` Jiang Liu
2012-11-28 23:52                   ` Andrew Morton
2012-11-28 23:52                     ` Andrew Morton
2012-11-29  2:25                     ` Jianguo Wu
2012-11-29  2:25                       ` Jianguo Wu
2012-11-29 10:52                     ` Chris Clayton
2012-11-29 10:52                       ` Chris Clayton
2012-12-02 19:55                       ` Chris Clayton
2012-12-02 19:55                         ` Chris Clayton
2012-12-03  7:26                         ` Chris Clayton
2012-12-03  7:26                           ` Chris Clayton
2012-12-03 23:17                         ` Andrew Morton
2012-12-03 23:17                           ` Andrew Morton
2012-12-04  1:21                           ` Jiang Liu
2012-12-04  1:21                             ` Jiang Liu
2012-12-04 10:05                           ` Chris Clayton
2012-12-04 10:05                             ` Chris Clayton
2012-11-20  2:15           ` [RFT PATCH v1 " Jaegeuk Hanse
2012-11-20  2:15             ` Jaegeuk Hanse
2012-11-18 16:07         ` [RFT PATCH v1 5/5] mm: increase totalram_pages when free pages allocated by bootmem allocator Jiang Liu
2012-11-18 16:07           ` Jiang Liu
2012-11-18 20:36         ` [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages Chris Clayton
2012-11-18 20:36           ` Chris Clayton
2012-11-22  9:23           ` Chris Clayton
2012-11-22  9:23             ` Chris Clayton
2012-11-26  9:46             ` Chris Clayton
2012-11-26  9:46               ` Chris Clayton
2012-11-19 21:36         ` Maciej Rutecki
2012-11-19 21:36           ` Maciej Rutecki
2012-11-20 16:03           ` Jiang Liu
2012-11-20 16:03             ` Jiang Liu
2012-11-20  2:13         ` Jaegeuk Hanse
2012-11-20  2:13           ` Jaegeuk Hanse
2012-11-20  2:43           ` Jiang Liu
2012-11-20  2:43             ` Jiang Liu
2012-11-20  3:20             ` Jaegeuk Hanse
2012-11-20  3:20               ` Jaegeuk Hanse
2012-11-20  3:46               ` Jiang Liu
2012-11-20  3:46                 ` Jiang Liu
