* [PATCH v3] mm/page_alloc: fix freeing of MIGRATE_RESERVE migratetype pages
From: Bartlomiej Zolnierkiewicz @ 2014-03-06 17:35 UTC
  To: Mel Gorman
  Cc: Hugh Dickins, Marek Szyprowski, Yong-Taek Lee, linux-mm, linux-kernel

Pages allocated from MIGRATE_RESERVE migratetype pageblocks
are not freed back to MIGRATE_RESERVE migratetype free
lists in free_pcppages_bulk()->__free_one_page() if we got
to free_pcppages_bulk() through drain_[zone_]pages().
The freeing through free_hot_cold_page() is okay because
freepage migratetype is set to pageblock migratetype before
calling free_pcppages_bulk().  If pages of MIGRATE_RESERVE
migratetype end up on the free lists of another migratetype,
the whole Reserved pageblock may later be changed to that
migratetype in __rmqueue_fallback() and will never be changed
back to a Reserved pageblock.  Fix the issue by moving the
freepage migratetype setting from rmqueue_bulk() to
__rmqueue[_fallback]() and preserving the freepage migratetype
as the original pageblock migratetype for MIGRATE_RESERVE
pages.
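
To illustrate the mechanism, here is a toy user-space model (an
illustration only, not kernel code; the names are borrowed from the
kernel but everything else is simplified):

#include <stdio.h>

enum { MIGRATE_MOVABLE, MIGRATE_RESERVE, NR_MT };

struct page {
	int freepage_migratetype;	/* stamp trusted by the free path */
};

static int free_count[NR_MT];		/* pages on each free list */

/* Like the pre-fix rmqueue_bulk(): take a page from the Reserved list
 * as a fallback, but stamp it with the *requested* migratetype. */
static void allocate_fallback(struct page *page, int requested_mt)
{
	free_count[MIGRATE_RESERVE]--;
	page->freepage_migratetype = requested_mt;
}

/* Like free_pcppages_bulk()->__free_one_page(): file the page purely
 * by its stamp, not by where it actually came from. */
static void free_page_by_stamp(struct page *page)
{
	free_count[page->freepage_migratetype]++;
}

int main(void)
{
	struct page page;

	free_count[MIGRATE_RESERVE] = 1;
	allocate_fallback(&page, MIGRATE_MOVABLE);
	free_page_by_stamp(&page);

	/* prints "reserve: 0 movable: 1" -- the Reserved page is lost
	 * to the movable free list */
	printf("reserve: %d movable: %d\n",
	       free_count[MIGRATE_RESERVE], free_count[MIGRATE_MOVABLE]);
	return 0;
}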

The problem was introduced in v2.6.31 by commit ed0ae21
("page allocator: do not call get_pageblock_migratetype()
more than necessary").

Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
---
v2:
- updated patch description, there is no __zone_pcp_update()
  in newer kernels
v3:
- set freepage migratetype in __rmqueue[_fallback]()
  instead of rmqueue_bulk() (per Mel's request)

 mm/page_alloc.c |   27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c	2014-03-06 18:10:21.884422983 +0100
+++ b/mm/page_alloc.c	2014-03-06 18:10:27.016422895 +0100
@@ -1094,7 +1094,7 @@ __rmqueue_fallback(struct zone *zone, in
 	struct free_area *area;
 	int current_order;
 	struct page *page;
-	int migratetype, new_type, i;
+	int migratetype, new_type, mt = start_migratetype, i;
 
 	/* Find the largest possible block of pages in the other list */
 	for (current_order = MAX_ORDER-1; current_order >= order;
@@ -1125,6 +1125,14 @@ __rmqueue_fallback(struct zone *zone, in
 			expand(zone, page, order, current_order, area,
 			       new_type);
 
+			if (IS_ENABLED(CONFIG_CMA)) {
+				mt = get_pageblock_migratetype(page);
+				if (!is_migrate_cma(mt) &&
+				    !is_migrate_isolate(mt))
+					mt = start_migratetype;
+			}
+			set_freepage_migratetype(page, mt);
+
 			trace_mm_page_alloc_extfrag(page, order, current_order,
 				start_migratetype, migratetype, new_type);
 
@@ -1147,7 +1155,9 @@ static struct page *__rmqueue(struct zon
 retry_reserve:
 	page = __rmqueue_smallest(zone, order, migratetype);
 
-	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
+	if (likely(page)) {
+		set_freepage_migratetype(page, migratetype);
+	} else if (migratetype != MIGRATE_RESERVE) {
 		page = __rmqueue_fallback(zone, order, migratetype);
 
 		/*
@@ -1174,7 +1184,7 @@ static int rmqueue_bulk(struct zone *zon
 			unsigned long count, struct list_head *list,
 			int migratetype, int cold)
 {
-	int mt = migratetype, i;
+	int i;
 
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
@@ -1195,16 +1205,15 @@ static int rmqueue_bulk(struct zone *zon
 			list_add(&page->lru, list);
 		else
 			list_add_tail(&page->lru, list);
+		list = &page->lru;
 		if (IS_ENABLED(CONFIG_CMA)) {
-			mt = get_pageblock_migratetype(page);
+			int mt = get_pageblock_migratetype(page);
 			if (!is_migrate_cma(mt) && !is_migrate_isolate(mt))
 				mt = migratetype;
+			if (is_migrate_cma(mt))
+				__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
+						      -(1 << order));
 		}
-		set_freepage_migratetype(page, mt);
-		list = &page->lru;
-		if (is_migrate_cma(mt))
-			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-					      -(1 << order));
 	}
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 	spin_unlock(&zone->lock);



* Re: [PATCH v3] mm/page_alloc: fix freeing of MIGRATE_RESERVE migratetype pages
From: Vlastimil Babka @ 2014-03-21 14:16 UTC
  To: Bartlomiej Zolnierkiewicz, Mel Gorman
  Cc: Hugh Dickins, Marek Szyprowski, Yong-Taek Lee, linux-mm, linux-kernel

On 03/06/2014 06:35 PM, Bartlomiej Zolnierkiewicz wrote:
> Pages allocated from MIGRATE_RESERVE migratetype pageblocks
> are not freed back to MIGRATE_RESERVE migratetype free
> lists in free_pcppages_bulk()->__free_one_page() if we got
> to free_pcppages_bulk() through drain_[zone_]pages().
> The freeing through free_hot_cold_page() is okay because
> freepage migratetype is set to pageblock migratetype before
> calling free_pcppages_bulk().

I think this is somewhat misleading and got me confused for a while.
It's not about the call path of free_pcppages_bulk(), but about the
fact that rmqueue_bulk() has been called at some point to fill up the
pcp lists, and had to resort to __rmqueue_fallback().  So, going through
free_hot_cold_page() might give you the correct migratetype for the last
page freed, but the pcp lists may still contain misplaced pages from an
earlier rmqueue_bulk() call.
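
For reference, the relevant part of free_hot_cold_page() (a trimmed
paraphrase of the sources from this era, so details may differ between
releases):

void free_hot_cold_page(struct page *page, int cold)
{
	int migratetype;

	if (!free_pages_prepare(page, 0))
		return;

	/* the page being freed right now gets a fresh, correct stamp */
	migratetype = get_pageblock_migratetype(page);
	set_freepage_migratetype(page, migratetype);

	/* rest elided: the page goes onto a pcp list, and
	 * free_pcppages_bulk() later files every pcp page by its
	 * stamp -- including stale stamps left by rmqueue_bulk() */
}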

> If pages of MIGRATE_RESERVE
> migratetype end up on the free lists of another migratetype,
> the whole Reserved pageblock may later be changed to that
> migratetype in __rmqueue_fallback() and will never be changed
> back to a Reserved pageblock.  Fix the issue by moving the
> freepage migratetype setting from rmqueue_bulk() to
> __rmqueue[_fallback]() and preserving the freepage migratetype
> as the original pageblock migratetype for MIGRATE_RESERVE
> pages.

Actually, wouldn't the easiest solution to this particular problem be
to check the current pageblock migratetype in try_to_steal_freepages()
and disallow changing it?  However, I agree that preventing pages from
being misplaced in the first place would be even better.

> The problem was introduced in v2.6.31 by commit ed0ae21
> ("page allocator: do not call get_pageblock_migratetype()
> more than necessary").
>
> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> ---
> v2:
> - updated patch description, there is no __zone_pcp_update()
>    in newer kernels
> v3:
> - set freepage migratetype in __rmqueue[_fallback]()
>    instead of rmqueue_bulk() (per Mel's request)
>
>   mm/page_alloc.c |   27 ++++++++++++++++++---------
>   1 file changed, 18 insertions(+), 9 deletions(-)
>
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c	2014-03-06 18:10:21.884422983 +0100
> +++ b/mm/page_alloc.c	2014-03-06 18:10:27.016422895 +0100
> @@ -1094,7 +1094,7 @@ __rmqueue_fallback(struct zone *zone, in
>   	struct free_area *area;
>   	int current_order;
>   	struct page *page;
> -	int migratetype, new_type, i;
> +	int migratetype, new_type, mt = start_migratetype, i;

Better naming would help; "mt" and "migratetype" are the same thing
and it gets too confusing.

>
>   	/* Find the largest possible block of pages in the other list */
>   	for (current_order = MAX_ORDER-1; current_order >= order;
> @@ -1125,6 +1125,14 @@ __rmqueue_fallback(struct zone *zone, in
>   			expand(zone, page, order, current_order, area,
>   			       new_type);
>
> +			if (IS_ENABLED(CONFIG_CMA)) {
> +				mt = get_pageblock_migratetype(page);
> +				if (!is_migrate_cma(mt) &&
> +				    !is_migrate_isolate(mt))
> +					mt = start_migratetype;
> +			}
> +			set_freepage_migratetype(page, mt);
> +
>   			trace_mm_page_alloc_extfrag(page, order, current_order,
>   				start_migratetype, migratetype, new_type);
>
> @@ -1147,7 +1155,9 @@ static struct page *__rmqueue(struct zon
>   retry_reserve:
>   	page = __rmqueue_smallest(zone, order, migratetype);
>
> -	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
> +	if (likely(page)) {
> +		set_freepage_migratetype(page, migratetype);

Are you sure that the checking of CMA and ISOLATE is not needed here?
Did the original rmqueue_bulk() have this checking only for the 
__rmqueue_fallback() case? Why wouldn't the check already be only in 
__rmqueue_fallback() then?

> +	} else if (migratetype != MIGRATE_RESERVE) {
>   		page = __rmqueue_fallback(zone, order, migratetype);
>
>   		/*
> @@ -1174,7 +1184,7 @@ static int rmqueue_bulk(struct zone *zon
>   			unsigned long count, struct list_head *list,
>   			int migratetype, int cold)
>   {
> -	int mt = migratetype, i;
> +	int i;
>
>   	spin_lock(&zone->lock);
>   	for (i = 0; i < count; ++i) {
> @@ -1195,16 +1205,15 @@ static int rmqueue_bulk(struct zone *zon
>   			list_add(&page->lru, list);
>   		else
>   			list_add_tail(&page->lru, list);
> +		list = &page->lru;
>   		if (IS_ENABLED(CONFIG_CMA)) {
> -			mt = get_pageblock_migratetype(page);
> +			int mt = get_pageblock_migratetype(page);
>   			if (!is_migrate_cma(mt) && !is_migrate_isolate(mt))
>   				mt = migratetype;
> +			if (is_migrate_cma(mt))
> +				__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> +						      -(1 << order));
>   		}
> -		set_freepage_migratetype(page, mt);
> -		list = &page->lru;
> -		if (is_migrate_cma(mt))
> -			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> -					      -(1 << order));
>   	}
>   	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>   	spin_unlock(&zone->lock);
>

* Re: [PATCH v3] mm/page_alloc: fix freeing of MIGRATE_RESERVE migratetype pages
From: Bartlomiej Zolnierkiewicz @ 2014-03-25 13:47 UTC
  To: Vlastimil Babka
  Cc: Mel Gorman, Hugh Dickins, Marek Szyprowski, Yong-Taek Lee,
	linux-mm, linux-kernel


Hi,

On Friday, March 21, 2014 03:16:31 PM Vlastimil Babka wrote:
> On 03/06/2014 06:35 PM, Bartlomiej Zolnierkiewicz wrote:
> > Pages allocated from MIGRATE_RESERVE migratetype pageblocks
> > are not freed back to MIGRATE_RESERVE migratetype free
> > lists in free_pcppages_bulk()->__free_one_page() if we got
> > to free_pcppages_bulk() through drain_[zone_]pages().
> > The freeing through free_hot_cold_page() is okay because
> > freepage migratetype is set to pageblock migratetype before
> > calling free_pcppages_bulk().
> 
> I think this is somewhat misleading and got me confused for a while.
> It's not about the call path of free_pcppages_bulk(), but about the
> fact that rmqueue_bulk() has been called at some point to fill up the
> pcp lists, and had to resort to __rmqueue_fallback().  So, going through
> free_hot_cold_page() might give you the correct migratetype for the last
> page freed, but the pcp lists may still contain misplaced pages from an
> earlier rmqueue_bulk() call.

Ok, you're right.  I'll fix this.

> > If pages of MIGRATE_RESERVE
> > migratetype end up on the free lists of another migratetype,
> > the whole Reserved pageblock may later be changed to that
> > migratetype in __rmqueue_fallback() and will never be changed
> > back to a Reserved pageblock.  Fix the issue by moving the
> > freepage migratetype setting from rmqueue_bulk() to
> > __rmqueue[_fallback]() and preserving the freepage migratetype
> > as the original pageblock migratetype for MIGRATE_RESERVE
> > pages.
> 
> Actually, wouldn't the easiest solution to this particular problem be
> to check the current pageblock migratetype in try_to_steal_freepages()
> and disallow changing it?  However, I agree that preventing pages from
> being misplaced in the first place would be even better.
> 
> > The problem was introduced in v2.6.31 by commit ed0ae21
> > ("page allocator: do not call get_pageblock_migratetype()
> > more than necessary").
> >
> > Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
> > Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Hugh Dickins <hughd@google.com>
> > ---
> > v2:
> > - updated patch description, there is no __zone_pcp_update()
> >    in newer kernels
> > v3:
> > - set freepage migratetype in __rmqueue[_fallback]()
> >    instead of rmqueue_bulk() (per Mel's request)
> >
> >   mm/page_alloc.c |   27 ++++++++++++++++++---------
> >   1 file changed, 18 insertions(+), 9 deletions(-)
> >
> > Index: b/mm/page_alloc.c
> > ===================================================================
> > --- a/mm/page_alloc.c	2014-03-06 18:10:21.884422983 +0100
> > +++ b/mm/page_alloc.c	2014-03-06 18:10:27.016422895 +0100
> > @@ -1094,7 +1094,7 @@ __rmqueue_fallback(struct zone *zone, in
> >   	struct free_area *area;
> >   	int current_order;
> >   	struct page *page;
> > -	int migratetype, new_type, i;
> > +	int migratetype, new_type, mt = start_migratetype, i;
> 
> Better naming would help; "mt" and "migratetype" are the same thing
> and it gets too confusing.

Well, yes, though 'mt' is short and the check code is consistent with
the corresponding code in rmqueue_bulk().

Do you have a proposal for a better name for this variable?

> >
> >   	/* Find the largest possible block of pages in the other list */
> >   	for (current_order = MAX_ORDER-1; current_order >= order;
> > @@ -1125,6 +1125,14 @@ __rmqueue_fallback(struct zone *zone, in
> >   			expand(zone, page, order, current_order, area,
> >   			       new_type);
> >
> > +			if (IS_ENABLED(CONFIG_CMA)) {
> > +				mt = get_pageblock_migratetype(page);
> > +				if (!is_migrate_cma(mt) &&
> > +				    !is_migrate_isolate(mt))
> > +					mt = start_migratetype;
> > +			}
> > +			set_freepage_migratetype(page, mt);
> > +
> >   			trace_mm_page_alloc_extfrag(page, order, current_order,
> >   				start_migratetype, migratetype, new_type);
> >
> > @@ -1147,7 +1155,9 @@ static struct page *__rmqueue(struct zon
> >   retry_reserve:
> >   	page = __rmqueue_smallest(zone, order, migratetype);
> >
> > -	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
> > +	if (likely(page)) {
> > +		set_freepage_migratetype(page, migratetype);
> 
> Are you sure that the checking of CMA and ISOLATE is not needed here?

CMA and ISOLATE migratetype pages are always put back on the correct
free lists (since set_freepage_migratetype() sets the freepage
migratetype to the original one for CMA and ISOLATE migratetype pages)
and __rmqueue_smallest() can take a page only from the 'migratetype'
free list.
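
For reference, a trimmed paraphrase of __rmqueue_smallest() from this
era's mm/page_alloc.c (details may differ slightly between releases);
the page can only come from free_list[migratetype]:

static inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
						int migratetype)
{
	unsigned int current_order;
	struct free_area *area;
	struct page *page;

	/* Find a page of the appropriate size in the preferred list */
	for (current_order = order; current_order < MAX_ORDER;
	     ++current_order) {
		area = &(zone->free_area[current_order]);
		if (list_empty(&area->free_list[migratetype]))
			continue;

		page = list_entry(area->free_list[migratetype].next,
							struct page, lru);
		list_del(&page->lru);
		rmv_page_order(page);
		area->nr_free--;
		expand(zone, page, order, current_order, area, migratetype);
		return page;
	}

	return NULL;
}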

+ It was suggested to do it this way by Mel.

> Did the original rmqueue_bulk() have this checking only for the 
> __rmqueue_fallback() case? Why wouldn't the check already be only in 
> __rmqueue_fallback() then?

Probably for historical reasons.  rmqueue_bulk() already contained
the set_page_private() call when CMA was introduced, and commit
47118af ("mm: mmzone: MIGRATE_CMA migration type added") added the
special handling for CMA and ISOLATE migratetype pages there.

> > +	} else if (migratetype != MIGRATE_RESERVE) {
> >   		page = __rmqueue_fallback(zone, order, migratetype);
> >
> >   		/*
> > @@ -1174,7 +1184,7 @@ static int rmqueue_bulk(struct zone *zon
> >   			unsigned long count, struct list_head *list,
> >   			int migratetype, int cold)
> >   {
> > -	int mt = migratetype, i;
> > +	int i;
> >
> >   	spin_lock(&zone->lock);
> >   	for (i = 0; i < count; ++i) {
> > @@ -1195,16 +1205,15 @@ static int rmqueue_bulk(struct zone *zon
> >   			list_add(&page->lru, list);
> >   		else
> >   			list_add_tail(&page->lru, list);
> > +		list = &page->lru;
> >   		if (IS_ENABLED(CONFIG_CMA)) {
> > -			mt = get_pageblock_migratetype(page);
> > +			int mt = get_pageblock_migratetype(page);
> >   			if (!is_migrate_cma(mt) && !is_migrate_isolate(mt))
> >   				mt = migratetype;
> > +			if (is_migrate_cma(mt))
> > +				__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> > +						      -(1 << order));
> >   		}
> > -		set_freepage_migratetype(page, mt);
> > -		list = &page->lru;
> > -		if (is_migrate_cma(mt))
> > -			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> > -					      -(1 << order));
> >   	}
> >   	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> >   	spin_unlock(&zone->lock);

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics



* Re: [PATCH v3] mm/page_alloc: fix freeing of MIGRATE_RESERVE migratetype pages
From: Vlastimil Babka @ 2014-04-03 15:36 UTC
  To: Bartlomiej Zolnierkiewicz
  Cc: Mel Gorman, Hugh Dickins, Marek Szyprowski, Yong-Taek Lee,
	linux-mm, linux-kernel, Joonsoo Kim

On 03/25/2014 02:47 PM, Bartlomiej Zolnierkiewicz wrote:
>
> Hi,
>
> On Friday, March 21, 2014 03:16:31 PM Vlastimil Babka wrote:
>> On 03/06/2014 06:35 PM, Bartlomiej Zolnierkiewicz wrote:
>>> Pages allocated from MIGRATE_RESERVE migratetype pageblocks
>>> are not freed back to MIGRATE_RESERVE migratetype free
>>> lists in free_pcppages_bulk()->__free_one_page() if we got
>>> to free_pcppages_bulk() through drain_[zone_]pages().
>>> The freeing through free_hot_cold_page() is okay because
>>> freepage migratetype is set to pageblock migratetype before
>>> calling free_pcppages_bulk().
>>
>> I think this is somewhat misleading and got me confused for a while.
>> It's not about the call path of free_pcppages_bulk(), but about the
>> fact that rmqueue_bulk() has been called at some point to fill up the
>> pcp lists, and had to resort to __rmqueue_fallback().  So, going through
>> free_hot_cold_page() might give you the correct migratetype for the last
>> page freed, but the pcp lists may still contain misplaced pages from an
>> earlier rmqueue_bulk() call.
>
> Ok, you're right.  I'll fix this.
>
>>> If pages of MIGRATE_RESERVE
>>> migratetype end up on the free lists of another migratetype,
>>> the whole Reserved pageblock may later be changed to that
>>> migratetype in __rmqueue_fallback() and will never be changed
>>> back to a Reserved pageblock.  Fix the issue by moving the
>>> freepage migratetype setting from rmqueue_bulk() to
>>> __rmqueue[_fallback]() and preserving the freepage migratetype
>>> as the original pageblock migratetype for MIGRATE_RESERVE
>>> pages.
>>
>> Actually, wouldn't the easiest solution to this particular problem be
>> to check the current pageblock migratetype in try_to_steal_freepages()
>> and disallow changing it?  However, I agree that preventing pages from
>> being misplaced in the first place would be even better.
>>
>>> The problem was introduced in v2.6.31 by commit ed0ae21
>>> ("page allocator: do not call get_pageblock_migratetype()
>>> more than necessary").
>>>
>>> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
>>> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>>> Cc: Mel Gorman <mgorman@suse.de>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> ---
>>> v2:
>>> - updated patch description, there is no __zone_pcp_update()
>>>     in newer kernels
>>> v3:
>>> - set freepage migratetype in __rmqueue[_fallback]()
>>>     instead of rmqueue_bulk() (per Mel's request)
>>>
>>>    mm/page_alloc.c |   27 ++++++++++++++++++---------
>>>    1 file changed, 18 insertions(+), 9 deletions(-)
>>>
>>> Index: b/mm/page_alloc.c
>>> ===================================================================
>>> --- a/mm/page_alloc.c	2014-03-06 18:10:21.884422983 +0100
>>> +++ b/mm/page_alloc.c	2014-03-06 18:10:27.016422895 +0100
>>> @@ -1094,7 +1094,7 @@ __rmqueue_fallback(struct zone *zone, in
>>>    	struct free_area *area;
>>>    	int current_order;
>>>    	struct page *page;
>>> -	int migratetype, new_type, i;
>>> +	int migratetype, new_type, mt = start_migratetype, i;
>>
>> Better naming would help; "mt" and "migratetype" are the same thing
>> and it gets too confusing.
>
> Well, yes, though 'mt' is short and the check code is consistent with
> the corresponding code in rmqueue_bulk().
>
> Do you have a proposal for a better name for this variable?
>
>>>
>>>    	/* Find the largest possible block of pages in the other list */
>>>    	for (current_order = MAX_ORDER-1; current_order >= order;
>>> @@ -1125,6 +1125,14 @@ __rmqueue_fallback(struct zone *zone, in
>>>    			expand(zone, page, order, current_order, area,
>>>    			       new_type);
>>>
>>> +			if (IS_ENABLED(CONFIG_CMA)) {
>>> +				mt = get_pageblock_migratetype(page);
>>> +				if (!is_migrate_cma(mt) &&
>>> +				    !is_migrate_isolate(mt))
>>> +					mt = start_migratetype;
>>> +			}
>>> +			set_freepage_migratetype(page, mt);
>>> +
>>>    			trace_mm_page_alloc_extfrag(page, order, current_order,
>>>    				start_migratetype, migratetype, new_type);
>>>
>>> @@ -1147,7 +1155,9 @@ static struct page *__rmqueue(struct zon
>>>    retry_reserve:
>>>    	page = __rmqueue_smallest(zone, order, migratetype);
>>>
>>> -	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
>>> +	if (likely(page)) {
>>> +		set_freepage_migratetype(page, migratetype);
>>
>> Are you sure that the checking of CMA and ISOLATE is not needed here?
>
> CMA and ISOLATE migratetype pages are always put back on the correct
> free lists (since set_freepage_migratetype() sets the freepage
> migratetype to the original one for CMA and ISOLATE migratetype pages)
> and __rmqueue_smallest() can take a page only from the 'migratetype'
> free list.

Actually, this is true also for the __rmqueue_fallback() case, so we can
do without get_pageblock_migratetype() completely.  In fact, Joonsoo
already posted such a patch in "[PATCH 3/7] mm/page_alloc: move
set_freepage_migratetype() to better place", see:
http://lkml.org/lkml/2014/1/9/33

I've updated and improved this and will send it shortly, along with some
DEBUG_VM checks that make it easier to test that this is indeed the
case.  Testing from the CMA people is welcome.

Vlastimil

> + It was suggested to do it this way by Mel.
>
>> Did the original rmqueue_bulk() have this checking only for the
>> __rmqueue_fallback() case? Why wouldn't the check already be only in
>> __rmqueue_fallback() then?
>
> Probably for historical reasons.  rmqueue_bulk() already contained
> the set_page_private() call when CMA was introduced, and commit
> 47118af ("mm: mmzone: MIGRATE_CMA migration type added") added the
> special handling for CMA and ISOLATE migratetype pages there.
>
>>> +	} else if (migratetype != MIGRATE_RESERVE) {
>>>    		page = __rmqueue_fallback(zone, order, migratetype);
>>>
>>>    		/*
>>> @@ -1174,7 +1184,7 @@ static int rmqueue_bulk(struct zone *zon
>>>    			unsigned long count, struct list_head *list,
>>>    			int migratetype, int cold)
>>>    {
>>> -	int mt = migratetype, i;
>>> +	int i;
>>>
>>>    	spin_lock(&zone->lock);
>>>    	for (i = 0; i < count; ++i) {
>>> @@ -1195,16 +1205,15 @@ static int rmqueue_bulk(struct zone *zon
>>>    			list_add(&page->lru, list);
>>>    		else
>>>    			list_add_tail(&page->lru, list);
>>> +		list = &page->lru;
>>>    		if (IS_ENABLED(CONFIG_CMA)) {
>>> -			mt = get_pageblock_migratetype(page);
>>> +			int mt = get_pageblock_migratetype(page);
>>>    			if (!is_migrate_cma(mt) && !is_migrate_isolate(mt))
>>>    				mt = migratetype;
>>> +			if (is_migrate_cma(mt))
>>> +				__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>>> +						      -(1 << order));
>>>    		}
>>> -		set_freepage_migratetype(page, mt);
>>> -		list = &page->lru;
>>> -		if (is_migrate_cma(mt))
>>> -			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>>> -					      -(1 << order));
>>>    	}
>>>    	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>>    	spin_unlock(&zone->lock);
>
> Best regards,
> --
> Bartlomiej Zolnierkiewicz
> Samsung R&D Institute Poland
> Samsung Electronics
>

* [PATCH 1/2] mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced
From: Vlastimil Babka @ 2014-04-03 15:40 UTC
  To: Andrew Morton, Joonsoo Kim, Bartlomiej Zolnierkiewicz
  Cc: linux-kernel, linux-mm, Mel Gorman, Yong-Taek Lee,
	Vlastimil Babka, Minchan Kim, KOSAKI Motohiro, Marek Szyprowski,
	Hugh Dickins, Rik van Riel, Michal Nazarewicz

For MIGRATE_RESERVE pages, it is important that they do not get misplaced
on the free_list of another migratetype; otherwise the whole MIGRATE_RESERVE
pageblock might be changed to another migratetype in try_to_steal_freepages().

Currently, however, it is possible for this to happen when a MIGRATE_RESERVE
page is allocated onto a pcplist through rmqueue_bulk() as a fallback for
another desired migratetype, and then later freed back through
free_pcppages_bulk() without actually being used.  This happens because
free_pcppages_bulk() uses get_freepage_migratetype() to choose the free_list,
while rmqueue_bulk() calls set_freepage_migratetype() with the *desired*
migratetype and not the page's original MIGRATE_RESERVE migratetype.

This patch fixes the problem by moving the call to set_freepage_migratetype()
from rmqueue_bulk() down to __rmqueue_smallest() and __rmqueue_fallback(),
where the page's actual migratetype (i.e. the free_list the page is taken
from) is used.  Note that this migratetype might differ from the pageblock's
migratetype due to freepage stealing decisions.  This is OK, as page stealing
never uses MIGRATE_RESERVE as a fallback, and also takes care to leave all
MIGRATE_CMA pages on the correct freelist.

Therefore, as an additional benefit, the call to get_pageblock_migratetype()
from rmqueue_bulk() when CMA is enabled can be removed completely.  This
relies on the fact that MIGRATE_CMA pageblocks are created only during
system init, and on the above.  The related is_migrate_isolate() check is
also unnecessary, as memory isolation has other ways to move pages between
freelists, and drains the pcp lists containing pages that should be isolated.
buffered_rmqueue() can also benefit from calling get_freepage_migratetype()
instead of get_pageblock_migratetype().

A separate patch will add VM_BUG_ON checks for the invariant that, for
MIGRATE_RESERVE and MIGRATE_CMA pageblocks, the freepage_migratetype must
equal the pageblock_migratetype so that these pages always go to the
correct free_list.
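
A hypothetical sketch of such a check, only to illustrate the invariant
(the actual follow-up patch may look different):

/*
 * Hypothetical illustration, not taken from the follow-up patch:
 * for MIGRATE_RESERVE and MIGRATE_CMA pageblocks the freepage
 * migratetype stamped on the page must match the pageblock, so the
 * page is always filed back onto the correct free_list.
 */
static inline void check_freepage_migratetype(struct page *page)
{
	int block_mt = get_pageblock_migratetype(page);
	int free_mt = get_freepage_migratetype(page);

	if (block_mt == MIGRATE_RESERVE || is_migrate_cma(block_mt))
		VM_BUG_ON(free_mt != block_mt);
}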

Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bac76a..2dbaba1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -930,6 +930,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
+		set_freepage_migratetype(page, migratetype);
 		return page;
 	}
 
@@ -1056,7 +1057,9 @@ static int try_to_steal_freepages(struct zone *zone, struct page *page,
 
 	/*
 	 * When borrowing from MIGRATE_CMA, we need to release the excess
-	 * buddy pages to CMA itself.
+	 * buddy pages to CMA itself. We also ensure the freepage_migratetype
+	 * is set to CMA so it is returned to the correct freelist in case
+	 * the page ends up being not actually allocated from the pcp lists.
 	 */
 	if (is_migrate_cma(fallback_type))
 		return fallback_type;
@@ -1124,6 +1127,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 
 			expand(zone, page, order, current_order, area,
 			       new_type);
+			/* The freepage_migratetype may differ from pageblock's
+			 * migratetype depending on the decisions in
+			 * try_to_steal_freepages. This is OK as long as it does
+			 * not differ for MIGRATE_CMA type.
+			 */
+			set_freepage_migratetype(page, new_type);
 
 			trace_mm_page_alloc_extfrag(page, order, current_order,
 				start_migratetype, migratetype, new_type);
@@ -1174,7 +1183,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
 			int migratetype, int cold)
 {
-	int mt = migratetype, i;
+	int i;
 
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
@@ -1195,14 +1204,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			list_add(&page->lru, list);
 		else
 			list_add_tail(&page->lru, list);
-		if (IS_ENABLED(CONFIG_CMA)) {
-			mt = get_pageblock_migratetype(page);
-			if (!is_migrate_cma(mt) && !is_migrate_isolate(mt))
-				mt = migratetype;
-		}
-		set_freepage_migratetype(page, mt);
 		list = &page->lru;
-		if (is_migrate_cma(mt))
+		if (is_migrate_cma(get_freepage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
 	}
@@ -1580,7 +1583,7 @@ again:
 		if (!page)
 			goto failed;
 		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_pageblock_migratetype(page));
+					  get_freepage_migratetype(page));
 	}
 
 	/*
-- 
1.8.4.5


* [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-04-03 15:40         ` Vlastimil Babka
@ 2014-04-03 15:40           ` Vlastimil Babka
  -1 siblings, 0 replies; 52+ messages in thread
From: Vlastimil Babka @ 2014-04-03 15:40 UTC (permalink / raw)
  To: Andrew Morton, Joonsoo Kim, Bartlomiej Zolnierkiewicz
  Cc: linux-kernel, linux-mm, Mel Gorman, Yong-Taek Lee,
	Vlastimil Babka, Minchan Kim, KOSAKI Motohiro, Marek Szyprowski,
	Hugh Dickins, Rik van Riel, Michal Nazarewicz

For the MIGRATE_RESERVE pages, it is important that they do not get misplaced
on a free_list of another migratetype, otherwise the whole MIGRATE_RESERVE
pageblock might be changed to another migratetype in try_to_steal_freepages().
For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
they could get allocated as unmovable and result in CMA failure.

This is ensured by setting the freepage_migratetype appropriately when placing
pages on pcp lists, and using that information when releasing them back to a
free_list. It is also assumed that CMA and RESERVE pageblocks are created only
during the init phase. This patch adds DEBUG_VM checks to catch any regressions
that would break this invariant.
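
As a self-contained illustration of the invariant (userspace model only,
not kernel code; the struct, alloc_to_pcp_*() and free_from_pcp() are
invented for this sketch, while the functions named in the comments are
the real ones discussed above), consider how recording the source
free_list keeps a RESERVE page on its own list:

#include <stdio.h>

enum { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,
       MIGRATE_RESERVE };

/* A "page" remembers the free_list it should return to. */
struct page { int pageblock_mt; int freepage_mt; };

/* Old rmqueue_bulk() behaviour: record the *desired* migratetype. */
static void alloc_to_pcp_old(struct page *page, int desired_mt)
{
	page->freepage_mt = desired_mt;
}

/* Fixed behaviour: __rmqueue_smallest()/__rmqueue_fallback() record the
 * free_list the page was actually taken from. */
static void alloc_to_pcp_new(struct page *page, int actual_mt)
{
	page->freepage_mt = actual_mt;
}

/* free_pcppages_bulk() trusts freepage_mt when handing the page back. */
static int free_from_pcp(const struct page *page)
{
	return page->freepage_mt;
}

int main(void)
{
	/* A page from a RESERVE pageblock is handed out as a fallback
	 * for an UNMOVABLE request, then freed from the pcp unused. */
	struct page page = { .pageblock_mt = MIGRATE_RESERVE };

	alloc_to_pcp_old(&page, MIGRATE_UNMOVABLE);
	printf("old: freed to free_list %d (misplaced)\n",
	       free_from_pcp(&page));

	alloc_to_pcp_new(&page, MIGRATE_RESERVE);
	printf("new: freed to free_list %d (matches pageblock %d)\n",
	       free_from_pcp(&page), page.pageblock_mt);
	return 0;
}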

Cc: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 19 +++++++++++++++++++
 mm/page_alloc.c    |  3 +++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1b7414..27a74ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -280,6 +280,25 @@ static inline int get_freepage_migratetype(struct page *page)
 }
 
 /*
+ * Check that a freepage cannot end up on a wrong free_list for "sensitive"
+ * migratetypes. Return false if it could. Useful for VM_BUG_ON checks.
+ */
+static inline bool check_freepage_migratetype(struct page *page)
+{
+	int pageblock_mt = get_pageblock_migratetype(page);
+	int freepage_mt = get_freepage_migratetype(page);
+
+	/*
+	 * For RESERVE and CMA pageblocks, the freepage_migratetype must
+	 * match their migratetype. For other pageblocks, we don't care.
+	 */
+	if (pageblock_mt != MIGRATE_RESERVE && !is_migrate_cma(pageblock_mt))
+		return true;
+
+	return (freepage_mt == pageblock_mt);
+}
+
+/*
  * FIXME: take this include out, include page-flags.h in
  * files which need it (119 of them)
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2dbaba1..0ee9f8c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,6 +697,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
+
+			VM_BUG_ON(!check_freepage_migratetype(page));
 			mt = get_freepage_migratetype(page);
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, zone, 0, mt);
@@ -1190,6 +1192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
 			break;
+		VM_BUG_ON(!check_freepage_migratetype(page));
 
 		/*
 		 * Split buddy pages returned by expand() are received here
-- 
1.8.4.5



* Re: [PATCH 1/2] mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced
  2014-04-03 15:40         ` Vlastimil Babka
@ 2014-04-16  0:56           ` Joonsoo Kim
  -1 siblings, 0 replies; 52+ messages in thread
From: Joonsoo Kim @ 2014-04-16  0:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm,
	Mel Gorman, Yong-Taek Lee, Minchan Kim, KOSAKI Motohiro,
	Marek Szyprowski, Hugh Dickins, Rik van Riel, Michal Nazarewicz

On Thu, Apr 03, 2014 at 05:40:17PM +0200, Vlastimil Babka wrote:
> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
> pageblock might be changed to other migratetype in try_to_steal_freepages().
> 
> Currently, it is however possible for this to happen when MIGRATE_RESERVE
> page is allocated on pcplist through rmqueue_bulk() as a fallback for other
> desired migratetype, and then later freed back through free_pcppages_bulk()
> without being actually used. This happens because free_pcppages_bulk() uses
> get_freepage_migratetype() to choose the free_list, and rmqueue_bulk() calls
> set_freepage_migratetype() with the *desired* migratetype and not the page's
> original MIGRATE_RESERVE migratetype.
> 
> This patch fixes the problem by moving the call to set_freepage_migratetype()
> from rmqueue_bulk() down to __rmqueue_smallest() and __rmqueue_fallback() where
> the actual page's migratetype (e.g. from which free_list the page is taken
> from) is used. Note that this migratetype might be different from the
> pageblock's migratetype due to freepage stealing decisions. This is OK, as page
> stealing never uses MIGRATE_RESERVE as a fallback, and also takes care to leave
> all MIGRATE_CMA pages on the correct freelist.
> 
> Therefore, as an additional benefit, the call to get_pageblock_migratetype()
> from rmqueue_bulk() when CMA is enabled, can be removed completely. This relies
> on the fact that MIGRATE_CMA pageblocks are created only during system init,
> and the above. The related is_migrate_isolate() check is also unnecessary, as
> memory isolation has other ways to move pages between freelists, and drain
> pcp lists containing pages that should be isolated.
> The buffered_rmqueue() can also benefit from calling get_freepage_migratetype()
> instead of get_pageblock_migratetype().
> 
> A separate patch will add VM_BUG_ON checks for the invariant that for
> MIGRATE_RESERVE and MIGRATE_CMA pageblocks, freepage_migratetype must equal to
> pageblock_migratetype so that these pages always go to the correct free_list.
> 
> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
> Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Suggested-by: Mel Gorman <mgorman@suse.de>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Looks good to me.

Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>


* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-04-03 15:40           ` Vlastimil Babka
@ 2014-04-16  1:09             ` Joonsoo Kim
  -1 siblings, 0 replies; 52+ messages in thread
From: Joonsoo Kim @ 2014-04-16  1:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm,
	Mel Gorman, Yong-Taek Lee, Minchan Kim, KOSAKI Motohiro,
	Marek Szyprowski, Hugh Dickins, Rik van Riel, Michal Nazarewicz

On Thu, Apr 03, 2014 at 05:40:18PM +0200, Vlastimil Babka wrote:
> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
> pageblock might be changed to other migratetype in try_to_steal_freepages().
> For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
> they could get allocated as unmovable and result in CMA failure.
> 
> This is ensured by setting the freepage_migratetype appropriately when placing
> pages on pcp lists, and using the information when releasing them back to
> free_list. It is also assumed that CMA and RESERVE pageblocks are created only
> in the init phase. This patch adds DEBUG_VM checks to catch any regressions
> introduced for this invariant.

Hello, Vlastimil.

The idea looks good to me.

> 
> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>


* Re: [PATCH 1/2] mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced
  2014-04-03 15:40         ` Vlastimil Babka
@ 2014-04-17 23:29           ` Minchan Kim
  -1 siblings, 0 replies; 52+ messages in thread
From: Minchan Kim @ 2014-04-17 23:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Joonsoo Kim, Bartlomiej Zolnierkiewicz,
	linux-kernel, linux-mm, Mel Gorman, Yong-Taek Lee,
	KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins, Rik van Riel,
	Michal Nazarewicz

On Thu, Apr 03, 2014 at 05:40:17PM +0200, Vlastimil Babka wrote:
> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
> pageblock might be changed to other migratetype in try_to_steal_freepages().
> 
> Currently, it is however possible for this to happen when MIGRATE_RESERVE
> page is allocated on pcplist through rmqueue_bulk() as a fallback for other
> desired migratetype, and then later freed back through free_pcppages_bulk()
> without being actually used. This happens because free_pcppages_bulk() uses
> get_freepage_migratetype() to choose the free_list, and rmqueue_bulk() calls
> set_freepage_migratetype() with the *desired* migratetype and not the page's
> original MIGRATE_RESERVE migratetype.
> 
> This patch fixes the problem by moving the call to set_freepage_migratetype()
> from rmqueue_bulk() down to __rmqueue_smallest() and __rmqueue_fallback() where
> the actual page's migratetype (e.g. from which free_list the page is taken
> from) is used. Note that this migratetype might be different from the
> pageblock's migratetype due to freepage stealing decisions. This is OK, as page
> stealing never uses MIGRATE_RESERVE as a fallback, and also takes care to leave
> all MIGRATE_CMA pages on the correct freelist.
> 
> Therefore, as an additional benefit, the call to get_pageblock_migratetype()
> from rmqueue_bulk() when CMA is enabled, can be removed completely. This relies
> on the fact that MIGRATE_CMA pageblocks are created only during system init,
> and the above. The related is_migrate_isolate() check is also unnecessary, as
> memory isolation has other ways to move pages between freelists, and drain
> pcp lists containing pages that should be isolated.
> The buffered_rmqueue() can also benefit from calling get_freepage_migratetype()
> instead of get_pageblock_migratetype().

Nice description.

> 
> A separate patch will add VM_BUG_ON checks for the invariant that for
> MIGRATE_RESERVE and MIGRATE_CMA pageblocks, freepage_migratetype must equal to
> pageblock_migratetype so that these pages always go to the correct free_list.
> 
> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
> Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Suggested-by: Mel Gorman <mgorman@suse.de>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-04-03 15:40           ` Vlastimil Babka
@ 2014-04-30 21:46             ` Sasha Levin
  -1 siblings, 0 replies; 52+ messages in thread
From: Sasha Levin @ 2014-04-30 21:46 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Joonsoo Kim, Bartlomiej Zolnierkiewicz
  Cc: linux-kernel, linux-mm, Mel Gorman, Yong-Taek Lee, Minchan Kim,
	KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins, Rik van Riel,
	Michal Nazarewicz, Dave Jones

On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
> pageblock might be changed to other migratetype in try_to_steal_freepages().
> For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
> they could get allocated as unmovable and result in CMA failure.
> 
> This is ensured by setting the freepage_migratetype appropriately when placing
> pages on pcp lists, and using the information when releasing them back to
> free_list. It is also assumed that CMA and RESERVE pageblocks are created only
> in the init phase. This patch adds DEBUG_VM checks to catch any regressions
> introduced for this invariant.
> 
> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Two issues with this patch.

First:

[ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
[ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 3446.320082] Dumping ftrace buffer:
[ 3446.320082]    (ftrace buffer empty)
[ 3446.320082] Modules linked in:
[ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
[ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
[ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
[ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
[ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
[ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
[ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
[ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
[ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
[ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
[ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
[ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
[ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
[ 3446.335888] Stack:
[ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
[ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
[ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
[ 3446.335888] Call Trace:
[ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
[ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
[ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
[ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
[ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
[ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
[ 3446.335888] ? find_get_entry (mm/filemap.c:979)
[ 3446.335888] ? find_get_entry (mm/filemap.c:940)
[ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
[ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
[ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
[ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
[ 3446.335888] shmem_fault (mm/shmem.c:1237)
[ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
[ 3446.335888] __do_fault (mm/memory.c:3344)
[ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
[ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
[ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
[ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
[ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
[ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
[ 3446.335888] handle_mm_fault (mm/memory.c:3973)
[ 3446.335888] __get_user_pages (mm/memory.c:1863)
[ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
[ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
[ 3446.335888] __mm_populate (mm/mlock.c:711)
[ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
[ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
[ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
[ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
[ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
[ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
[ 3446.335888]  RSP <ffff88053e247778>

And second:

[snip]

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2dbaba1..0ee9f8c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -697,6 +697,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  			page = list_entry(list->prev, struct page, lru);
>  			/* must delete as __free_one_page list manipulates */
>  			list_del(&page->lru);
> +
> +			VM_BUG_ON(!check_freepage_migratetype(page));
>  			mt = get_freepage_migratetype(page);
>  			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
>  			__free_one_page(page, zone, 0, mt);
> @@ -1190,6 +1192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		struct page *page = __rmqueue(zone, order, migratetype);
>  		if (unlikely(page == NULL))
>  			break;
> +		VM_BUG_ON(!check_freepage_migratetype(page));
>  
>  		/*
>  		 * Split buddy pages returned by expand() are received here
> 

Could the VM_BUG_ON()s be VM_BUG_ON_PAGE() instead?


Thanks,
Sasha


* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-04-30 21:46             ` Sasha Levin
@ 2014-05-02 12:08               ` Vlastimil Babka
  -1 siblings, 0 replies; 52+ messages in thread
From: Vlastimil Babka @ 2014-05-02 12:08 UTC (permalink / raw)
  To: Sasha Levin, Andrew Morton, Joonsoo Kim, Bartlomiej Zolnierkiewicz
  Cc: linux-kernel, linux-mm, Mel Gorman, Yong-Taek Lee, Minchan Kim,
	KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins, Rik van Riel,
	Michal Nazarewicz, Dave Jones

On 04/30/2014 11:46 PM, Sasha Levin wrote:
> On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
>> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
>> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
>> pageblock might be changed to other migratetype in try_to_steal_freepages().
>> For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
>> they could get allocated as unmovable and result in CMA failure.
>>
>> This is ensured by setting the freepage_migratetype appropriately when placing
>> pages on pcp lists, and using the information when releasing them back to
>> free_list. It is also assumed that CMA and RESERVE pageblocks are created only
>> in the init phase. This patch adds DEBUG_VM checks to catch any regressions
>> introduced for this invariant.
>>
>> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
>> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Michal Nazarewicz <mina86@mina86.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Two issues with this patch.
> 
> First:
> 
> [ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
> [ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 3446.320082] Dumping ftrace buffer:
> [ 3446.320082]    (ftrace buffer empty)
> [ 3446.320082] Modules linked in:
> [ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
> [ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
> [ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
> [ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
> [ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
> [ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
> [ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
> [ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
> [ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
> [ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
> [ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
> [ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
> [ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
> [ 3446.335888] Stack:
> [ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
> [ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
> [ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
> [ 3446.335888] Call Trace:
> [ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
> [ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
> [ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
> [ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
> [ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> [ 3446.335888] ? find_get_entry (mm/filemap.c:979)
> [ 3446.335888] ? find_get_entry (mm/filemap.c:940)
> [ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
> [ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
> [ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
> [ 3446.335888] shmem_fault (mm/shmem.c:1237)
> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
> [ 3446.335888] __do_fault (mm/memory.c:3344)
> [ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
> [ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> [ 3446.335888] handle_mm_fault (mm/memory.c:3973)
> [ 3446.335888] __get_user_pages (mm/memory.c:1863)
> [ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
> [ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
> [ 3446.335888] __mm_populate (mm/mlock.c:711)
> [ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
> [ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
> [ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
> [ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
> [ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
> [ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
> [ 3446.335888]  RSP <ffff88053e247778>

Hey, that's not an issue; it means the patch works as intended :) And
I believe it's not a bug introduced by PATCH 1/2.

So, according to my decodecode reading, RAX is the result of
get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
freepage_migratetype has just been set either by __rmqueue_smallest() or
__rmqueue_fallback(), according to the free_list the page has been taken
from. So this looks like a page from a MIGRATE_RESERVE pageblock found
on a !MIGRATE_RESERVE free_list, which is exactly what the patch intends
to catch.

I think there are two possible explanations.

1) The pageblock is genuinely MIGRATE_RESERVE and the page was misplaced
by mistake. I think it didn't happen in free_pcppages_bulk(), as there's
the same VM_BUG_ON there, which would presumably have triggered at the
moment of misplacement. In theory it's possible that there's a race
through __free_pages_ok() -> free_one_page() where the
get_pageblock_migratetype() in __free_pages_ok() races with
set_pageblock_migratetype() and results in a bogus value. But nobody
should be calling set_pageblock_migratetype() on a MIGRATE_RESERVE
pageblock.
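
As a minimal sketch of how such a race can produce a bogus value (purely
illustrative userspace code, not the kernel implementation; it only
assumes the migratetype is rewritten bit by bit, as the for-cycle in
set_pageblock_flags_group() of this era does), a concurrent reader can
observe a value that was never actually stored:

#include <pthread.h>
#include <stdio.h>

/* MOVABLE (0b10) -> RECLAIMABLE (0b01) passes through 0b11 when the
 * two bits are updated one at a time without locking. */
static volatile unsigned long mt_bits = 2;	/* "MOVABLE" */
static volatile int stop;

static void *writer(void *arg)
{
	(void)arg;
	while (!stop) {
		mt_bits |= 1UL;		/* set bit 0: field reads 0b11 */
		mt_bits &= ~2UL;	/* clear bit 1: now 0b01 */
		mt_bits |= 2UL;		/* and change back again */
		mt_bits &= ~1UL;
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	long reads;

	pthread_create(&t, NULL, writer, NULL);
	for (reads = 0; reads < 100000000L; reads++) {
		if ((mt_bits & 3UL) == 3UL) {	/* neither old nor new */
			printf("bogus transient value 3 after %ld reads\n",
			       reads);
			break;
		}
	}
	stop = 1;
	pthread_join(t, NULL);
	return 0;
}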

2) The pageblock was marked as MIGRATE_RESERVE due to a race between
set_pageblock_migratetype() and set_pageblock_skip(). The latter is
currently not serialized by zone->lock, nor does it use atomic bit
operations, so it may cause lost updates in a racing
set_pageblock_migratetype(). I think a well-timed race while changing a
pageblock from MIGRATE_MOVABLE to MIGRATE_RECLAIMABLE could leave the
bits reading as the MIGRATE_RESERVE value. Similar races have already
been observed to be a problem where frequent changing to/from
MIGRATE_ISOLATE is involved, and I did a patch series to address this,
but it was not complete and I postponed it until after Mel's changes
that remove the racy for-cycles completely. So it might be that his
"[PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set
pageblock bitmaps" already solves this bug (though maybe only on certain
architectures where you don't need atomic operations). You might want to
try that patch if you can reproduce this bug frequently enough.
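
And a self-contained model of the lost-update scenario (again
illustrative userspace code only; it assumes the migratetype occupies
two bits, that the skip bit shares the same bitmap word, and that
MIGRATE_RESERVE has the value 3, as in kernels of this era):

#include <pthread.h>
#include <stdio.h>

/* Bits 0-1 hold the migratetype, bit 2 the skip bit. Neither updater
 * takes a lock or uses atomics, like set_pageblock_migratetype() vs.
 * set_pageblock_skip(). */
static volatile unsigned long word;

static void *change_migratetype(void *arg)
{
	(void)arg;
	word |= 1UL;	/* MOVABLE -> RECLAIMABLE: set bit 0... */
	word &= ~2UL;	/* ...then clear bit 1 */
	return NULL;
}

static void *set_skip(void *arg)
{
	(void)arg;
	/* If this read-modify-write loads before the bit-1 clear above
	 * and stores after it, that clear is lost. */
	word |= 4UL;
	return NULL;
}

int main(void)
{
	int i;

	for (i = 0; i < 1000000; i++) {
		pthread_t a, b;

		word = 2UL;	/* "MOVABLE" */
		pthread_create(&a, NULL, change_migratetype, NULL);
		pthread_create(&b, NULL, set_skip, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		if ((word & 3UL) == 3UL) {
			printf("iteration %d: migratetype bits read 3 "
			       "(MIGRATE_RESERVE)\n", i);
			return 0;
		}
	}
	printf("race not hit in this run\n");
	return 0;
}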


> And second:
> 
> [snip]
> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 2dbaba1..0ee9f8c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -697,6 +697,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>   			page = list_entry(list->prev, struct page, lru);
>>   			/* must delete as __free_one_page list manipulates */
>>   			list_del(&page->lru);
>> +
>> +			VM_BUG_ON(!check_freepage_migratetype(page));
>>   			mt = get_freepage_migratetype(page);
>>   			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
>>   			__free_one_page(page, zone, 0, mt);
>> @@ -1190,6 +1192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>>   		struct page *page = __rmqueue(zone, order, migratetype);
>>   		if (unlikely(page == NULL))
>>   			break;
>> +		VM_BUG_ON(!check_freepage_migratetype(page));
>>   
>>   		/*
>>   		 * Split buddy pages returned by expand() are received here
>>
> 
> Could the VM_BUG_ON()s be VM_BUG_ON_PAGE() instead?

Right. Andrew, can you please add and fold this:

-----8<-----
From: Vlastimil Babka <vbabka@suse.cz>
Date: Fri, 2 May 2014 13:20:48 +0200
Subject: [PATCH] 
 mm-page_alloc-debug_vm-checks-for-free_list-placement-of-cma-and-reserve-pages-fix

Use VM_BUG_ON_PAGE instead of VM_BUG_ON as suggested by Sasha Levin.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 343c684..a64d672 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -700,7 +700,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
 
-			VM_BUG_ON(!check_freepage_migratetype(page));
+			VM_BUG_ON_PAGE(!check_freepage_migratetype(page), page);
 			mt = get_freepage_migratetype(page);
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, zone, 0, mt);
@@ -1194,7 +1194,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
 			break;
-		VM_BUG_ON(!check_freepage_migratetype(page));
+		VM_BUG_ON_PAGE(!check_freepage_migratetype(page), page);
 
 		/*
 		 * Split buddy pages returned by expand() are received here
-- 
1.8.4.5




^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-02 12:08               ` Vlastimil Babka
@ 2014-05-05 14:36                 ` Sasha Levin
  -1 siblings, 0 replies; 52+ messages in thread
From: Sasha Levin @ 2014-05-05 14:36 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Joonsoo Kim, Bartlomiej Zolnierkiewicz
  Cc: linux-kernel, linux-mm, Mel Gorman, Yong-Taek Lee, Minchan Kim,
	KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins, Rik van Riel,
	Michal Nazarewicz, Dave Jones

On 05/02/2014 08:08 AM, Vlastimil Babka wrote:
> On 04/30/2014 11:46 PM, Sasha Levin wrote:
>> > On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
>>> >> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
>>> >> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
>>> >> pageblock might be changed to other migratetype in try_to_steal_freepages().
>>> >> For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
>>> >> they could get allocated as unmovable and result in CMA failure.
>>> >>
>>> >> This is ensured by setting the freepage_migratetype appropriately when placing
>>> >> pages on pcp lists, and using the information when releasing them back to
>>> >> free_list. It is also assumed that CMA and RESERVE pageblocks are created only
>>> >> in the init phase. This patch adds DEBUG_VM checks to catch any regressions
>>> >> introduced for this invariant.
>>> >>
>>> >> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
>>> >> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
>>> >> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>> >> Cc: Mel Gorman <mgorman@suse.de>
>>> >> Cc: Minchan Kim <minchan@kernel.org>
>>> >> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>> >> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>>> >> Cc: Hugh Dickins <hughd@google.com>
>>> >> Cc: Rik van Riel <riel@redhat.com>
>>> >> Cc: Michal Nazarewicz <mina86@mina86.com>
>>> >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> > 
>> > Two issues with this patch.
>> > 
>> > First:
>> > 
>> > [ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
>> > [ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> > [ 3446.320082] Dumping ftrace buffer:
>> > [ 3446.320082]    (ftrace buffer empty)
>> > [ 3446.320082] Modules linked in:
>> > [ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
>> > [ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
>> > [ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>> > [ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
>> > [ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
>> > [ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
>> > [ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
>> > [ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
>> > [ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
>> > [ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
>> > [ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> > [ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
>> > [ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
>> > [ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
>> > [ 3446.335888] Stack:
>> > [ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
>> > [ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
>> > [ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
>> > [ 3446.335888] Call Trace:
>> > [ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
>> > [ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
>> > [ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
>> > [ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
>> > [ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
>> > [ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
>> > [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>> > [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>> > [ 3446.335888] ? find_get_entry (mm/filemap.c:979)
>> > [ 3446.335888] ? find_get_entry (mm/filemap.c:940)
>> > [ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
>> > [ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
>> > [ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
>> > [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>> > [ 3446.335888] shmem_fault (mm/shmem.c:1237)
>> > [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>> > [ 3446.335888] __do_fault (mm/memory.c:3344)
>> > [ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>> > [ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
>> > [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>> > [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>> > [ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
>> > [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>> > [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>> > [ 3446.335888] handle_mm_fault (mm/memory.c:3973)
>> > [ 3446.335888] __get_user_pages (mm/memory.c:1863)
>> > [ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
>> > [ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
>> > [ 3446.335888] __mm_populate (mm/mlock.c:711)
>> > [ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
>> > [ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
>> > [ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
>> > [ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
>> > [ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
>> > [ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>> > [ 3446.335888]  RSP <ffff88053e247778>
> Hey, that's not an issue, that means the patch works as intended :) And
> I believe it's not a bug introduced by PATCH 1/2.
> 
> So, according to my decodecode reading, RAX is the results of
> get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
> of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
> freepage_migratetype has just been set either by __rmqueue_smallest() or
> __rmqueue_fallback(), according to the free_list the page has been taken
> from. So this looks like a page from MIGRATE_RESERVE pageblock found on
> the !MIGRATE_RESERVE free_list, which is exactly what the patch intends
> to catch.
> 
> I think there are two possible explanations.
> 
> 1) the pageblock is genuinely MIGRATE_RESERVE and it was misplaced by
> mistake. I think it wasn't in free_pcppages_bulk() as there's the same
> VM_BUG_ON which would supposedly trigger at the moment of displacing. In
> theory it's possible that there's a race through __free_pages_ok() ->
> free_one_page() where the get_pageblock_migratetype() in
> __free_pages_ok() would race with set_pageblock_migratetype() and result
> in bogus value. But nobody should be calling set_pageblock_migratetype()
> on a MIGRATE_RESERVE pageblock.
> 
> 2) the pageblock was marked as MIGRATE_RESERVE due to a race between
> set_pageblock_migratetype() and set_pageblock_skip(). The latter is
> currently not serialized by zone->lock, nor it uses atomic bit set. So
> it may result in lost updates in a racing set_pageblock_migratetype(). I
> think a well-placed race when changing pageblock from MIGRATE_MOVABLE to
> MIGRATE_RECLAIMABLE could result in MIGRATE_RESERVE value. Similar races
> have been already observed to be a problem where frequent changing
> to/from MIGRATE_ISOLATE is involved, and I did a patch series to address
> this, but it was not complete and I postponed it after Mel's changes
> that remove the racy for-cycles completely. So it might be that his
> "[PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set
> pageblock bitmaps" already solves this bug (but maybe only on certain
> architectures where you don't need atomic operations). You might try
> that patch if you can reproduce this bug frequently enough?

I've tried that patch, but still see the same BUG_ON.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-05 14:36                 ` Sasha Levin
@ 2014-05-05 15:50                   ` Vlastimil Babka
  -1 siblings, 0 replies; 52+ messages in thread
From: Vlastimil Babka @ 2014-05-05 15:50 UTC (permalink / raw)
  To: Sasha Levin, Andrew Morton, Joonsoo Kim, Bartlomiej Zolnierkiewicz
  Cc: linux-kernel, linux-mm, Mel Gorman, Yong-Taek Lee, Minchan Kim,
	KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins, Rik van Riel,
	Michal Nazarewicz, Dave Jones

On 05/05/2014 04:36 PM, Sasha Levin wrote:
> On 05/02/2014 08:08 AM, Vlastimil Babka wrote:
>> On 04/30/2014 11:46 PM, Sasha Levin wrote:
>>>> On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
>>>>>> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
>>>>>> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
>>>>>> pageblock might be changed to other migratetype in try_to_steal_freepages().
>>>>>> For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
>>>>>> they could get allocated as unmovable and result in CMA failure.
>>>>>>
>>>>>> This is ensured by setting the freepage_migratetype appropriately when placing
>>>>>> pages on pcp lists, and using the information when releasing them back to
>>>>>> free_list. It is also assumed that CMA and RESERVE pageblocks are created only
>>>>>> in the init phase. This patch adds DEBUG_VM checks to catch any regressions
>>>>>> introduced for this invariant.
>>>>>>
>>>>>> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
>>>>>> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
>>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>>>> Cc: Minchan Kim <minchan@kernel.org>
>>>>>> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>>> Cc: Hugh Dickins <hughd@google.com>
>>>>>> Cc: Rik van Riel <riel@redhat.com>
>>>>>> Cc: Michal Nazarewicz <mina86@mina86.com>
>>>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>>>
>>>> Two issues with this patch.
>>>>
>>>> First:
>>>>
>>>> [ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
>>>> [ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>>> [ 3446.320082] Dumping ftrace buffer:
>>>> [ 3446.320082]    (ftrace buffer empty)
>>>> [ 3446.320082] Modules linked in:
>>>> [ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
>>>> [ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
>>>> [ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>>>> [ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
>>>> [ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
>>>> [ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
>>>> [ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
>>>> [ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
>>>> [ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
>>>> [ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
>>>> [ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>> [ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
>>>> [ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
>>>> [ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
>>>> [ 3446.335888] Stack:
>>>> [ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
>>>> [ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
>>>> [ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
>>>> [ 3446.335888] Call Trace:
>>>> [ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
>>>> [ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
>>>> [ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
>>>> [ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
>>>> [ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
>>>> [ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:979)
>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:940)
>>>> [ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
>>>> [ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
>>>> [ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>>>> [ 3446.335888] shmem_fault (mm/shmem.c:1237)
>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>>>> [ 3446.335888] __do_fault (mm/memory.c:3344)
>>>> [ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>>>> [ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>>>> [ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>>>> [ 3446.335888] handle_mm_fault (mm/memory.c:3973)
>>>> [ 3446.335888] __get_user_pages (mm/memory.c:1863)
>>>> [ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
>>>> [ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
>>>> [ 3446.335888] __mm_populate (mm/mlock.c:711)
>>>> [ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
>>>> [ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
>>>> [ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
>>>> [ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
>>>> [ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
>>>> [ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>>>> [ 3446.335888]  RSP <ffff88053e247778>
>> Hey, that's not an issue, that means the patch works as intended :) And
>> I believe it's not a bug introduced by PATCH 1/2.
>>
>> So, according to my decodecode reading, RAX is the results of
>> get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
>> of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
>> freepage_migratetype has just been set either by __rmqueue_smallest() or
>> __rmqueue_fallback(), according to the free_list the page has been taken
>> from. So this looks like a page from MIGRATE_RESERVE pageblock found on
>> the !MIGRATE_RESERVE free_list, which is exactly what the patch intends
>> to catch.
>>
>> I think there are two possible explanations.
>>
>> 1) the pageblock is genuinely MIGRATE_RESERVE and it was misplaced by
>> mistake. I think it wasn't in free_pcppages_bulk() as there's the same
>> VM_BUG_ON which would supposedly trigger at the moment of displacing. In
>> theory it's possible that there's a race through __free_pages_ok() ->
>> free_one_page() where the get_pageblock_migratetype() in
>> __free_pages_ok() would race with set_pageblock_migratetype() and result
>> in bogus value. But nobody should be calling set_pageblock_migratetype()
>> on a MIGRATE_RESERVE pageblock.
>>
>> 2) the pageblock was marked as MIGRATE_RESERVE due to a race between
>> set_pageblock_migratetype() and set_pageblock_skip(). The latter is
>> currently not serialized by zone->lock, nor it uses atomic bit set. So
>> it may result in lost updates in a racing set_pageblock_migratetype(). I
>> think a well-placed race when changing pageblock from MIGRATE_MOVABLE to
>> MIGRATE_RECLAIMABLE could result in MIGRATE_RESERVE value. Similar races
>> have been already observed to be a problem where frequent changing
>> to/from MIGRATE_ISOLATE is involved, and I did a patch series to address
>> this, but it was not complete and I postponed it after Mel's changes
>> that remove the racy for-cycles completely. So it might be that his
>> "[PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set
>> pageblock bitmaps" already solves this bug (but maybe only on certain
>> architectures where you don't need atomic operations). You might try
>> that patch if you can reproduce this bug frequently enough?
>
> I've tried that patch, but still see the same BUG_ON.

Oh damn, I've realized that my assumption that MIGRATE_RESERVE
pageblocks are created only at zone init time was wrong.
setup_zone_migrate_reserve() is also called from the handler of the
min_free_kbytes sysctl... does trinity try to change that while running?
The function will change MOVABLE pageblocks to RESERVE and try to move
all free pages to the RESERVE free_list, but of course pages on pcplists
will remain marked MOVABLE and may trigger the VM_BUG_ON. You triggered
the bug with a page on the MOVABLE free_list (in the first reply I said
it was UNMOVABLE by mistake), so this would be a good explanation if
trinity changes min_free_kbytes.
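
A minimal sketch of that window (function names as in v3.15-rc; the
interleaving below is an assumed illustration, not a captured trace):

	CPU 0                                CPU 1
	-----                                -----
	free_hot_cold_page(page)
	  set_freepage_migratetype(page,
	      MIGRATE_MOVABLE);  /* pageblock is MOVABLE */
	  /* page parked on pcplist */
	                                     sysctl write to min_free_kbytes
	                                       setup_per_zone_wmarks()
	                                         setup_zone_migrate_reserve()
	                                           set_pageblock_migratetype(pb,
	                                               MIGRATE_RESERVE)
	free_pcppages_bulk()
	  get_freepage_migratetype(page)  /* still MOVABLE, but the
	                                     pageblock now reads RESERVE,
	                                     so the DEBUG_VM check fires */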

Furthermore, I think there's a problem in that setup_zone_migrate_reserve()
operates on pageblocks, but as MAX_ORDER is higher than pageblock_order,
RESERVE pages might be merged with buddies of a different migratetype and
end up on their free_list. That seems to me like a flaw in the design of
the reserves, but perhaps others won't think it's serious enough to fix?
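
To put numbers on the merging concern (default x86_64 with 4 KiB pages;
pageblock_order = 9 and MAX_ORDER = 11 are assumptions about the config,
not taken from this thread):

	pageblock size: 2^9  pages = 2 MiB
	largest buddy:  2^10 pages = 4 MiB   (MAX_ORDER - 1 = 10)

so a single maximal buddy spans two pageblocks, and freeing pages next
to a RESERVE pageblock can merge them into a buddy that then sits on the
free_list of the neighboring pageblock's migratetype.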

So in the end this DEBUG_VM check probably cannot work anymore for
MIGRATE_RESERVE, only for CMA. I'm not sure if it's worth keeping it
only for CMA; what are the CMA guys' opinions on that?

Also this means that the 1/2 patch "prevent MIGRATE_RESERVE pages from
being misplaced" still won't prevent stealing a MIGRATE_RESERVE
pageblock when __rmqueue_fallback() encounters a strayed MIGRATE_RESERVE
page on e.g. a MOVABLE free_list. This is fixable by having
__rmqueue_fallback() not trust the migratetype of the free_list and
check the pageblock migratetype instead (see the sketch below). I hate
that, but at least it's not on the fast path...
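
A rough sketch of that idea, purely hypothetical (no such patch has
been posted); the helper names follow the v3.15-rc source:

	/*
	 * In __rmqueue_fallback(): do not trust the free_list the page
	 * was found on; re-read the pageblock migratetype and refuse to
	 * convert a RESERVE pageblock.
	 */
	page = list_entry(area->free_list[migratetype].next,
			  struct page, lru);
	if (get_pageblock_migratetype(page) == MIGRATE_RESERVE)
		/* strayed RESERVE page: use it, but keep the pageblock */
		new_type = migratetype;
	else
		new_type = try_to_steal_freepages(zone, page,
						  start_migratetype,
						  migratetype);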

> Thanks,
> Sasha
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-05 15:50                   ` Vlastimil Babka
@ 2014-05-05 16:37                     ` Sasha Levin
  -1 siblings, 0 replies; 52+ messages in thread
From: Sasha Levin @ 2014-05-05 16:37 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Joonsoo Kim, Bartlomiej Zolnierkiewicz
  Cc: linux-kernel, linux-mm, Mel Gorman, Yong-Taek Lee, Minchan Kim,
	KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins, Rik van Riel,
	Michal Nazarewicz, Dave Jones

On 05/05/2014 11:50 AM, Vlastimil Babka wrote:
> On 05/05/2014 04:36 PM, Sasha Levin wrote:
>> On 05/02/2014 08:08 AM, Vlastimil Babka wrote:
>>> On 04/30/2014 11:46 PM, Sasha Levin wrote:
>>>>> On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
>>>>>>> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
>>>>>>> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
>>>>>>> pageblock might be changed to other migratetype in try_to_steal_freepages().
>>>>>>> For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
>>>>>>> they could get allocated as unmovable and result in CMA failure.
>>>>>>>
>>>>>>> This is ensured by setting the freepage_migratetype appropriately when placing
>>>>>>> pages on pcp lists, and using the information when releasing them back to
>>>>>>> free_list. It is also assumed that CMA and RESERVE pageblocks are created only
>>>>>>> in the init phase. This patch adds DEBUG_VM checks to catch any regressions
>>>>>>> introduced for this invariant.
>>>>>>>
>>>>>>> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
>>>>>>> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
>>>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>>>>> Cc: Minchan Kim <minchan@kernel.org>
>>>>>>> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>>>> Cc: Hugh Dickins <hughd@google.com>
>>>>>>> Cc: Rik van Riel <riel@redhat.com>
>>>>>>> Cc: Michal Nazarewicz <mina86@mina86.com>
>>>>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>>>>
>>>>> Two issues with this patch.
>>>>>
>>>>> First:
>>>>>
>>>>> [ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
>>>>> [ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>>>> [ 3446.320082] Dumping ftrace buffer:
>>>>> [ 3446.320082]    (ftrace buffer empty)
>>>>> [ 3446.320082] Modules linked in:
>>>>> [ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
>>>>> [ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
>>>>> [ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>>>>> [ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
>>>>> [ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
>>>>> [ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
>>>>> [ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
>>>>> [ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
>>>>> [ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
>>>>> [ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
>>>>> [ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>> [ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
>>>>> [ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
>>>>> [ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
>>>>> [ 3446.335888] Stack:
>>>>> [ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
>>>>> [ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
>>>>> [ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
>>>>> [ 3446.335888] Call Trace:
>>>>> [ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
>>>>> [ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
>>>>> [ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
>>>>> [ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
>>>>> [ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
>>>>> [ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:979)
>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:940)
>>>>> [ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
>>>>> [ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
>>>>> [ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>>>>> [ 3446.335888] shmem_fault (mm/shmem.c:1237)
>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>>>>> [ 3446.335888] __do_fault (mm/memory.c:3344)
>>>>> [ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>>>>> [ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>>>>> [ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>>>>> [ 3446.335888] handle_mm_fault (mm/memory.c:3973)
>>>>> [ 3446.335888] __get_user_pages (mm/memory.c:1863)
>>>>> [ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
>>>>> [ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
>>>>> [ 3446.335888] __mm_populate (mm/mlock.c:711)
>>>>> [ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
>>>>> [ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
>>>>> [ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
>>>>> [ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
>>>>> [ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
>>>>> [ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>>>>> [ 3446.335888]  RSP <ffff88053e247778>
>>> Hey, that's not an issue, that means the patch works as intended :) And
>>> I believe it's not a bug introduced by PATCH 1/2.
>>>
>>> So, according to my decodecode reading, RAX is the results of
>>> get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
>>> of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
>>> freepage_migratetype has just been set either by __rmqueue_smallest() or
>>> __rmqueue_fallback(), according to the free_list the page has been taken
>>> from. So this looks like a page from MIGRATE_RESERVE pageblock found on
>>> the !MIGRATE_RESERVE free_list, which is exactly what the patch intends
>>> to catch.
>>>
>>> I think there are two possible explanations.
>>>
>>> 1) the pageblock is genuinely MIGRATE_RESERVE and it was misplaced by
>>> mistake. I think it wasn't in free_pcppages_bulk() as there's the same
>>> VM_BUG_ON which would supposedly trigger at the moment of displacing. In
>>> theory it's possible that there's a race through __free_pages_ok() ->
>>> free_one_page() where the get_pageblock_migratetype() in
>>> __free_pages_ok() would race with set_pageblock_migratetype() and result
>>> in bogus value. But nobody should be calling set_pageblock_migratetype()
>>> on a MIGRATE_RESERVE pageblock.
>>>
>>> 2) the pageblock was marked as MIGRATE_RESERVE due to a race between
>>> set_pageblock_migratetype() and set_pageblock_skip(). The latter is
>>> currently not serialized by zone->lock, nor does it use an atomic bit
>>> set, so it may result in lost updates when racing with
>>> set_pageblock_migratetype(). I think a well-placed race when changing a
>>> pageblock from MIGRATE_MOVABLE to MIGRATE_RECLAIMABLE could result in
>>> the MIGRATE_RESERVE value. Similar races have already been observed to
>>> be a problem where frequent changing to/from MIGRATE_ISOLATE is
>>> involved, and I did a patch series to address this, but it was not
>>> complete and I postponed it until after Mel's changes that remove the
>>> racy for-cycles completely. So it might be that his "[PATCH 08/17] mm:
>>> page_alloc: Use word-based accesses for get/set pageblock bitmaps"
>>> already solves this bug (but maybe only on certain architectures where
>>> you don't need atomic operations). You might try that patch if you can
>>> reproduce this bug frequently enough?
>>
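
The lost-update scenario in (2) boils down to two unsynchronized
read-modify-write sequences on the same pageblock bitmap word. A minimal
sketch (illustrative names and bit layout, not the exact kernel helpers):

/* Both updaters read-modify-write the shared word without zone->lock
 * or atomic ops, so one writer can silently undo the other.
 */
static void sketch_set_migratetype(unsigned long *word, unsigned long mt)
{
        unsigned long v = *word;        /* read */
        v &= ~0x7UL;                    /* clear migratetype bits */
        v |= mt;                        /* set the new migratetype */
        *word = v;                      /* write back */
}

static void sketch_set_skip(unsigned long *word)
{
        unsigned long v = *word;        /* may have read a stale value */
        v |= 0x8UL;                     /* set the skip bit */
        *word = v;                      /* write back, losing a migratetype
                                         * update that raced in between */
}

With the migratetype values of that era (MOVABLE = 2 = 0b10,
RECLAIMABLE = 1 = 0b01, RESERVE = 3 = 0b11), losing just the bit-clear
half of a MOVABLE -> RECLAIMABLE transition leaves 0b11, i.e.
MIGRATE_RESERVE, exactly as described above.
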
>> I've tried that patch, but still see the same BUG_ON.
> 
> Oh damn, I've realized that my assumptions about MIGRATE_RESERVE pageblocks being created only at zone init time were wrong. setup_zone_migrate_reserve() is also called from the handler of the min_free_kbytes sysctl... does trinity try to change that while running?

There's nothing that will prevent it from changing that.
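
For context, re-triggering setup_zone_migrate_reserve() from userspace
takes no more than a write to the sysctl file, e.g. (illustrative value;
any privileged writer, fuzzer or not, can do this at runtime):

#include <stdio.h>

int main(void)
{
        /* retunes watermarks and re-runs setup_zone_migrate_reserve(),
         * which re-marks pageblocks as MIGRATE_RESERVE */
        FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "w");

        if (!f)
                return 1;
        fprintf(f, "65536\n");
        return fclose(f) ? 1 : 0;
}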

Thanks,
Sasha

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-05 15:50                   ` Vlastimil Babka
@ 2014-05-07  1:33                     ` Minchan Kim
  -1 siblings, 0 replies; 52+ messages in thread
From: Minchan Kim @ 2014-05-07  1:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sasha Levin, Andrew Morton, Joonsoo Kim,
	Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm, Mel Gorman,
	Yong-Taek Lee, KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins,
	Rik van Riel, Michal Nazarewicz, Dave Jones

On Mon, May 05, 2014 at 05:50:46PM +0200, Vlastimil Babka wrote:
> On 05/05/2014 04:36 PM, Sasha Levin wrote:
> >On 05/02/2014 08:08 AM, Vlastimil Babka wrote:
> >>On 04/30/2014 11:46 PM, Sasha Levin wrote:
> >>>>On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
> >>>>>>For the MIGRATE_RESERVE pages, it is important they do not get misplaced
> >>>>>>on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
> >>>>>>pageblock might be changed to other migratetype in try_to_steal_freepages().
> >>>>>>For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
> >>>>>>they could get allocated as unmovable and result in CMA failure.
> >>>>>>
> >>>>>>This is ensured by setting the freepage_migratetype appropriately when placing
> >>>>>>pages on pcp lists, and using the information when releasing them back to
> >>>>>>free_list. It is also assumed that CMA and RESERVE pageblocks are created only
> >>>>>>in the init phase. This patch adds DEBUG_VM checks to catch any regressions
> >>>>>>introduced for this invariant.
> >>>>>>
> >>>>>>Cc: Yong-Taek Lee <ytk.lee@samsung.com>
> >>>>>>Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
> >>>>>>Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >>>>>>Cc: Mel Gorman <mgorman@suse.de>
> >>>>>>Cc: Minchan Kim <minchan@kernel.org>
> >>>>>>Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >>>>>>Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> >>>>>>Cc: Hugh Dickins <hughd@google.com>
> >>>>>>Cc: Rik van Riel <riel@redhat.com>
> >>>>>>Cc: Michal Nazarewicz <mina86@mina86.com>
> >>>>>>Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >>>>
> >>>>Two issues with this patch.
> >>>>
> >>>>First:
> >>>>
> >>>>[ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
> >>>>[ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> >>>>[ 3446.320082] Dumping ftrace buffer:
> >>>>[ 3446.320082]    (ftrace buffer empty)
> >>>>[ 3446.320082] Modules linked in:
> >>>>[ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
> >>>>[ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
> >>>>[ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
> >>>>[ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
> >>>>[ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
> >>>>[ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
> >>>>[ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
> >>>>[ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
> >>>>[ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
> >>>>[ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
> >>>>[ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>>>[ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
> >>>>[ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
> >>>>[ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
> >>>>[ 3446.335888] Stack:
> >>>>[ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
> >>>>[ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
> >>>>[ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
> >>>>[ 3446.335888] Call Trace:
> >>>>[ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
> >>>>[ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> >>>>[ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
> >>>>[ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
> >>>>[ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
> >>>>[ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
> >>>>[ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
> >>>>[ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> >>>>[ 3446.335888] ? find_get_entry (mm/filemap.c:979)
> >>>>[ 3446.335888] ? find_get_entry (mm/filemap.c:940)
> >>>>[ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
> >>>>[ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
> >>>>[ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
> >>>>[ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
> >>>>[ 3446.335888] shmem_fault (mm/shmem.c:1237)
> >>>>[ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
> >>>>[ 3446.335888] __do_fault (mm/memory.c:3344)
> >>>>[ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> >>>>[ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
> >>>>[ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
> >>>>[ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
> >>>>[ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
> >>>>[ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
> >>>>[ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> >>>>[ 3446.335888] handle_mm_fault (mm/memory.c:3973)
> >>>>[ 3446.335888] __get_user_pages (mm/memory.c:1863)
> >>>>[ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
> >>>>[ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
> >>>>[ 3446.335888] __mm_populate (mm/mlock.c:711)
> >>>>[ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
> >>>>[ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
> >>>>[ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
> >>>>[ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
> >>>>[ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
> >>>>[ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
> >>>>[ 3446.335888]  RSP <ffff88053e247778>
> >>Hey, that's not an issue, that means the patch works as intended :) And
> >>I believe it's not a bug introduced by PATCH 1/2.
> >>
> >>So, according to my decodecode reading, RAX is the result of
> >>get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
> >>of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
> >>freepage_migratetype has just been set either by __rmqueue_smallest() or
> >>__rmqueue_fallback(), according to the free_list the page has been taken
> >>from. So this looks like a page from a MIGRATE_RESERVE pageblock found on
> >>a !MIGRATE_RESERVE free_list, which is exactly what the patch intends
> >>to catch.
> >>
> >>I think there are two possible explanations.
> >>
> >>1) the pageblock is genuinely MIGRATE_RESERVE and it was misplaced by
> >>mistake. I think it wasn't in free_pcppages_bulk(), as there's the same
> >>VM_BUG_ON there, which would presumably trigger at the moment of the
> >>misplacement. In theory it's possible that there's a race through
> >>__free_pages_ok() -> free_one_page() where the
> >>get_pageblock_migratetype() in __free_pages_ok() would race with
> >>set_pageblock_migratetype() and result in a bogus value. But nobody
> >>should be calling set_pageblock_migratetype() on a MIGRATE_RESERVE
> >>pageblock.
> >>
> >>2) the pageblock was marked as MIGRATE_RESERVE due to a race between
> >>set_pageblock_migratetype() and set_pageblock_skip(). The latter is
> >>currently not serialized by zone->lock, nor does it use an atomic bit
> >>set, so it may result in lost updates when racing with
> >>set_pageblock_migratetype(). I think a well-placed race when changing a
> >>pageblock from MIGRATE_MOVABLE to MIGRATE_RECLAIMABLE could result in
> >>the MIGRATE_RESERVE value. Similar races have already been observed to
> >>be a problem where frequent changing to/from MIGRATE_ISOLATE is
> >>involved, and I did a patch series to address this, but it was not
> >>complete and I postponed it until after Mel's changes that remove the
> >>racy for-cycles completely. So it might be that his "[PATCH 08/17] mm:
> >>page_alloc: Use word-based accesses for get/set pageblock bitmaps"
> >>already solves this bug (but maybe only on certain architectures where
> >>you don't need atomic operations). You might try that patch if you can
> >>reproduce this bug frequently enough?
> >
> >I've tried that patch, but still see the same BUG_ON.
> 
> Oh damn, I've realized that my assumptions about MIGRATE_RESERVE
> pageblocks being created only at zone init time were wrong.
> setup_zone_migrate_reserve() is also called from the handler of the
> min_free_kbytes sysctl... does trinity try to change that while
> running?
> The function will change MOVABLE pageblocks to RESERVE and try to
> move all free pages to the RESERVE free_list, but of course pages on
> pcplists will remain MOVABLE and may trigger the VM_BUG_ON. You
> triggered the bug with a page on a MOVABLE free_list (in the first
> reply I said it was UNMOVABLE by mistake), so this might be a good
> explanation if trinity changes min_free_kbytes.
> 
> Furthermore, I think there's a problem that
> setup_zone_migrate_reserve() operates on pageblocks, but as MAX_ORDER
> is higher than pageblock_order, RESERVE pages might be merged with
> buddies of different migratetype and end up on their free_list. That
> seems to me like a flaw in the design of reserves, but perhaps
> others won't think it's serious enough to fix?
> 
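
With typical x86_64 numbers, the merging problem just described works out
as follows (a worked illustration; the exact values are configuration
dependent):

/* pageblock_order = 9  -> one pageblock     = 2^9  = 512 pages
 * MAX_ORDER       = 11 -> buddies merge up to 2^10 = 1024 pages
 *
 * Freeing the last order-9 buddy of a RESERVE pageblock can merge it
 * with the order-9 buddy from the adjacent (say, MOVABLE) pageblock
 * into a single order-10 page, which then sits on one free_list even
 * though it spans two pageblocks of different migratetypes.
 */
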
> So in the end this DEBUG_VM check probably cannot work anymore for
> MIGRATE_RESERVE, only for CMA. I'm not sure if it's worth keeping it
> only for CMA, what are the CMA guys' opinions on that?

I really don't want it. That is why I didn't add my Acked-by at the time.
For a long time, I have never wanted to add more overhead to the hot path
for CMA unless it's really critical, and the same applies here.
Although such a debug patch helps us notice that something has gone wrong
for CMA, more information would be needed to know why CMA failed, because
there are other potential reasons for a CMA allocation to fail.

One idea is to store an allocation trace somewhere (e.g. a naive idea is
a per-page description, like page-owner); then we could look up the owner
of a given page and learn why we failed to migrate it out. With that, we
would figure out how on earth such a page got allocated from CMA, which
would be more helpful than just a VM_BUG_ON notice.

The whole point is that I'd like to avoid adding more overhead to the hot
path for a rare case, even though it's a debugging feature.
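
A naive sketch of that idea (all names are hypothetical, and
stack_trace_save() stands in generically for saving a call chain; this is
not an existing API of the time, though it resembles what page-owner
tracking does):

/* Record who allocated each page, so a failed CMA migration can be
 * attributed to its owner afterwards.
 */
struct page_alloc_trace {
        unsigned long entries[8];       /* allocation call chain */
        unsigned int nr_entries;
        gfp_t gfp_mask;                 /* flags used at alloc time */
};

static void record_page_owner(struct page_alloc_trace *pat, gfp_t gfp_mask)
{
        pat->gfp_mask = gfp_mask;
        pat->nr_entries = stack_trace_save(pat->entries,
                                           ARRAY_SIZE(pat->entries), 0);
}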

> 
> Also this means that the 1/2 patch "prevent MIGRATE_RESERVE pages
> from being misplaced" still won't prevent stealing a MIGRATE_RESERVE
> pageblock when __rmqueue_fallback() encounters a strayed
> MIGRATE_RESERVE page on e.g. a MOVABLE freelist. This is fixable by
> having __rmqueue_fallback() not trust the migratetype of the
> free_list and check the pageblock's migratetype instead. I hate that, but
> at least it's not on the fast path...
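
That fix idea could be sketched as a guard like the following
(hypothetical helper, not a posted patch):

/* A strayed RESERVE page may sit on another migratetype's free_list;
 * never let it justify converting the whole RESERVE pageblock.
 */
static bool can_steal_pageblock(struct page *page)
{
        return get_pageblock_migratetype(page) != MIGRATE_RESERVE;
}
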
> 
> >Thanks,
> >Sasha
> >
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-07  1:33                     ` Minchan Kim
@ 2014-05-07 14:59                       ` Vlastimil Babka
  -1 siblings, 0 replies; 52+ messages in thread
From: Vlastimil Babka @ 2014-05-07 14:59 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: Sasha Levin, Joonsoo Kim, Bartlomiej Zolnierkiewicz,
	linux-kernel, linux-mm, Mel Gorman, Yong-Taek Lee,
	KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins, Rik van Riel,
	Michal Nazarewicz, Dave Jones

On 05/07/2014 03:33 AM, Minchan Kim wrote:
> On Mon, May 05, 2014 at 05:50:46PM +0200, Vlastimil Babka wrote:
>> On 05/05/2014 04:36 PM, Sasha Levin wrote:
>>> On 05/02/2014 08:08 AM, Vlastimil Babka wrote:
>>>> On 04/30/2014 11:46 PM, Sasha Levin wrote:
>>>>>> On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
>>>>>>>> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
>>>>>>>> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
>>>>>>>> pageblock might be changed to other migratetype in try_to_steal_freepages().
>>>>>>>> For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
>>>>>>>> they could get allocated as unmovable and result in CMA failure.
>>>>>>>>
>>>>>>>> This is ensured by setting the freepage_migratetype appropriately when placing
>>>>>>>> pages on pcp lists, and using the information when releasing them back to
>>>>>>>> free_list. It is also assumed that CMA and RESERVE pageblocks are created only
>>>>>>>> in the init phase. This patch adds DEBUG_VM checks to catch any regressions
>>>>>>>> introduced for this invariant.
>>>>>>>>
>>>>>>>> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
>>>>>>>> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
>>>>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>>>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>>>>>> Cc: Minchan Kim <minchan@kernel.org>
>>>>>>>> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>>>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>>>>> Cc: Hugh Dickins <hughd@google.com>
>>>>>>>> Cc: Rik van Riel <riel@redhat.com>
>>>>>>>> Cc: Michal Nazarewicz <mina86@mina86.com>
>>>>>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>>>>>
>>>>>> Two issues with this patch.
>>>>>>
>>>>>> First:
>>>>>>
>>>>>> [ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
>>>>>> [ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>>>>> [ 3446.320082] Dumping ftrace buffer:
>>>>>> [ 3446.320082]    (ftrace buffer empty)
>>>>>> [ 3446.320082] Modules linked in:
>>>>>> [ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
>>>>>> [ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
>>>>>> [ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>>>>>> [ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
>>>>>> [ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
>>>>>> [ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
>>>>>> [ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
>>>>>> [ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
>>>>>> [ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
>>>>>> [ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
>>>>>> [ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>> [ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
>>>>>> [ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
>>>>>> [ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
>>>>>> [ 3446.335888] Stack:
>>>>>> [ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
>>>>>> [ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
>>>>>> [ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
>>>>>> [ 3446.335888] Call Trace:
>>>>>> [ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
>>>>>> [ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
>>>>>> [ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
>>>>>> [ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
>>>>>> [ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
>>>>>> [ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
>>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:979)
>>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:940)
>>>>>> [ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
>>>>>> [ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
>>>>>> [ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
>>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>>>>>> [ 3446.335888] shmem_fault (mm/shmem.c:1237)
>>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>>>>>> [ 3446.335888] __do_fault (mm/memory.c:3344)
>>>>>> [ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>>>>>> [ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
>>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>>>>>> [ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
>>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>>>>>> [ 3446.335888] handle_mm_fault (mm/memory.c:3973)
>>>>>> [ 3446.335888] __get_user_pages (mm/memory.c:1863)
>>>>>> [ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
>>>>>> [ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
>>>>>> [ 3446.335888] __mm_populate (mm/mlock.c:711)
>>>>>> [ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
>>>>>> [ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
>>>>>> [ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
>>>>>> [ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
>>>>>> [ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
>>>>>> [ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>>>>>> [ 3446.335888]  RSP <ffff88053e247778>
>>>> Hey, that's not an issue, that means the patch works as intended :) And
>>>> I believe it's not a bug introduced by PATCH 1/2.
>>>>
>>>> So, according to my decodecode reading, RAX is the result of
>>>> get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
>>>> of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
>>>> freepage_migratetype has just been set either by __rmqueue_smallest() or
>>>> __rmqueue_fallback(), according to the free_list the page has been taken
>>>> from. So this looks like a page from a MIGRATE_RESERVE pageblock found on
>>>> a !MIGRATE_RESERVE free_list, which is exactly what the patch intends
>>>> to catch.
>>>>
>>>> I think there are two possible explanations.
>>>>
>>>> 1) the pageblock is genuinely MIGRATE_RESERVE and it was misplaced by
>>>> mistake. I think it wasn't in free_pcppages_bulk(), as there's the same
>>>> VM_BUG_ON there, which would presumably trigger at the moment of the
>>>> misplacement. In theory it's possible that there's a race through
>>>> __free_pages_ok() -> free_one_page() where the
>>>> get_pageblock_migratetype() in __free_pages_ok() would race with
>>>> set_pageblock_migratetype() and result in a bogus value. But nobody
>>>> should be calling set_pageblock_migratetype() on a MIGRATE_RESERVE
>>>> pageblock.
>>>>
>>>> 2) the pageblock was marked as MIGRATE_RESERVE due to a race between
>>>> set_pageblock_migratetype() and set_pageblock_skip(). The latter is
>>>> currently not serialized by zone->lock, nor does it use an atomic bit
>>>> set, so it may result in lost updates when racing with
>>>> set_pageblock_migratetype(). I think a well-placed race when changing a
>>>> pageblock from MIGRATE_MOVABLE to MIGRATE_RECLAIMABLE could result in
>>>> the MIGRATE_RESERVE value. Similar races have already been observed to
>>>> be a problem where frequent changing to/from MIGRATE_ISOLATE is
>>>> involved, and I did a patch series to address this, but it was not
>>>> complete and I postponed it until after Mel's changes that remove the
>>>> racy for-cycles completely. So it might be that his "[PATCH 08/17] mm:
>>>> page_alloc: Use word-based accesses for get/set pageblock bitmaps"
>>>> already solves this bug (but maybe only on certain architectures where
>>>> you don't need atomic operations). You might try that patch if you can
>>>> reproduce this bug frequently enough?
>>>
>>> I've tried that patch, but still see the same BUG_ON.
>>
>> Oh damn, I've realized that my assumptions about MIGRATE_RESERVE
>> pageblocks being created only at zone init time were wrong.
>> setup_zone_migrate_reserve() is also called from the handler of the
>> min_free_kbytes sysctl... does trinity try to change that while
>> running?
>> The function will change MOVABLE pageblocks to RESERVE and try to
>> move all free pages to the RESERVE free_list, but of course pages on
>> pcplists will remain MOVABLE and may trigger the VM_BUG_ON. You
>> triggered the bug with a page on a MOVABLE free_list (in the first
>> reply I said it was UNMOVABLE by mistake), so this might be a good
>> explanation if trinity changes min_free_kbytes.
>>
>> Furthermore, I think there's a problem that
>> setup_zone_migrate_reserve() operates on pageblocks, but as MAX_ORDER
>> is higher than pageblock_order, RESERVE pages might be merged with
>> buddies of different migratetype and end up on their free_list. That
>> seems to me like a flaw in the design of reserves, but perhaps
>> others won't think it's serious enough to fix?
>>
>> So in the end this DEBUG_VM check probably cannot work anymore for
>> MIGRATE_RESERVE, only for CMA. I'm not sure if it's worth keeping it
>> only for CMA, what are the CMA guys' opinions on that?
> 
> I really don't want it. That is why I didn't add my Acked-by at the time.
> For a long time, I have never wanted to add more overhead to the hot path
> for CMA unless it's really critical, and the same applies here.
> Although such a debug patch helps us notice that something has gone wrong
> for CMA, more information would be needed to know why CMA failed, because
> there are other potential reasons for a CMA allocation to fail.
> 
> One idea is to store an allocation trace somewhere (e.g. a naive idea is
> a per-page description, like page-owner); then we could look up the owner
> of a given page and learn why we failed to migrate it out. With that, we
> would figure out how on earth such a page got allocated from CMA, which
> would be more helpful than just a VM_BUG_ON notice.
> 
> The whole point is that I'd like to avoid adding more overhead to the hot
> path for a rare case, even though it's a debugging feature.

OK. I'm not that concerned with the DEBUG_VM overhead, as it's intended
for testing, not production. But as you say, the patch is not that useful
without the MIGRATE_RESERVE part, so, Andrew, could you please drop the
patch (mm-page_alloc-debug_vm-checks-for-free_list-placement-of-cma-and-reserve-pages.patch)?

It would also be nice to change the commit log of the patch
(mm-page_alloc-prevent-migrate_reserve-pages-from-being-misplaced.patch)
to reflect the recent findings. I will send another patch to deal with
MIGRATE_RESERVE pageblock stealing, as we clearly cannot prevent
misplaced MIGRATE_RESERVE pages when MIGRATE_RESERVE pageblocks are
created dynamically through the min_free_kbytes sysctl.

Thanks.

---8<---
From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm/page_alloc: prevent rmqueue_bulk misplacing MIGRATE_RESERVE pages

For MIGRATE_RESERVE pages, it is useful that they do not get misplaced on the
free_list of another migratetype; otherwise they might get allocated prematurely
and e.g. fragment the MIGRATE_RESERVE pageblocks. While this cannot be avoided
completely when allocating new MIGRATE_RESERVE pageblocks in the min_free_kbytes
sysctl handler, we should prevent the misplacement where possible.

Currently, it is possible for the misplacement to happen when a MIGRATE_RESERVE
page is allocated on pcplist through rmqueue_bulk() as a fallback for other
desired migratetype, and then later freed back through free_pcppages_bulk()
without being actually used. This happens because free_pcppages_bulk() uses
get_freepage_migratetype() to choose the free_list, and rmqueue_bulk() calls
set_freepage_migratetype() with the *desired* migratetype and not the page's
original MIGRATE_RESERVE migratetype.

This patch fixes the problem by moving the call to set_freepage_migratetype()
from rmqueue_bulk() down to __rmqueue_smallest() and __rmqueue_fallback() where
the actual page's migratetype (i.e. the free_list the page is taken from)
is used. Note that this migratetype might be different from the pageblock's
migratetype due to freepage stealing decisions. This is OK, as page stealing
never uses MIGRATE_RESERVE as a fallback, and also takes care to leave all
MIGRATE_CMA pages on the correct freelist.

Therefore, as an additional benefit, the call to get_pageblock_migratetype()
from rmqueue_bulk() when CMA is enabled can be removed completely.  This
relies on the fact that MIGRATE_CMA pageblocks are created only during system
init, and on the reasoning above. The related is_migrate_isolate() check is
also unnecessary, as memory isolation has other ways to move pages between
freelists, and to drain pcp lists containing pages that should be isolated.
buffered_rmqueue() can also benefit from calling get_freepage_migratetype()
instead of get_pageblock_migratetype().

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
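
The direction of the change can be sketched as follows (a simplified
reconstruction for illustration, not the actual patch hunk):

/* Record, at removal time, which free_list the page actually came
 * from, so that free_pcppages_bulk() later returns it to that list.
 */
static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
				       int migratetype)
{
        unsigned int current_order;
        struct free_area *area;
        struct page *page;

        for (current_order = order; current_order < MAX_ORDER; ++current_order) {
                area = &zone->free_area[current_order];
                if (list_empty(&area->free_list[migratetype]))
                        continue;

                page = list_entry(area->free_list[migratetype].next,
                                  struct page, lru);
                list_del(&page->lru);
                rmv_page_order(page);
                area->nr_free--;
                expand(zone, page, order, current_order, area, migratetype);
                /* the page provably came off this migratetype's free_list */
                set_freepage_migratetype(page, migratetype);
                return page;
        }

        return NULL;
}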

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
@ 2014-05-07 14:59                       ` Vlastimil Babka
  0 siblings, 0 replies; 52+ messages in thread
From: Vlastimil Babka @ 2014-05-07 14:59 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: Sasha Levin, Joonsoo Kim, Bartlomiej Zolnierkiewicz,
	linux-kernel, linux-mm, Mel Gorman, Yong-Taek Lee,
	KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins, Rik van Riel,
	Michal Nazarewicz, Dave Jones

On 05/07/2014 03:33 AM, Minchan Kim wrote:
> On Mon, May 05, 2014 at 05:50:46PM +0200, Vlastimil Babka wrote:
>> On 05/05/2014 04:36 PM, Sasha Levin wrote:
>>> On 05/02/2014 08:08 AM, Vlastimil Babka wrote:
>>>> On 04/30/2014 11:46 PM, Sasha Levin wrote:
>>>>>> On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
>>>>>>>> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
>>>>>>>> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
>>>>>>>> pageblock might be changed to other migratetype in try_to_steal_freepages().
>>>>>>>> For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
>>>>>>>> they could get allocated as unmovable and result in CMA failure.
>>>>>>>>
>>>>>>>> This is ensured by setting the freepage_migratetype appropriately when placing
>>>>>>>> pages on pcp lists, and using the information when releasing them back to
>>>>>>>> free_list. It is also assumed that CMA and RESERVE pageblocks are created only
>>>>>>>> in the init phase. This patch adds DEBUG_VM checks to catch any regressions
>>>>>>>> introduced for this invariant.
>>>>>>>>
>>>>>>>> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
>>>>>>>> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
>>>>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>>>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>>>>>> Cc: Minchan Kim <minchan@kernel.org>
>>>>>>>> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>>>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>>>>> Cc: Hugh Dickins <hughd@google.com>
>>>>>>>> Cc: Rik van Riel <riel@redhat.com>
>>>>>>>> Cc: Michal Nazarewicz <mina86@mina86.com>
>>>>>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>>>>>
>>>>>> Two issues with this patch.
>>>>>>
>>>>>> First:
>>>>>>
>>>>>> [ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
>>>>>> [ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>>>>> [ 3446.320082] Dumping ftrace buffer:
>>>>>> [ 3446.320082]    (ftrace buffer empty)
>>>>>> [ 3446.320082] Modules linked in:
>>>>>> [ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
>>>>>> [ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
>>>>>> [ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>>>>>> [ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
>>>>>> [ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
>>>>>> [ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
>>>>>> [ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
>>>>>> [ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
>>>>>> [ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
>>>>>> [ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
>>>>>> [ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>> [ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
>>>>>> [ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
>>>>>> [ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
>>>>>> [ 3446.335888] Stack:
>>>>>> [ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
>>>>>> [ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
>>>>>> [ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
>>>>>> [ 3446.335888] Call Trace:
>>>>>> [ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
>>>>>> [ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
>>>>>> [ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
>>>>>> [ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
>>>>>> [ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
>>>>>> [ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
>>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:979)
>>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:940)
>>>>>> [ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
>>>>>> [ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
>>>>>> [ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
>>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>>>>>> [ 3446.335888] shmem_fault (mm/shmem.c:1237)
>>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>>>>>> [ 3446.335888] __do_fault (mm/memory.c:3344)
>>>>>> [ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>>>>>> [ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
>>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>>>>>> [ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
>>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>>>>>> [ 3446.335888] handle_mm_fault (mm/memory.c:3973)
>>>>>> [ 3446.335888] __get_user_pages (mm/memory.c:1863)
>>>>>> [ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
>>>>>> [ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
>>>>>> [ 3446.335888] __mm_populate (mm/mlock.c:711)
>>>>>> [ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
>>>>>> [ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
>>>>>> [ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
>>>>>> [ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
>>>>>> [ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
>>>>>> [ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>>>>>> [ 3446.335888]  RSP <ffff88053e247778>
>>>> Hey, that's not an issue, that means the patch works as intended :) And
>>>> I believe it's not a bug introduced by PATCH 1/2.
>>>>
>>>> So, according to my decodecode reading, RAX is the result of
>>>> get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
>>>> of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
>>>> freepage_migratetype has just been set either by __rmqueue_smallest() or
>>>> __rmqueue_fallback(), according to the free_list the page has been taken
>>>> from. So this looks like a page from a MIGRATE_RESERVE pageblock found
>>>> on a !MIGRATE_RESERVE free_list, which is exactly what the patch intends
>>>> to catch.
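
For reference, the check that fires here boils down to comparing those two
values; a minimal sketch of it (check_freepage_placement() is a made-up
name here, this is not the actual patch hunk):

	/* fire when a RESERVE/CMA page sits on a foreign free_list */
	static inline void check_freepage_placement(struct page *page)
	{
		int pb_mt = get_pageblock_migratetype(page);	/* RAX above */
		int fp_mt = get_freepage_migratetype(page);	/* RDX above */

		if (pb_mt == MIGRATE_RESERVE || is_migrate_cma(pb_mt))
			VM_BUG_ON(fp_mt != pb_mt);
	}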
>>>>
>>>> I think there are two possible explanations.
>>>>
>>>> 1) the pageblock is genuinely MIGRATE_RESERVE and it was misplaced by
>>>> mistake. I think it wasn't in free_pcppages_bulk() as there's the same
>>>> VM_BUG_ON which would supposedly trigger at the moment of misplacement.
>>>> In theory it's possible that there's a race through __free_pages_ok() ->
>>>> free_one_page() where the get_pageblock_migratetype() in
>>>> __free_pages_ok() would race with set_pageblock_migratetype() and result
>>>> in a bogus value. But nobody should be calling set_pageblock_migratetype()
>>>> on a MIGRATE_RESERVE pageblock.
>>>>
>>>> 2) the pageblock was marked as MIGRATE_RESERVE due to a race between
>>>> set_pageblock_migratetype() and set_pageblock_skip(). The latter is
>>>> currently not serialized by zone->lock, nor does it use atomic bit
>>>> operations, so it may result in lost updates from a racing
>>>> set_pageblock_migratetype(). I think a well-placed race when changing a
>>>> pageblock from MIGRATE_MOVABLE to MIGRATE_RECLAIMABLE could result in a
>>>> MIGRATE_RESERVE value. Similar races have already been observed to be a
>>>> problem where frequent changing to/from MIGRATE_ISOLATE is involved, and
>>>> I did a patch series to address this, but it was not complete and I
>>>> postponed it until after Mel's changes that remove the racy for-cycles
>>>> completely. So it might be that his "[PATCH 08/17] mm: page_alloc: Use
>>>> word-based accesses for get/set pageblock bitmaps" already solves this
>>>> bug (but maybe only on certain architectures where you don't need atomic
>>>> operations). You might try that patch if you can reproduce this bug
>>>> frequently enough?
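
To illustrate the lost update: both updaters perform a plain (non-atomic)
read-modify-write on the same word of the pageblock bitmap, roughly:

	/* schematic of the unsynchronized bitmap update */
	static void set_pageblock_bits(unsigned long *bitmap, int idx,
				       unsigned long mask, unsigned long bits)
	{
		unsigned long word = bitmap[idx];	/* read           */
		word &= ~mask;				/* clear old bits */
		word |= bits;				/* set new bits   */
		bitmap[idx] = word;			/* write back     */
	}

If set_pageblock_migratetype() and set_pageblock_skip() both read the word
before either writes it back, the second store silently discards the first
writer's bits, which can leave an arbitrary migratetype value behind.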
>>>
>>> I've tried that patch, but still see the same BUG_ON.
>>
>> Oh damn, I've realized that my assumption that MIGRATE_RESERVE
>> pageblocks are created only at zone init time was wrong.
>> setup_zone_migrate_reserve() is also called from the handler of the
>> min_free_kbytes sysctl... does trinity try to change that while
>> running?
>> The function will change MOVABLE pageblocks to RESERVE and try to
>> move all free pages to the RESERVE free_list, but of course pages on
>> pcplists will remain MOVABLE and may trigger the VM_BUG_ON. You
>> triggered the bug with a page on the MOVABLE free_list (in the first
>> reply I said it's UNMOVABLE by mistake), so this might be a good
>> explanation if trinity changes min_free_kbytes.
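
For those following along, the call path in question is roughly this
(function names from my reading of the v3.15-rc sources, so double-check):

	/* writing to /proc/sys/vm/min_free_kbytes ends up in: */
	min_free_kbytes_sysctl_handler()
	    -> setup_per_zone_wmarks()
	        -> __setup_per_zone_wmarks()
	            -> setup_zone_migrate_reserve()
	               /* may convert MOVABLE pageblocks to RESERVE at
	                * runtime, while pages with a stale MOVABLE
	                * freepage_migratetype still sit on pcplists */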
>>
>> Furthermore, I think there's a problem in that
>> setup_zone_migrate_reserve() operates on pageblocks, but as MAX_ORDER
>> is higher than pageblock_order, RESERVE pages might be merged with
>> buddies of a different migratetype and end up on their free_list. That
>> seems to me like a flaw in the design of reserves, but perhaps
>> others won't think it's serious enough to fix?
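
A concrete example of the merge problem, using the common x86 defaults of
pageblock_order == 9 and MAX_ORDER == 11 (illustrative only):

	/* the order-9 buddy of a page lies in the neighbouring pageblock */
	unsigned long buddy_idx = page_idx ^ (1UL << 9);

A free order-9 RESERVE pageblock can therefore merge with a MOVABLE
order-9 buddy into a single order-10 page, which then sits on one
free_list under a single migratetype, so the RESERVE half is effectively
lost from the reserve.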
>>
>> So in the end this DEBUG_VM check probably cannot work anymore for
>> MIGRATE_RESERVE, only for CMA. I'm not sure it's worth keeping it
>> only for CMA; what are the CMA guys' opinions on that?
> 
> I really don't want it. That's why I didn't add my Acked-by at that time.
> For a long time, I have never wanted to add more overhead to the hot path
> due to CMA unless it's really critical. The same applies here.
> Although such a debug patch helps to notice that something goes wrong for
> CMA, more information would be needed to know why CMA failed, because
> there are other potential reasons for a CMA allocation to fail.
> 
> One idea is to store an allocation trace somewhere (a naive idea is in the
> page description, like page-owner) and then we could investigate who the
> owner of that page is, so we could know why we fail to migrate it out.
> With that, we could figure out how on earth such a page was allocated from
> a CMA area, and it would be more helpful than just a VM_BUG_ON notice.
> 
> The whole point is that I'd like to avoid adding more overhead to the hot
> path for a rare case, even though it's a debugging feature.
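
For concreteness, the page-owner-like idea could look very roughly like
this (purely illustrative; the struct and its fields are hypothetical):

	/* recorded at allocation time, consulted when migration of a
	 * page out of a CMA area fails */
	struct page_alloc_trace {
		unsigned long	entries[8];	/* allocation stack   */
		unsigned int	nr_entries;
		gfp_t		gfp_mask;	/* what was requested */
	};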

OK. I'm not that concerned about the DEBUG_VM overhead, as it's intended
for testing, not production. But as you say, the patch is not that useful
without the MIGRATE_RESERVE part, so Andrew, could you please drop the
patch (mm-page_alloc-debug_vm-checks-for-free_list-placement-of-cma-and-reserve-pages.patch)?

It would also be nice to change the commit log of the patch
(mm-page_alloc-prevent-migrate_reserve-pages-from-being-misplaced.patch)
to reflect the recent findings. I will send another patch to deal with
MIGRATE_RESERVE pageblock stealing, as we clearly cannot prevent
misplaced MIGRATE_RESERVE pages when MIGRATE_RESERVE pageblocks are
created dynamically through the min_free_kbytes sysctl.

Thanks.

---8<---
From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm/page_alloc: prevent rmqueue_bulk misplacing MIGRATE_RESERVE pages

For MIGRATE_RESERVE pages, it is useful that they do not get misplaced on a
free_list of another migratetype, otherwise they might get allocated
prematurely and e.g. fragment the MIGRATE_RESERVE pageblocks. While this
cannot be avoided completely when allocating new MIGRATE_RESERVE pageblocks
in the min_free_kbytes sysctl handler, we should prevent the misplacement
where possible.

Currently, it is possible for the misplacement to happen when a
MIGRATE_RESERVE page is allocated onto a pcplist through rmqueue_bulk() as
a fallback for another desired migratetype, and then later freed back
through free_pcppages_bulk() without being actually used. This happens
because free_pcppages_bulk() uses get_freepage_migratetype() to choose the
free_list, and rmqueue_bulk() calls set_freepage_migratetype() with the
*desired* migratetype and not the page's original MIGRATE_RESERVE
migratetype.

This patch fixes the problem by moving the call to
set_freepage_migratetype() from rmqueue_bulk() down to __rmqueue_smallest()
and __rmqueue_fallback(), where the actual page's migratetype (i.e. the
free_list the page is taken from) is known. Note that this migratetype
might differ from the pageblock's migratetype due to freepage stealing
decisions. This is OK, as page stealing never uses MIGRATE_RESERVE as a
fallback, and also takes care to leave all MIGRATE_CMA pages on the
correct freelist.

Therefore, as an additional benefit, the call to
get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled can be
removed completely. This relies on the fact that MIGRATE_CMA pageblocks
are created only during system init, and on the invariant above. The
related is_migrate_isolate() check is also unnecessary, as memory
isolation has other ways to move pages between freelists and to drain pcp
lists containing pages that should be isolated. buffered_rmqueue() can
also benefit from calling get_freepage_migratetype() instead of
get_pageblock_migratetype().
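
In sketch form, the allocation path after this change looks as follows (a
schematic drawn from the description above, not the actual hunks):

	/* inside __rmqueue(zone, order, migratetype), schematically: */
	page = __rmqueue_smallest(zone, order, migratetype);
	if (page)
		/* tag with the free_list it actually came from */
		set_freepage_migratetype(page, migratetype);
	else
		/* the fallback path tags the page likewise, possibly with
		 * a migratetype that differs from the pageblock's */
		page = __rmqueue_fallback(zone, order, migratetype);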

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-07 14:59                       ` Vlastimil Babka
@ 2014-05-08  5:54                         ` Joonsoo Kim
  -1 siblings, 0 replies; 52+ messages in thread
From: Joonsoo Kim @ 2014-05-08  5:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Minchan Kim, Andrew Morton, Sasha Levin,
	Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm, Mel Gorman,
	Yong-Taek Lee, KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins,
	Rik van Riel, Michal Nazarewicz, Dave Jones

On Wed, May 07, 2014 at 04:59:07PM +0200, Vlastimil Babka wrote:
> On 05/07/2014 03:33 AM, Minchan Kim wrote:
> > On Mon, May 05, 2014 at 05:50:46PM +0200, Vlastimil Babka wrote:
> >> On 05/05/2014 04:36 PM, Sasha Levin wrote:
> >>> On 05/02/2014 08:08 AM, Vlastimil Babka wrote:
> >>>> On 04/30/2014 11:46 PM, Sasha Levin wrote:
> >>>>>> On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
> >>>>>>>> For the MIGRATE_RESERVE pages, it is important that they do not get
> >>>>>>>> misplaced on a free_list of another migratetype, otherwise the whole
> >>>>>>>> MIGRATE_RESERVE pageblock might be changed to another migratetype in
> >>>>>>>> try_to_steal_freepages(). For MIGRATE_CMA, the pages also must not go
> >>>>>>>> to a different free_list, otherwise they could get allocated as
> >>>>>>>> unmovable and result in CMA failure.
> >>>>>>>>
> >>>>>>>> This is ensured by setting the freepage_migratetype appropriately when
> >>>>>>>> placing pages on pcp lists, and using that information when releasing
> >>>>>>>> them back to a free_list. It is also assumed that CMA and RESERVE
> >>>>>>>> pageblocks are created only in the init phase. This patch adds DEBUG_VM
> >>>>>>>> checks to catch any regressions of this invariant.
> >>>>>>>>
> >>>>>>>> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
> >>>>>>>> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
> >>>>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >>>>>>>> Cc: Mel Gorman <mgorman@suse.de>
> >>>>>>>> Cc: Minchan Kim <minchan@kernel.org>
> >>>>>>>> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >>>>>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> >>>>>>>> Cc: Hugh Dickins <hughd@google.com>
> >>>>>>>> Cc: Rik van Riel <riel@redhat.com>
> >>>>>>>> Cc: Michal Nazarewicz <mina86@mina86.com>
> >>>>>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >>>>>>
> >>>>>> Two issues with this patch.
> >>>>>>
> >>>>>> First:
> >>>>>>
> >>>>>> [ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
> >>>>>> [ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> >>>>>> [ 3446.320082] Dumping ftrace buffer:
> >>>>>> [ 3446.320082]    (ftrace buffer empty)
> >>>>>> [ 3446.320082] Modules linked in:
> >>>>>> [ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
> >>>>>> [ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
> >>>>>> [ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
> >>>>>> [ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
> >>>>>> [ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
> >>>>>> [ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
> >>>>>> [ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
> >>>>>> [ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
> >>>>>> [ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
> >>>>>> [ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
> >>>>>> [ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>>>>> [ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
> >>>>>> [ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
> >>>>>> [ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
> >>>>>> [ 3446.335888] Stack:
> >>>>>> [ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
> >>>>>> [ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
> >>>>>> [ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
> >>>>>> [ 3446.335888] Call Trace:
> >>>>>> [ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
> >>>>>> [ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> >>>>>> [ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
> >>>>>> [ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
> >>>>>> [ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
> >>>>>> [ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
> >>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
> >>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> >>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:979)
> >>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:940)
> >>>>>> [ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
> >>>>>> [ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
> >>>>>> [ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
> >>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
> >>>>>> [ 3446.335888] shmem_fault (mm/shmem.c:1237)
> >>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
> >>>>>> [ 3446.335888] __do_fault (mm/memory.c:3344)
> >>>>>> [ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> >>>>>> [ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
> >>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
> >>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
> >>>>>> [ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
> >>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
> >>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> >>>>>> [ 3446.335888] handle_mm_fault (mm/memory.c:3973)
> >>>>>> [ 3446.335888] __get_user_pages (mm/memory.c:1863)
> >>>>>> [ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
> >>>>>> [ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
> >>>>>> [ 3446.335888] __mm_populate (mm/mlock.c:711)
> >>>>>> [ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
> >>>>>> [ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
> >>>>>> [ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
> >>>>>> [ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
> >>>>>> [ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
> >>>>>> [ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
> >>>>>> [ 3446.335888]  RSP <ffff88053e247778>
> >>>> Hey, that's not an issue, that means the patch works as intended :) And
> >>>> I believe it's not a bug introduced by PATCH 1/2.
> >>>>
> >>>> So, according to my decodecode reading, RAX is the result of
> >>>> get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
> >>>> of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
> >>>> freepage_migratetype has just been set either by __rmqueue_smallest() or
> >>>> __rmqueue_fallback(), according to the free_list the page has been taken
> >>>> from. So this looks like a page from a MIGRATE_RESERVE pageblock found
> >>>> on a !MIGRATE_RESERVE free_list, which is exactly what the patch intends
> >>>> to catch.
> >>>>
> >>>> I think there are two possible explanations.
> >>>>
> >>>> 1) the pageblock is genuinely MIGRATE_RESERVE and it was misplaced by
> >>>> mistake. I think it wasn't in free_pcppages_bulk() as there's the same
> >>>> VM_BUG_ON which would supposedly trigger at the moment of misplacement.
> >>>> In theory it's possible that there's a race through __free_pages_ok() ->
> >>>> free_one_page() where the get_pageblock_migratetype() in
> >>>> __free_pages_ok() would race with set_pageblock_migratetype() and result
> >>>> in a bogus value. But nobody should be calling set_pageblock_migratetype()
> >>>> on a MIGRATE_RESERVE pageblock.
> >>>>
> >>>> 2) the pageblock was marked as MIGRATE_RESERVE due to a race between
> >>>> set_pageblock_migratetype() and set_pageblock_skip(). The latter is
> >>>> currently not serialized by zone->lock, nor does it use atomic bit
> >>>> operations, so it may result in lost updates from a racing
> >>>> set_pageblock_migratetype(). I think a well-placed race when changing a
> >>>> pageblock from MIGRATE_MOVABLE to MIGRATE_RECLAIMABLE could result in a
> >>>> MIGRATE_RESERVE value. Similar races have already been observed to be a
> >>>> problem where frequent changing to/from MIGRATE_ISOLATE is involved, and
> >>>> I did a patch series to address this, but it was not complete and I
> >>>> postponed it until after Mel's changes that remove the racy for-cycles
> >>>> completely. So it might be that his "[PATCH 08/17] mm: page_alloc: Use
> >>>> word-based accesses for get/set pageblock bitmaps" already solves this
> >>>> bug (but maybe only on certain architectures where you don't need atomic
> >>>> operations). You might try that patch if you can reproduce this bug
> >>>> frequently enough?
> >>>
> >>> I've tried that patch, but still see the same BUG_ON.
> >>
> >> Oh damn, I've realized that my assumption that MIGRATE_RESERVE
> >> pageblocks are created only at zone init time was wrong.
> >> setup_zone_migrate_reserve() is also called from the handler of the
> >> min_free_kbytes sysctl... does trinity try to change that while
> >> running?
> >> The function will change MOVABLE pageblocks to RESERVE and try to
> >> move all free pages to the RESERVE free_list, but of course pages on
> >> pcplists will remain MOVABLE and may trigger the VM_BUG_ON. You
> >> triggered the bug with a page on the MOVABLE free_list (in the first
> >> reply I said it's UNMOVABLE by mistake), so this might be a good
> >> explanation if trinity changes min_free_kbytes.
> >>
> >> Furthermore, I think there's a problem in that
> >> setup_zone_migrate_reserve() operates on pageblocks, but as MAX_ORDER
> >> is higher than pageblock_order, RESERVE pages might be merged with
> >> buddies of a different migratetype and end up on their free_list. That
> >> seems to me like a flaw in the design of reserves, but perhaps
> >> others won't think it's serious enough to fix?

I'd like to know who wants MIGRATE_RESERVE. In my previous testing, the
one pageblock reserved for MIGRATE_RESERVE was merged with buddies of a
different migratetype during boot-up and never came back. But my system
works well. :)

If it is a really useful feature, we can fix the situation by aligning the
reserve size and pfn to MAX_ORDER_NR_PAGES.
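
A rough sketch of that alignment (hypothetical; in this sketch the reserve
is counted in pageblocks, as setup_zone_migrate_reserve() does today):

	/* round the reserve start and size to the buddy-merge unit, so a
	 * RESERVE page can never gain a buddy of a different migratetype */
	unsigned long unit = MAX_ORDER_NR_PAGES / pageblock_nr_pages;
	unsigned long start_pfn = ALIGN(zone->zone_start_pfn, MAX_ORDER_NR_PAGES);

	reserve = ALIGN(reserve, unit);		/* reserve in pageblocks */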

And, IMHO, it isn't reasonable to increase, decrease or move the
MIGRATE_RESERVE pageblocks from the handler of the min_free_kbytes sysctl,
because a new pageblock may have no free pages to reserve. I think that
it is better to prevent the sysctl handler from changing MIGRATE_RESERVE
pageblocks after initialization.

In conclusion, if MIGRATE_RESERVE is useful enough to maintain, fixing the
above problem and keeping this patch is preferable to me.

> >>
> >> So in the end this DEBUG_VM check probably cannot work anymore for
> >> MIGRATE_RESERVE, only for CMA. I'm not sure it's worth keeping it
> >> only for CMA; what are the CMA guys' opinions on that?
> > 
> > I really don't want it. That's why I didn't add my Acked-by at that time.
> > For a long time, I have never wanted to add more overhead to the hot path
> > due to CMA unless it's really critical. The same applies here.
> > Although such a debug patch helps to notice that something goes wrong for
> > CMA, more information would be needed to know why CMA failed, because
> > there are other potential reasons for a CMA allocation to fail.
> > 
> > One idea is to store an allocation trace somewhere (a naive idea is in the
> > page description, like page-owner) and then we could investigate who the
> > owner of that page is, so we could know why we fail to migrate it out.
> > With that, we could figure out how on earth such a page was allocated from
> > a CMA area, and it would be more helpful than just a VM_BUG_ON notice.
> > 
> > The whole point is that I'd like to avoid adding more overhead to the hot
> > path for a rare case, even though it's a debugging feature.
> 
> OK. I'm not that concerned about the DEBUG_VM overhead, as it's intended
> for testing, not production. But as you say, the patch is not that useful
> without the MIGRATE_RESERVE part, so Andrew, could you please drop the
> patch (mm-page_alloc-debug_vm-checks-for-free_list-placement-of-cma-and-reserve-pages.patch)?

I also think that the DEBUG_VM overhead isn't a problem, for the same
reason Vlastimil gave.

Thanks.

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-08  5:54                         ` Joonsoo Kim
@ 2014-05-08  6:19                           ` Minchan Kim
  -1 siblings, 0 replies; 52+ messages in thread
From: Minchan Kim @ 2014-05-08  6:19 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vlastimil Babka, Andrew Morton, Sasha Levin,
	Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm, Mel Gorman,
	Yong-Taek Lee, KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins,
	Rik van Riel, Michal Nazarewicz, Dave Jones

On Thu, May 08, 2014 at 02:54:21PM +0900, Joonsoo Kim wrote:
> On Wed, May 07, 2014 at 04:59:07PM +0200, Vlastimil Babka wrote:
> > On 05/07/2014 03:33 AM, Minchan Kim wrote:
> > > On Mon, May 05, 2014 at 05:50:46PM +0200, Vlastimil Babka wrote:
> > >> On 05/05/2014 04:36 PM, Sasha Levin wrote:
> > >>> On 05/02/2014 08:08 AM, Vlastimil Babka wrote:
> > >>>> On 04/30/2014 11:46 PM, Sasha Levin wrote:
> > >>>>>> On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
> > >>>>>>>> For the MIGRATE_RESERVE pages, it is important that they do not get
> > >>>>>>>> misplaced on a free_list of another migratetype, otherwise the whole
> > >>>>>>>> MIGRATE_RESERVE pageblock might be changed to another migratetype in
> > >>>>>>>> try_to_steal_freepages(). For MIGRATE_CMA, the pages also must not go
> > >>>>>>>> to a different free_list, otherwise they could get allocated as
> > >>>>>>>> unmovable and result in CMA failure.
> > >>>>>>>>
> > >>>>>>>> This is ensured by setting the freepage_migratetype appropriately when
> > >>>>>>>> placing pages on pcp lists, and using that information when releasing
> > >>>>>>>> them back to a free_list. It is also assumed that CMA and RESERVE
> > >>>>>>>> pageblocks are created only in the init phase. This patch adds DEBUG_VM
> > >>>>>>>> checks to catch any regressions of this invariant.
> > >>>>>>>>
> > >>>>>>>> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
> > >>>>>>>> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
> > >>>>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > >>>>>>>> Cc: Mel Gorman <mgorman@suse.de>
> > >>>>>>>> Cc: Minchan Kim <minchan@kernel.org>
> > >>>>>>>> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > >>>>>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > >>>>>>>> Cc: Hugh Dickins <hughd@google.com>
> > >>>>>>>> Cc: Rik van Riel <riel@redhat.com>
> > >>>>>>>> Cc: Michal Nazarewicz <mina86@mina86.com>
> > >>>>>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > >>>>>>
> > >>>>>> Two issues with this patch.
> > >>>>>>
> > >>>>>> First:
> > >>>>>>
> > >>>>>> [ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
> > >>>>>> [ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > >>>>>> [ 3446.320082] Dumping ftrace buffer:
> > >>>>>> [ 3446.320082]    (ftrace buffer empty)
> > >>>>>> [ 3446.320082] Modules linked in:
> > >>>>>> [ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
> > >>>>>> [ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
> > >>>>>> [ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
> > >>>>>> [ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
> > >>>>>> [ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
> > >>>>>> [ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
> > >>>>>> [ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
> > >>>>>> [ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
> > >>>>>> [ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
> > >>>>>> [ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
> > >>>>>> [ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > >>>>>> [ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
> > >>>>>> [ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
> > >>>>>> [ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
> > >>>>>> [ 3446.335888] Stack:
> > >>>>>> [ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
> > >>>>>> [ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
> > >>>>>> [ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
> > >>>>>> [ 3446.335888] Call Trace:
> > >>>>>> [ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
> > >>>>>> [ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> > >>>>>> [ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
> > >>>>>> [ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
> > >>>>>> [ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
> > >>>>>> [ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
> > >>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
> > >>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> > >>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:979)
> > >>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:940)
> > >>>>>> [ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
> > >>>>>> [ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
> > >>>>>> [ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
> > >>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
> > >>>>>> [ 3446.335888] shmem_fault (mm/shmem.c:1237)
> > >>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
> > >>>>>> [ 3446.335888] __do_fault (mm/memory.c:3344)
> > >>>>>> [ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> > >>>>>> [ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
> > >>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
> > >>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
> > >>>>>> [ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
> > >>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
> > >>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> > >>>>>> [ 3446.335888] handle_mm_fault (mm/memory.c:3973)
> > >>>>>> [ 3446.335888] __get_user_pages (mm/memory.c:1863)
> > >>>>>> [ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
> > >>>>>> [ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
> > >>>>>> [ 3446.335888] __mm_populate (mm/mlock.c:711)
> > >>>>>> [ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
> > >>>>>> [ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
> > >>>>>> [ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
> > >>>>>> [ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
> > >>>>>> [ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
> > >>>>>> [ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
> > >>>>>> [ 3446.335888]  RSP <ffff88053e247778>
> > >>>> Hey, that's not an issue, that means the patch works as intended :) And
> > >>>> I believe it's not a bug introduced by PATCH 1/2.
> > >>>>
> > >>>> So, according to my decodecode reading, RAX is the result of
> > >>>> get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
> > >>>> of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
> > >>>> freepage_migratetype has just been set either by __rmqueue_smallest() or
> > >>>> __rmqueue_fallback(), according to the free_list the page has been taken
> > >>>> from. So this looks like a page from a MIGRATE_RESERVE pageblock found
> > >>>> on a !MIGRATE_RESERVE free_list, which is exactly what the patch intends
> > >>>> to catch.
> > >>>>
> > >>>> I think there are two possible explanations.
> > >>>>
> > >>>> 1) the pageblock is genuinely MIGRATE_RESERVE and it was misplaced by
> > >>>> mistake. I think it wasn't in free_pcppages_bulk() as there's the same
> > >>>> VM_BUG_ON which would supposedly trigger at the moment of misplacement.
> > >>>> In theory it's possible that there's a race through __free_pages_ok() ->
> > >>>> free_one_page() where the get_pageblock_migratetype() in
> > >>>> __free_pages_ok() would race with set_pageblock_migratetype() and result
> > >>>> in a bogus value. But nobody should be calling set_pageblock_migratetype()
> > >>>> on a MIGRATE_RESERVE pageblock.
> > >>>>
> > >>>> 2) the pageblock was marked as MIGRATE_RESERVE due to a race between
> > >>>> set_pageblock_migratetype() and set_pageblock_skip(). The latter is
> > >>>> currently not serialized by zone->lock, nor does it use atomic bit
> > >>>> operations, so it may result in lost updates from a racing
> > >>>> set_pageblock_migratetype(). I think a well-placed race when changing a
> > >>>> pageblock from MIGRATE_MOVABLE to MIGRATE_RECLAIMABLE could result in a
> > >>>> MIGRATE_RESERVE value. Similar races have already been observed to be a
> > >>>> problem where frequent changing to/from MIGRATE_ISOLATE is involved, and
> > >>>> I did a patch series to address this, but it was not complete and I
> > >>>> postponed it until after Mel's changes that remove the racy for-cycles
> > >>>> completely. So it might be that his "[PATCH 08/17] mm: page_alloc: Use
> > >>>> word-based accesses for get/set pageblock bitmaps" already solves this
> > >>>> bug (but maybe only on certain architectures where you don't need atomic
> > >>>> operations). You might try that patch if you can reproduce this bug
> > >>>> frequently enough?
> > >>>
> > >>> I've tried that patch, but still see the same BUG_ON.
> > >>
> > >> Oh damn, I've realized that my assumption that MIGRATE_RESERVE
> > >> pageblocks are created only at zone init time was wrong.
> > >> setup_zone_migrate_reserve() is also called from the handler of the
> > >> min_free_kbytes sysctl... does trinity try to change that while
> > >> running?
> > >> The function will change MOVABLE pageblocks to RESERVE and try to
> > >> move all free pages to the RESERVE free_list, but of course pages on
> > >> pcplists will remain MOVABLE and may trigger the VM_BUG_ON. You
> > >> triggered the bug with a page on the MOVABLE free_list (in the first
> > >> reply I said it's UNMOVABLE by mistake), so this might be a good
> > >> explanation if trinity changes min_free_kbytes.
> > >>
> > >> Furthermore, I think there's a problem in that
> > >> setup_zone_migrate_reserve() operates on pageblocks, but as MAX_ORDER
> > >> is higher than pageblock_order, RESERVE pages might be merged with
> > >> buddies of a different migratetype and end up on their free_list. That
> > >> seems to me like a flaw in the design of reserves, but perhaps
> > >> others won't think it's serious enough to fix?
> 
> I'd like to know who wants MIGRATE_RESERVE. In my previous testing, the
> one pageblock reserved for MIGRATE_RESERVE was merged with buddies of a
> different migratetype during boot-up and never came back. But my system
> works well. :)

AFAIR, it was introduced for high-order atomic allocation, and Mel tested
it with a small-memory system and a wireless network device.

> 
> If it is a really useful feature, we can fix the situation by aligning the
> reserve size and pfn to MAX_ORDER_NR_PAGES.
> 
> And, IMHO, it isn't reasonable to increase, decrease or move the
> MIGRATE_RESERVE pageblocks from the handler of the min_free_kbytes sysctl,
> because a new pageblock may have no free pages to reserve. I think that
> it is better to prevent the sysctl handler from changing MIGRATE_RESERVE
> pageblocks after initialization.
> 
> In conclusion, if MIGRATE_RESERVE is useful enough to maintain, fixing the
> above problem and keeping this patch is preferable to me.
> 
> > >>
> > >> So in the end this DEBUG_VM check probably cannot work anymore for
> > >> MIGRATE_RESERVE, only for CMA. I'm not sure it's worth keeping it
> > >> only for CMA; what are the CMA guys' opinions on that?
> > > 
> > > I really don't want it. That's why I didn't add my Acked-by at that time.
> > > For a long time, I have never wanted to add more overhead to the hot path
> > > due to CMA unless it's really critical. The same applies here.
> > > Although such a debug patch helps to notice that something goes wrong for
> > > CMA, more information would be needed to know why CMA failed, because
> > > there are other potential reasons for a CMA allocation to fail.
> > > 
> > > One idea is to store an allocation trace somewhere (a naive idea is in the
> > > page description, like page-owner) and then we could investigate who the
> > > owner of that page is, so we could know why we fail to migrate it out.
> > > With that, we could figure out how on earth such a page was allocated from
> > > a CMA area, and it would be more helpful than just a VM_BUG_ON notice.
> > > 
> > > The whole point is that I'd like to avoid adding more overhead to the hot
> > > path for a rare case, even though it's a debugging feature.
> > 
> > OK. I'm not that concerned about the DEBUG_VM overhead, as it's intended
> > for testing, not production. But as you say, the patch is not that useful
> > without the MIGRATE_RESERVE part, so Andrew, could you please drop the
> > patch (mm-page_alloc-debug_vm-checks-for-free_list-placement-of-cma-and-reserve-pages.patch)?
> 
> I also think that the DEBUG_VM overhead isn't a problem, for the same
> reason Vlastimil gave.

Guys, please read this:

https://lkml.org/lkml/2013/7/17/591

If you guys really want it, we could separate it out under something like
CONFIG_DEBUG_CMA or CONFIG_DEBUG_RESERVE. Otherwise, just let it remain
in mmotm.
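
A sketch of such gating (CONFIG_DEBUG_CMA is hypothetical, it does not
exist today):

	/* compiled in only when the dedicated debug option is enabled */
	#ifdef CONFIG_DEBUG_CMA
		VM_BUG_ON(is_migrate_cma(get_pageblock_migratetype(page)) &&
			  get_freepage_migratetype(page) != MIGRATE_CMA);
	#endif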

> 
> Thanks.
> 

-- 
Kind regards,
Minchan Kim

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
@ 2014-05-08  6:19                           ` Minchan Kim
  0 siblings, 0 replies; 52+ messages in thread
From: Minchan Kim @ 2014-05-08  6:19 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vlastimil Babka, Andrew Morton, Sasha Levin,
	Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm, Mel Gorman,
	Yong-Taek Lee, KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins,
	Rik van Riel, Michal Nazarewicz, Dave Jones

On Thu, May 08, 2014 at 02:54:21PM +0900, Joonsoo Kim wrote:
> On Wed, May 07, 2014 at 04:59:07PM +0200, Vlastimil Babka wrote:
> > On 05/07/2014 03:33 AM, Minchan Kim wrote:
> > > On Mon, May 05, 2014 at 05:50:46PM +0200, Vlastimil Babka wrote:
> > >> On 05/05/2014 04:36 PM, Sasha Levin wrote:
> > >>> On 05/02/2014 08:08 AM, Vlastimil Babka wrote:
> > >>>> On 04/30/2014 11:46 PM, Sasha Levin wrote:
> > >>>>>> On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
> > >>>>>>>> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
> > >>>>>>>> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
> > >>>>>>>> pageblock might be changed to other migratetype in try_to_steal_freepages().
> > >>>>>>>> For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
> > >>>>>>>> they could get allocated as unmovable and result in CMA failure.
> > >>>>>>>>
> > >>>>>>>> This is ensured by setting the freepage_migratetype appropriately when placing
> > >>>>>>>> pages on pcp lists, and using the information when releasing them back to
> > >>>>>>>> free_list. It is also assumed that CMA and RESERVE pageblocks are created only
> > >>>>>>>> in the init phase. This patch adds DEBUG_VM checks to catch any regressions
> > >>>>>>>> introduced for this invariant.
> > >>>>>>>>
> > >>>>>>>> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
> > >>>>>>>> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
> > >>>>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > >>>>>>>> Cc: Mel Gorman <mgorman@suse.de>
> > >>>>>>>> Cc: Minchan Kim <minchan@kernel.org>
> > >>>>>>>> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > >>>>>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > >>>>>>>> Cc: Hugh Dickins <hughd@google.com>
> > >>>>>>>> Cc: Rik van Riel <riel@redhat.com>
> > >>>>>>>> Cc: Michal Nazarewicz <mina86@mina86.com>
> > >>>>>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > >>>>>>
> > >>>>>> Two issues with this patch.
> > >>>>>>
> > >>>>>> First:
> > >>>>>>
> > >>>>>> [ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
> > >>>>>> [ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > >>>>>> [ 3446.320082] Dumping ftrace buffer:
> > >>>>>> [ 3446.320082]    (ftrace buffer empty)
> > >>>>>> [ 3446.320082] Modules linked in:
> > >>>>>> [ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
> > >>>>>> [ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
> > >>>>>> [ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
> > >>>>>> [ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
> > >>>>>> [ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
> > >>>>>> [ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
> > >>>>>> [ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
> > >>>>>> [ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
> > >>>>>> [ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
> > >>>>>> [ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
> > >>>>>> [ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > >>>>>> [ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
> > >>>>>> [ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
> > >>>>>> [ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
> > >>>>>> [ 3446.335888] Stack:
> > >>>>>> [ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
> > >>>>>> [ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
> > >>>>>> [ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
> > >>>>>> [ 3446.335888] Call Trace:
> > >>>>>> [ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
> > >>>>>> [ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> > >>>>>> [ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
> > >>>>>> [ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
> > >>>>>> [ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
> > >>>>>> [ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
> > >>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
> > >>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> > >>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:979)
> > >>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:940)
> > >>>>>> [ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
> > >>>>>> [ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
> > >>>>>> [ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
> > >>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
> > >>>>>> [ 3446.335888] shmem_fault (mm/shmem.c:1237)
> > >>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
> > >>>>>> [ 3446.335888] __do_fault (mm/memory.c:3344)
> > >>>>>> [ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> > >>>>>> [ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
> > >>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
> > >>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
> > >>>>>> [ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
> > >>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
> > >>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> > >>>>>> [ 3446.335888] handle_mm_fault (mm/memory.c:3973)
> > >>>>>> [ 3446.335888] __get_user_pages (mm/memory.c:1863)
> > >>>>>> [ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
> > >>>>>> [ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
> > >>>>>> [ 3446.335888] __mm_populate (mm/mlock.c:711)
> > >>>>>> [ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
> > >>>>>> [ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
> > >>>>>> [ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
> > >>>>>> [ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
> > >>>>>> [ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
> > >>>>>> [ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
> > >>>>>> [ 3446.335888]  RSP <ffff88053e247778>
> > >>>> Hey, that's not an issue, that means the patch works as intended :) And
> > >>>> I believe it's not a bug introduced by PATCH 1/2.
> > >>>>
> > >>>> So, according to my decodecode reading, RAX is the results of
> > >>>> get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
> > >>>> of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
> > >>>> freepage_migratetype has just been set either by __rmqueue_smallest() or
> > >>>> __rmqueue_fallback(), according to the free_list the page has been taken
> > >>>> from. So this looks like a page from MIGRATE_RESERVE pageblock found on
> > >>>> the !MIGRATE_RESERVE free_list, which is exactly what the patch intends
> > >>>> to catch.
> > >>>>
> > >>>> I think there are two possible explanations.
> > >>>>
> > >>>> 1) the pageblock is genuinely MIGRATE_RESERVE and it was misplaced by
> > >>>> mistake. I think it wasn't in free_pcppages_bulk() as there's the same
> > >>>> VM_BUG_ON which would supposedly trigger at the moment of displacing. In
> > >>>> theory it's possible that there's a race through __free_pages_ok() ->
> > >>>> free_one_page() where the get_pageblock_migratetype() in
> > >>>> __free_pages_ok() would race with set_pageblock_migratetype() and result
> > >>>> in bogus value. But nobody should be calling set_pageblock_migratetype()
> > >>>> on a MIGRATE_RESERVE pageblock.
> > >>>>
> > >>>> 2) the pageblock was marked as MIGRATE_RESERVE due to a race between
> > >>>> set_pageblock_migratetype() and set_pageblock_skip(). The latter is
> > >>>> currently not serialized by zone->lock, nor does it use an atomic bit set. So
> > >>>> it may result in lost updates in a racing set_pageblock_migratetype(). I
> > >>>> think a well-placed race when changing pageblock from MIGRATE_MOVABLE to
> > >>>> MIGRATE_RECLAIMABLE could result in MIGRATE_RESERVE value. Similar races
> > >>>> have been already observed to be a problem where frequent changing
> > >>>> to/from MIGRATE_ISOLATE is involved, and I did a patch series to address
> > >>>> this, but it was not complete and I postponed it after Mel's changes
> > >>>> that remove the racy for-cycles completely. So it might be that his
> > >>>> "[PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set
> > >>>> pageblock bitmaps" already solves this bug (but maybe only on certain
> > >>>> architectures where you don't need atomic operations). You might try
> > >>>> that patch if you can reproduce this bug frequently enough?
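
To see how a torn multi-bit update can momentarily decode as
MIGRATE_RESERVE, here is a small user-space sketch (the migratetype
encoding 0-3 and the bit-at-a-time store are assumptions modelled on the
3.15-era pageblock bitmap code):

#include <stdio.h>

enum { UNMOVABLE, RECLAIMABLE, MOVABLE, RESERVE };	/* assumed encoding */

static unsigned int pb_bits;	/* the two migratetype bits of a pageblock */

/* Store the two bits one at a time, as the pre-word-based code did. */
static void set_migratetype_bit_by_bit(unsigned int mt)
{
	for (int bit = 0; bit < 2; bit++) {
		if (mt & (1u << bit))
			pb_bits |= 1u << bit;
		else
			pb_bits &= ~(1u << bit);
		/* a racing reader that samples here sees a torn value */
		if (pb_bits == RESERVE)
			printf("transient state decodes as MIGRATE_RESERVE\n");
	}
}

int main(void)
{
	pb_bits = MOVABLE;			 /* 0b10 */
	set_migratetype_bit_by_bit(RECLAIMABLE); /* 0b01, via 0b11 */
	return 0;
}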
> > >>>
> > >>> I've tried that patch, but still see the same BUG_ON.
> > >>
> > >> Oh damn, I've realized that my assumptions about MIGRATE_RESERVE
> > >> pageblocks being created only on zone init time were wrong.
> > >> setup_zone_migrate_reserve() is called also from the handler of
> > >> min_free_kbytes sysctl... does trinity try to change that while
> > >> running?
> > >> The function will change MOVABLE pageblocks to RESERVE and try to
> > >> move all free pages to the RESERVE free_list, but of course pages on
> > >> pcplists will remain MOVABLE and may trigger the VM_BUG_ON. You
> > >> triggered the bug with page on MOVABLE free_list (in the first reply
> > >> I said it's UNMOVABLE by mistake), so this might be a good explanation
> > >> if trinity changes min_free_kbytes.
> > >>
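A hypothetical (untested) way to narrow that window would be to flush the
per-cpu lists before retyping a pageblock; it cannot fully close the race,
since pages can re-enter the pcplists right afterwards:

/* Hypothetical sketch, not a real patch; the helpers are the 3.15-era
 * kernel functions. */
static void retype_block_to_reserve(struct zone *zone, struct page *page)
{
	drain_all_pages();	/* push pcplist pages back to the free lists */
	set_pageblock_migratetype(page, MIGRATE_RESERVE);
	move_freepages_block(zone, page, MIGRATE_RESERVE);
}
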
> > >> Furthermore, I think there's a problem that
> > >> setup_zone_migrate_reserve() operates on pageblocks, but as MAX_ORDER
> > >> is higher than pageblock_order, RESERVE pages might be merged with
> > >> buddies of different migratetype and end up on their free_list. That
> > >> seems to me like a flaw in the design of reserves, but perhaps
> > >> others won't think it's serious enough to fix?
> 
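For concreteness, with typical x86_64 values (pageblock_order = 9 and
MAX_ORDER = 11, both assumptions), the largest buddy spans two pageblocks,
which is what lets a wholly free RESERVE pageblock merge with a neighbour
of another migratetype:

#include <stdio.h>

#define PAGEBLOCK_ORDER	9	/* assumed typical x86_64 value */
#define MAX_ORDER	11	/* assumed default */

int main(void)
{
	unsigned long pb_pages  = 1UL << PAGEBLOCK_ORDER;	/* 512  */
	unsigned long max_buddy = 1UL << (MAX_ORDER - 1);	/* 1024 */

	/*
	 * The buddy allocator merges on pfn alignment alone; the merged
	 * order-10 page lands on whatever free_list the freeing path
	 * chose, regardless of either pageblock's migratetype.
	 */
	printf("order-%d buddy = %lu pages = %lu pageblocks\n",
	       MAX_ORDER - 1, max_buddy, max_buddy / pb_pages);
	return 0;
}
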
> I want to know who wants MIGRATE_RESERVE. In my previous testing, the one
> pageblock for MIGRATE_RESERVE was merged with buddies of a different
> migratetype during boot-up and never came back. But my system works
> well. :)

AFAIR, it was introduced for high-order atomic allocations, and Mel tested
it with a small-memory system and a wireless network device.

> 
> If it is a really useful feature, we can fix the situation by aligning
> the reserve size and pfn to MAX_ORDER_NR_PAGES.
> 
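A rough sketch of that alignment idea (hypothetical; roundup() and
MAX_ORDER_NR_PAGES are real kernel helpers, the functions themselves are
invented):

/* Place and size the reserve in MAX_ORDER_NR_PAGES units so its pages
 * can never merge with a buddy outside the reserve. */
static unsigned long reserve_start_pfn(struct zone *zone)
{
	return roundup(zone->zone_start_pfn, MAX_ORDER_NR_PAGES);
}

static unsigned long reserve_nr_pages(unsigned long reserve_pages)
{
	return roundup(reserve_pages, MAX_ORDER_NR_PAGES);
}
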
> And, IMHO, it isn't reasonable to increase, decrease or change
> MIGRATE_RESERVE pageblocks from the handler of the min_free_kbytes sysctl,
> because a new pageblock may have no free pages to reserve. I think that
> it is better to prevent the sysctl handler from changing MIGRATE_RESERVE
> pageblocks after initialization.
> 
> In conclusion, if MIGRATE_RESERVE is useful enough to maintain, fixing
> the above problem and keeping this patch is preferable to me.
> 
> > >>
> > >> So in the end this VM_DEBUG check probably cannot work anymore for
> > >> MIGRATE_RESERVE, only for CMA. I'm not sure if it's worth keeping it
> > >> only for CMA, what are the CMA guys' opinions on that?
> > > 
> > > I really don't want it. That is why I didn't add my Acked-by at that time.
> > > For a long time, I have not wanted to add more overhead to the hot path for
> > > CMA unless it's really critical. The same applies here.
> > > Although such a debug patch helps to notice that something goes wrong for CMA,
> > > more information would be needed to know why CMA failed, because
> > > there are other potential reasons for a CMA allocation to fail.
> > > 
> > > One idea is to store the allocation trace somewhere (a naive idea is in the
> > > page descriptor, like page-owner) and then we could investigate
> > > who the owner of that page is, so we could know why we failed to migrate it out.
> > > With that, we would figure out how on earth such a page was allocated from CMA,
> > > which would be more helpful than just a VM_BUG_ON notice. (A sketch follows below.)
> > > 
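A rough sketch of that idea (hypothetical; loosely modelled on the
page-owner debug patch, using the real save_stack_trace() API, with an
invented storage record):

/* Hypothetical: record who allocated each page, so a failed CMA
 * migration can be traced back to its owner. */
struct page_owner_rec {
	unsigned int order;
	gfp_t gfp_mask;
	unsigned long entries[8];
	unsigned int nr_entries;
};

static void record_page_owner(struct page_owner_rec *rec,
			      unsigned int order, gfp_t gfp_mask)
{
	struct stack_trace trace = {
		.max_entries	= ARRAY_SIZE(rec->entries),
		.entries	= rec->entries,
		.skip		= 2,	/* skip the allocator frames */
	};

	save_stack_trace(&trace);
	rec->nr_entries	= trace.nr_entries;
	rec->order	= order;
	rec->gfp_mask	= gfp_mask;
}
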
> > > The whole point is that I'd like to avoid adding more overhead to the hot path
> > > for a rare case, even though it's a debugging feature.
> > 
> > OK. I'm not that concerned with VM_DEBUG overhead as it's intended for
> > testing, not production. But as you say the patch is not that useful
> > without the MIGRATE_RESERVE part, so Andrew, could you please drop the
> > patch (mm-page_alloc-debug_vm-checks-for-free_list-placement-of-cma-and-reserve-pages.patch)?
> 
> I also think that the VM_DEBUG overhead isn't a problem, for the same
> reason Vlastimil gave.

Guys, please read this.

https://lkml.org/lkml/2013/7/17/591

If you guys really want it, we could gate it behind something like
CONFIG_DEBUG_CMA or CONFIG_DEBUG_RESERVE.
Otherwise, it should just remain in mmotm.

> 
> Thanks.
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-08  5:54                         ` Joonsoo Kim
@ 2014-05-08  8:51                           ` Mel Gorman
  -1 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2014-05-08  8:51 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vlastimil Babka, Minchan Kim, Andrew Morton, Sasha Levin,
	Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm, Yong-Taek Lee,
	KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins, Rik van Riel,
	Michal Nazarewicz, Dave Jones

On Thu, May 08, 2014 at 02:54:21PM +0900, Joonsoo Kim wrote:
> > >> Furthermore, I think there's a problem that
> > >> setup_zone_migrate_reserve() operates on pageblocks, but as MAX_ORDER
> > >> is higher than pageblock_order, RESERVE pages might be merged with
> > >> buddies of different migratetype and end up on their free_list. That
> > >> seems to me like a flaw in the design of reserves, but perhaps
> > >> others won't think it's serious enough to fix?
> 
> I want to know who wants MIGRATE_RESERVE. In my previous testing, the one
> pageblock for MIGRATE_RESERVE was merged with buddies of a different
> migratetype during boot-up and never came back. But my system works
> well. :)
> 

It's important for short-lived high-order atomic allocations.
MIGRATE_RESERVE preserves a property the buddy allocator had before
fragmentation avoidance was merged. Most users will not notice, as not
many drivers depend on these allocations working. If the reserves are
getting destroyed at boot-up, it's a bug.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-08  6:19                           ` Minchan Kim
@ 2014-05-08 22:34                             ` Andrew Morton
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrew Morton @ 2014-05-08 22:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Joonsoo Kim, Vlastimil Babka, Sasha Levin,
	Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm, Mel Gorman,
	Yong-Taek Lee, KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins,
	Rik van Riel, Michal Nazarewicz, Dave Jones

On Thu, 8 May 2014 15:19:37 +0900 Minchan Kim <minchan@kernel.org> wrote:

> > I also think that the VM_DEBUG overhead isn't a problem, for the same
> > reason Vlastimil gave.
> 
> Guys, please read this.
> 
> https://lkml.org/lkml/2013/7/17/591
> 
> If you guys really want it, we could gate it behind something like
> CONFIG_DEBUG_CMA or CONFIG_DEBUG_RESERVE.
> Otherwise, it should just remain in mmotm.

Wise words, those.

Yes, these checks are in a pretty hot path.  I'm inclined to make the
patch -mm (and -next) only.

Unless there's a really good reason, such as "nobody who uses CMA is
likely to be testing -next", which sounds likely :(


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-08  5:54                         ` Joonsoo Kim
@ 2014-05-12  8:28                           ` Vlastimil Babka
  -1 siblings, 0 replies; 52+ messages in thread
From: Vlastimil Babka @ 2014-05-12  8:28 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Minchan Kim, Andrew Morton, Sasha Levin,
	Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm, Mel Gorman,
	Yong-Taek Lee, KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins,
	Rik van Riel, Michal Nazarewicz, Dave Jones

On 05/08/2014 07:54 AM, Joonsoo Kim wrote:
> On Wed, May 07, 2014 at 04:59:07PM +0200, Vlastimil Babka wrote:
>> On 05/07/2014 03:33 AM, Minchan Kim wrote:
>>> On Mon, May 05, 2014 at 05:50:46PM +0200, Vlastimil Babka wrote:
>>>> On 05/05/2014 04:36 PM, Sasha Levin wrote:
>>>>> On 05/02/2014 08:08 AM, Vlastimil Babka wrote:
>>>>>> On 04/30/2014 11:46 PM, Sasha Levin wrote:
>>>>>>>> On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
>>>>>>>>>> For the MIGRATE_RESERVE pages, it is important they do not get misplaced
>>>>>>>>>> on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
>>>>>>>>>> pageblock might be changed to other migratetype in try_to_steal_freepages().
>>>>>>>>>> For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
>>>>>>>>>> they could get allocated as unmovable and result in CMA failure.
>>>>>>>>>>
>>>>>>>>>> This is ensured by setting the freepage_migratetype appropriately when placing
>>>>>>>>>> pages on pcp lists, and using the information when releasing them back to
>>>>>>>>>> free_list. It is also assumed that CMA and RESERVE pageblocks are created only
>>>>>>>>>> in the init phase. This patch adds DEBUG_VM checks to catch any regressions
>>>>>>>>>> introduced for this invariant.
>>>>>>>>>>
>>>>>>>>>> Cc: Yong-Taek Lee <ytk.lee@samsung.com>
>>>>>>>>>> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
>>>>>>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>>>>>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>>>>>>>> Cc: Minchan Kim <minchan@kernel.org>
>>>>>>>>>> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>>>>>>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>>>>>>> Cc: Hugh Dickins <hughd@google.com>
>>>>>>>>>> Cc: Rik van Riel <riel@redhat.com>
>>>>>>>>>> Cc: Michal Nazarewicz <mina86@mina86.com>
>>>>>>>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>>>>>>>
>>>>>>>> Two issues with this patch.
>>>>>>>>
>>>>>>>> First:
>>>>>>>>
>>>>>>>> [ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
>>>>>>>> [ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>>>>>>> [ 3446.320082] Dumping ftrace buffer:
>>>>>>>> [ 3446.320082]    (ftrace buffer empty)
>>>>>>>> [ 3446.320082] Modules linked in:
>>>>>>>> [ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
>>>>>>>> [ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
>>>>>>>> [ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>>>>>>>> [ 3446.320082] RSP: 0018:ffff88053e247778  EFLAGS: 00010002
>>>>>>>> [ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
>>>>>>>> [ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
>>>>>>>> [ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
>>>>>>>> [ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
>>>>>>>> [ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
>>>>>>>> [ 3446.335888] FS:  00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
>>>>>>>> [ 3446.335888] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>>>> [ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
>>>>>>>> [ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
>>>>>>>> [ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
>>>>>>>> [ 3446.335888] Stack:
>>>>>>>> [ 3446.335888]  ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
>>>>>>>> [ 3446.335888]  ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
>>>>>>>> [ 3446.335888]  ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
>>>>>>>> [ 3446.335888] Call Trace:
>>>>>>>> [ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
>>>>>>>> [ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
>>>>>>>> [ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
>>>>>>>> [ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
>>>>>>>> [ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
>>>>>>>> [ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
>>>>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>>>>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>>>>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:979)
>>>>>>>> [ 3446.335888] ? find_get_entry (mm/filemap.c:940)
>>>>>>>> [ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
>>>>>>>> [ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
>>>>>>>> [ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
>>>>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>>>>>>>> [ 3446.335888] shmem_fault (mm/shmem.c:1237)
>>>>>>>> [ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
>>>>>>>> [ 3446.335888] __do_fault (mm/memory.c:3344)
>>>>>>>> [ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>>>>>>>> [ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
>>>>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>>>>>>>> [ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
>>>>>>>> [ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
>>>>>>>> [ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
>>>>>>>> [ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>>>>>>>> [ 3446.335888] handle_mm_fault (mm/memory.c:3973)
>>>>>>>> [ 3446.335888] __get_user_pages (mm/memory.c:1863)
>>>>>>>> [ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
>>>>>>>> [ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
>>>>>>>> [ 3446.335888] __mm_populate (mm/mlock.c:711)
>>>>>>>> [ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
>>>>>>>> [ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
>>>>>>>> [ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
>>>>>>>> [ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
>>>>>>>> [ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
>>>>>>>> [ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
>>>>>>>> [ 3446.335888]  RSP <ffff88053e247778>
>>>>>> Hey, that's not an issue, that means the patch works as intended :) And
>>>>>> I believe it's not a bug introduced by PATCH 1/2.
>>>>>>
>>>>>> So, according to my decodecode reading, RAX is the result of
>>>>>> get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
>>>>>> of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
>>>>>> freepage_migratetype has just been set either by __rmqueue_smallest() or
>>>>>> __rmqueue_fallback(), according to the free_list the page has been taken
>>>>>> from. So this looks like a page from MIGRATE_RESERVE pageblock found on
>>>>>> the !MIGRATE_RESERVE free_list, which is exactly what the patch intends
>>>>>> to catch.
>>>>>>
>>>>>> I think there are two possible explanations.
>>>>>>
>>>>>> 1) the pageblock is genuinely MIGRATE_RESERVE and it was misplaced by
>>>>>> mistake. I think it wasn't in free_pcppages_bulk() as there's the same
>>>>>> VM_BUG_ON which would supposedly trigger at the moment of displacing. In
>>>>>> theory it's possible that there's a race through __free_pages_ok() ->
>>>>>> free_one_page() where the get_pageblock_migratetype() in
>>>>>> __free_pages_ok() would race with set_pageblock_migratetype() and result
>>>>>> in bogus value. But nobody should be calling set_pageblock_migratetype()
>>>>>> on a MIGRATE_RESERVE pageblock.
>>>>>>
>>>>>> 2) the pageblock was marked as MIGRATE_RESERVE due to a race between
>>>>>> set_pageblock_migratetype() and set_pageblock_skip(). The latter is
>>>>>> currently not serialized by zone->lock, nor does it use an atomic bit set. So
>>>>>> it may result in lost updates in a racing set_pageblock_migratetype(). I
>>>>>> think a well-placed race when changing pageblock from MIGRATE_MOVABLE to
>>>>>> MIGRATE_RECLAIMABLE could result in MIGRATE_RESERVE value. Similar races
>>>>>> have been already observed to be a problem where frequent changing
>>>>>> to/from MIGRATE_ISOLATE is involved, and I did a patch series to address
>>>>>> this, but it was not complete and I postponed it after Mel's changes
>>>>>> that remove the racy for-cycles completely. So it might be that his
>>>>>> "[PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set
>>>>>> pageblock bitmaps" already solves this bug (but maybe only on certain
>>>>>> architectures where you don't need atomic operations). You might try
>>>>>> that patch if you can reproduce this bug frequently enough?
>>>>>
>>>>> I've tried that patch, but still see the same BUG_ON.
>>>>
>>>> Oh damn, I've realized that my assumptions about MIGRATE_RESERVE
>>>> pageblocks being created only on zone init time were wrong.
>>>> setup_zone_migrate_reserve() is called also from the handler of
>>>> min_free_kbytes sysctl... does trinity try to change that while
>>>> running?
>>>> The function will change MOVABLE pageblocks to RESERVE and try to
>>>> move all free pages to the RESERVE free_list, but of course pages on
>>>> pcplists will remain MOVABLE and may trigger the VM_BUG_ON. You
>>>> triggered the bug with page on MOVABLE free_list (in the first reply
>>>> I said it's UNMOVABLE by mistake), so this might be a good explanation
>>>> if trinity changes min_free_kbytes.
>>>>
>>>> Furthermore, I think there's a problem that
>>>> setup_zone_migrate_reserve() operates on pageblocks, but as MAX_ORDER
>>>> is higher than pageblock_order, RESERVE pages might be merged with
>>>> buddies of different migratetype and end up on their free_list. That
>>>> seems to me like a flaw in the design of reserves, but perhaps
>>>> others won't think it's serious enough to fix?
>
> I want to know who wants MIGRATE_RESERVE. In my previous testing, the one
> pageblock for MIGRATE_RESERVE was merged with buddies of a different
> migratetype during boot-up and never came back. But my system works
> well. :)
>
> If it is a really useful feature, we can fix the situation by aligning
> the reserve size and pfn to MAX_ORDER_NR_PAGES.

Yes, I plan to do that.

> And, IMHO, it isn't reasonable to increase, decrease or change
> MIGRATE_RESERVE pageblocks from the handler of the min_free_kbytes sysctl,
> because a new pageblock may have no free pages to reserve. I think that
> it is better to prevent the sysctl handler from changing MIGRATE_RESERVE
> pageblocks after initialization.

This dynamic allocation could be more aggressive by doing the same stuff
as memory isolation. In any case, I think that if it cannot guarantee
that the pageblock is free, there is little benefit in trying to make
sure no pages from the pageblock stray on pcplists and are later
misplaced in the free_list. The only danger of misplacement is the page
stealing code changing a MIGRATE_RESERVE pageblock to something else. This
can be easily avoided directly in the page stealing code by checking the
pageblock migratetype before trying to change it, as sketched below. This
has very little extra overhead, in a path that's not hot to begin with.
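
A sketch of such a guard (hypothetical, in the style of the 3.15
try_to_steal_freepages() path; the helper name is invented):

/* Never convert a CMA, ISOLATE or RESERVE pageblock, even if a
 * misplaced page from it turns up during fallback. */
static bool can_change_pageblock(struct page *page)
{
	int mt = get_pageblock_migratetype(page);

	return mt != MIGRATE_RESERVE && !is_migrate_cma(mt) &&
	       !is_migrate_isolate(mt);
}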

> In conclusion, if MIGRATE_RESERVE is useful enough to maintain, fixing
> the above problem and keeping this patch is preferable to me.

I think it's useful in general, but not so critical that the
min_free_kbytes handler must provide 100% guarantees. Presumably one uses the
handler only during early boot, when there's enough free memory?
So that leaves the debug patch only for CMA, which I won't push to
mainline anymore, but feel free to adapt it for -mm only.

>>>>
>>>> So in the end this VM_DEBUG check probably cannot work anymore for
>>>> MIGRATE_RESERVE, only for CMA. I'm not sure if it's worth keeping it
>>>> only for CMA, what are the CMA guys' opinions on that?
>>>
>>> I really don't want it. That is why I didn't add my Acked-by at that time.
>>> For a long time, I have not wanted to add more overhead to the hot path for
>>> CMA unless it's really critical. The same applies here.
>>> Although such a debug patch helps to notice that something goes wrong for CMA,
>>> more information would be needed to know why CMA failed, because
>>> there are other potential reasons for a CMA allocation to fail.
>>>
>>> One idea is to store the allocation trace somewhere (a naive idea is in the
>>> page descriptor, like page-owner) and then we could investigate
>>> who the owner of that page is, so we could know why we failed to migrate it out.
>>> With that, we would figure out how on earth such a page was allocated from CMA,
>>> which would be more helpful than just a VM_BUG_ON notice.
>>>
>>> The whole point is that I'd like to avoid adding more overhead to the hot path
>>> for a rare case, even though it's a debugging feature.
>>
>> OK. I'm not that concerned with VM_DEBUG overhead as it's intended for
>> testing, not production. But as you say the patch is not that useful
>> without the MIGRATE_RESERVE part, so Andrew, could you please drop the
>> patch (mm-page_alloc-debug_vm-checks-for-free_list-placement-of-cma-and-reserve-pages.patch)?
>
> I also think that the VM_DEBUG overhead isn't a problem, for the same
> reason Vlastimil gave.
>
> Thanks.
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-12  8:28                           ` Vlastimil Babka
@ 2014-05-13  1:37                             ` Joonsoo Kim
  -1 siblings, 0 replies; 52+ messages in thread
From: Joonsoo Kim @ 2014-05-13  1:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Minchan Kim, Andrew Morton, Sasha Levin,
	Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm, Mel Gorman,
	Yong-Taek Lee, KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins,
	Rik van Riel, Michal Nazarewicz, Dave Jones

On Mon, May 12, 2014 at 10:28:25AM +0200, Vlastimil Babka wrote:
> On 05/08/2014 07:54 AM, Joonsoo Kim wrote:
> >On Wed, May 07, 2014 at 04:59:07PM +0200, Vlastimil Babka wrote:
> >> [ ... quoted oops report and earlier analysis snipped; identical to the messages above ... ]
> >
> >I want to know who wants MIGRATE_RESERVE. In my previous testing, the one
> >pageblock for MIGRATE_RESERVE was merged with buddies of a different
> >migratetype during boot-up and never came back. But my system works
> >well. :)
> >
> >If it is a really useful feature, we can fix the situation by aligning
> >the reserve size and pfn to MAX_ORDER_NR_PAGES.
> 
> Yes, I plan to do that.
> 
> >And, IMHO, it isn't reasonable to increase, decrease or change
> >MIGRATE_RESERVE pageblocks from the handler of the min_free_kbytes sysctl,
> >because a new pageblock may have no free pages to reserve. I think that
> >it is better to prevent the sysctl handler from changing MIGRATE_RESERVE
> >pageblocks after initialization.
> 
> This dynamic allocation could be more aggressive by doing the same
> stuff as memory isolation. In any case, I think that if it cannot
> guarantee that the pageblock is free, there is little benefit in
> trying to make sure no pages from the pageblock stray on pcplists
> and are later misplaced in the free_list. The only danger of
> misplacement is the page stealing code changing a MIGRATE_RESERVE
> pageblock to something else. This can be easily avoided directly in
> the page stealing code by checking the pageblock migratetype before
> trying to change it. This has very little extra overhead, in a path
> that's not hot to begin with.

I think that we should first investigate why the reserve pageblocks depend
on min_free_kbytes changes. I guess that this is for maintaining the
reserve pageblock size below a certain watermark so that the reserve
pageblocks aren't used easily. But, now, we limit the reserve pageblocks
to just 2, that is, a really small value, so the reserve pageblocks don't
need to change when min_free_kbytes is changed.
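
The limit referred to is in setup_zone_migrate_reserve(); paraphrased from
the 3.x code from memory, so treat the exact lines as approximate:

/* Approximate paraphrase of the 3.x reserve sizing, not verbatim: */
int reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages)
		>> pageblock_order;

/* Hard cap: at most two MIGRATE_RESERVE pageblocks per zone. */
reserve = min(2, reserve);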

> >In conclusion, if MIGRATE_RESERVE is useful enough to maintain, fixing
> >the above problem and keeping this patch is preferable to me.
> 
> I think it's useful in general, but not so critical that the
> min_free_kbytes handler must provide 100% guarantees. Presumably one
> uses the handler only during early boot, when there's enough free
> memory?
> So that leaves the debug patch only for CMA, which I won't push
> to mainline anymore, but feel free to adapt it for -mm only.

Okay. :)

Thanks.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-08 22:34                             ` Andrew Morton
@ 2014-05-13  1:40                               ` Joonsoo Kim
  -1 siblings, 0 replies; 52+ messages in thread
From: Joonsoo Kim @ 2014-05-13  1:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, Vlastimil Babka, Sasha Levin,
	Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm, Mel Gorman,
	Yong-Taek Lee, KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins,
	Rik van Riel, Michal Nazarewicz, Dave Jones

On Thu, May 08, 2014 at 03:34:33PM -0700, Andrew Morton wrote:
> On Thu, 8 May 2014 15:19:37 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > > I also think that the DEBUG_VM overhead isn't a problem, for the
> > > same reason Vlastimil gave.
> > 
> > Guys, please read this.
> > 
> > https://lkml.org/lkml/2013/7/17/591
> > 
> > If you guys really want it, we could gate it behind something like
> > CONFIG_DEBUG_CMA or CONFIG_DEBUG_RESERVE.
> > Otherwise, it should just remain in mmotm.
> 
> Wise words, those.
> 
> Yes, these checks are in a pretty hot path.  I'm inclined to make the
> patch -mm (and -next) only.
> 
> Unless there's a really good reason, such as "nobody who uses CMA is
> likely to be testing -next", which sounds likely :(

Hello,

Now I think that dropping this patch is better if it is only usable
for the MIGRATE_CMA case. Later, if I feel that this case should be
checked, I will resend the patch with an appropriate argument.

Thanks.
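
A minimal sketch of the gating suggested above, assuming a new Kconfig
symbol CONFIG_DEBUG_CMA (proposed in this thread; it does not exist in
mainline) that would depend on DEBUG_VM:

	#ifdef CONFIG_DEBUG_CMA
	/* placement checks cost a pageblock bitmap lookup per page */
	#define CMA_DEBUG_VM_BUG_ON(cond)	BUG_ON(cond)
	#else
	#define CMA_DEBUG_VM_BUG_ON(cond)	do { } while (0)
	#endif

The VM_BUG_ON sites added by the patch would then use
CMA_DEBUG_VM_BUG_ON() instead, so neither production kernels nor
ordinary DEBUG_VM kernels pay for the checks in the hot path.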

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-05 15:50                   ` Vlastimil Babka
@ 2014-05-14  3:47                     ` Sasha Levin
  -1 siblings, 0 replies; 52+ messages in thread
From: Sasha Levin @ 2014-05-14  3:47 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Joonsoo Kim, Bartlomiej Zolnierkiewicz
  Cc: linux-kernel, linux-mm, Mel Gorman, Yong-Taek Lee, Minchan Kim,
	KOSAKI Motohiro, Marek Szyprowski, Hugh Dickins, Rik van Riel,
	Michal Nazarewicz, Dave Jones

On 05/05/2014 11:50 AM, Vlastimil Babka wrote:
> So in the end this DEBUG_VM check probably cannot work anymore for MIGRATE_RESERVE, only for CMA. I'm not sure if it's worth keeping it only for CMA; what are the CMA guys' opinions on that?

The way I understood it is that this patch is wrong, but it's still
alive in -mm. Should it still be there?


Thanks,
Sasha
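
For context, the check being discussed looks roughly like this (a
reconstruction of the idea behind PATCH 2/2, not the exact hunk):

	/* when a page moves between the pcplists and the buddy free_list */
	int pb_mt = get_pageblock_migratetype(page);

	/* RESERVE and CMA pages must stay on their own free_list */
	VM_BUG_ON((pb_mt == MIGRATE_RESERVE || is_migrate_cma(pb_mt)) &&
		  get_freepage_migratetype(page) != pb_mt);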

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-14  3:47                     ` Sasha Levin
@ 2014-05-14  5:19                       ` Hugh Dickins
  -1 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2014-05-14  5:19 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Vlastimil Babka, Andrew Morton, Joonsoo Kim,
	Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm, Mel Gorman,
	Yong-Taek Lee, Minchan Kim, KOSAKI Motohiro, Marek Szyprowski,
	Hugh Dickins, Rik van Riel, Michal Nazarewicz, Dave Jones

On Tue, 13 May 2014, Sasha Levin wrote:
> On 05/05/2014 11:50 AM, Vlastimil Babka wrote:
> > So in the end this DEBUG_VM check probably cannot work anymore for MIGRATE_RESERVE, only for CMA. I'm not sure if it's worth keeping it only for CMA; what are the CMA guys' opinions on that?
> 
> The way I understood it is that this patch is wrong, but it's still
> alive in -mm. Should it still be there?

I agree that it should be dropped.  I did not follow the discussion,
but mmotm soon gives me BUG at mm/page_alloc.c:1242 under swapping load.

Hugh

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
  2014-05-14  5:19                       ` Hugh Dickins
@ 2014-05-14  9:01                         ` Vlastimil Babka
  -1 siblings, 0 replies; 52+ messages in thread
From: Vlastimil Babka @ 2014-05-14  9:01 UTC (permalink / raw)
  To: Hugh Dickins, Sasha Levin, Andrew Morton
  Cc: Joonsoo Kim, Bartlomiej Zolnierkiewicz, linux-kernel, linux-mm,
	Mel Gorman, Yong-Taek Lee, Minchan Kim, KOSAKI Motohiro,
	Marek Szyprowski, Rik van Riel, Michal Nazarewicz, Dave Jones

On 05/14/2014 07:19 AM, Hugh Dickins wrote:
> On Tue, 13 May 2014, Sasha Levin wrote:
>> On 05/05/2014 11:50 AM, Vlastimil Babka wrote:
>>> So in the end this DEBUG_VM check probably cannot work anymore for MIGRATE_RESERVE, only for CMA. I'm not sure if it's worth keeping it only for CMA; what are the CMA guys' opinions on that?
>>
>> The way I understood it is that this patch is wrong, but it's still
>> alive in -mm. Should it still be there?
>
> I agree that it should be dropped.  I did not follow the discussion,
> but mmotm soon gives me BUG at mm/page_alloc.c:1242 under swapping load.

Yes, I have already asked for dropping it, and for updating the
message of PATCH 1/2, at http://marc.info/?l=linux-mm&m=139947475413079&w=2

Vlastimil

> Hugh
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2014-05-14  9:01 UTC | newest]

Thread overview:
2014-03-06 17:35 [PATCH v3] mm/page_alloc: fix freeing of MIGRATE_RESERVE migratetype pages Bartlomiej Zolnierkiewicz
2014-03-21 14:16 ` Vlastimil Babka
2014-03-25 13:47   ` Bartlomiej Zolnierkiewicz
2014-04-03 15:36     ` Vlastimil Babka
2014-04-03 15:40       ` [PATCH 1/2] mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced Vlastimil Babka
2014-04-03 15:40         ` [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages Vlastimil Babka
2014-04-16  1:09           ` Joonsoo Kim
2014-04-30 21:46           ` Sasha Levin
2014-05-02 12:08             ` Vlastimil Babka
2014-05-05 14:36               ` Sasha Levin
2014-05-05 15:50                 ` Vlastimil Babka
2014-05-05 16:37                   ` Sasha Levin
2014-05-07  1:33                   ` Minchan Kim
2014-05-07 14:59                     ` Vlastimil Babka
2014-05-08  5:54                       ` Joonsoo Kim
2014-05-08  6:19                         ` Minchan Kim
2014-05-08 22:34                           ` Andrew Morton
2014-05-13  1:40                             ` Joonsoo Kim
2014-05-08  8:51                         ` Mel Gorman
2014-05-12  8:28                         ` Vlastimil Babka
2014-05-13  1:37                           ` Joonsoo Kim
2014-05-14  3:47                   ` Sasha Levin
2014-05-14  5:19                     ` Hugh Dickins
2014-05-14  9:01                       ` Vlastimil Babka
2014-04-16  0:56         ` [PATCH 1/2] mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced Joonsoo Kim
2014-04-17 23:29         ` Minchan Kim
