* [PATCH 00/10] fix freepage count problems due to memory isolation
@ 2014-07-04  7:57 ` Joonsoo Kim
  0 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

Hello,

This patchset aims to fix freepage accounting problems caused by memory
isolation, which I found while testing my patchset [1].

These problems are quite subtle, so I may be wrong about some of this.
If you find anything I am missing, please let me know.

Before describing the bugs themselves, let me first explain how freepages
are counted.

1. Pages on a buddy list are counted as freepages.
2. Pages on the isolate migratetype buddy list are *not* counted as freepages.
3. Pages on a CMA buddy list are counted as CMA freepages, too.
4. Guard pages are *not* counted as freepages.

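To make these rules concrete, here is a rough sketch in the style of
mm/page_alloc.c. It is illustrative only: the real code folds rules 1-3
into __mod_zone_freepage_state(), and rule 4 is enforced by subtracting
the freepage count when a guard page is created (see patch 2).

	/* Illustrative sketch of the counting rules above, not kernel code. */
	static void account_freed_page(struct zone *zone, unsigned int order,
					int migratetype)
	{
		/* rule 2: pages going to the isolate buddy list are not counted */
		if (is_migrate_isolate(migratetype))
			return;

		/* rule 1: count the page as a freepage */
		__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);

		/* rule 3: CMA pages are counted as CMA freepages, too */
		if (is_migrate_cma(migratetype))
			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1 << order);
	}

Every bug below is some path that breaks one of these rules.
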
Now, here are the problems and the patches that address them.

1. Patch 2: If a guard page is cleared and merged into the isolate buddy
list, we should not increase the freepage count.

2. Patch 3: When a page returns from the pcp list to the buddy allocator,
we should account for it in the freepage counter. In this case, we should
check the pageblock migratetype of the page and insert the page into the
matching buddy list. The current code does the check, but it does not
insert the page into the matching buddy list, so the freepage count can
go wrong.

3. Patch 4: There is a race condition through which some freepages can
end up on the isolate buddy list. If that happens, such a page cannot be
used until the next isolation attempt on its pageblock.

4. Patch 5: There is a race condition through which a page on an isolate
pageblock can end up on a non-isolate buddy list. If that happens, the
buddy allocator can merge a page on a non-isolate buddy list with a buddy
on the isolate buddy list, and the freepage count goes wrong.

5. Patch 9: move_freepages(_block) does *not* return the number of moved
pages. Instead, it returns the number of pages linked on the target
migratetype buddy list, whether they were just moved or already there.
Accounting with this return value therefore makes the freepage count
wrong (see the sketch after this list).

6. Patch 10: The buddy allocator can merge a page on a non-isolate buddy
list with a buddy on the isolate buddy list. This leads to a freepage
counting problem, so fix it by refusing to merge in this case.

Without patchset [1], the above problems don't show up in my CMA
allocation test, because the CMA reserved pages aren't used at all, so
the races above never get a chance to trigger.

With patchset [1], I ran a simple CMA allocation test and got the result below.

- Virtual machine, 4 CPUs, 1024 MB memory, 256 MB CMA reservation
- Kernel build (make -j16) running in the background
- 30 CMA allocation attempts (8 MB each, 240 MB total) at 5 second intervals
- Result: more than 5000 pages go missing from the freepage count

With patchset [1] plus this patchset, no freepage count was lost, so I
conclude that the problems are solved.

These problems could also hit memory hot-remove users, although I
haven't checked that further.

The other patches either form the base for these fixes or are simple
clean-ups. Please see the individual patches for more information.

This patchset is based on linux-next-20140703.

Thanks.

[1]: Aggressively allocate the pages on cma reserved memory
     https://lkml.org/lkml/2014/5/30/291


Joonsoo Kim (10):
  mm/page_alloc: remove unlikely macro on free_one_page()
  mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
  mm/page_alloc: handle page on pcp correctly if its pageblock is
    isolated
  mm/page_alloc: carefully free the page on isolate pageblock
  mm/page_alloc: optimize and unify pageblock migratetype check in free
    path
  mm/page_alloc: separate freepage migratetype interface
  mm/page_alloc: store migratetype of the buddy list into freepage
    correctly
  mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
  mm/page_alloc: fix possible wrongly calculated freepage counter
  mm/page_alloc: stop merging pages on non-isolate and isolate buddy
    list

 include/linux/mm.h             |   30 +++++++--
 include/linux/mmzone.h         |    5 ++
 include/linux/page-isolation.h |    8 +++
 mm/page_alloc.c                |  138 +++++++++++++++++++++++++++++-----------
 mm/page_isolation.c            |   18 ++----
 5 files changed, 147 insertions(+), 52 deletions(-)

-- 
1.7.9.5


* [PATCH 01/10] mm/page_alloc: remove unlikely macro on free_one_page()
  2014-07-04  7:57 ` Joonsoo Kim
@ 2014-07-04  7:57   ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

Isolation is a really rare case, so !is_migrate_isolate() is actually
the likely case here. Remove the unlikely macro.
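
For context, likely()/unlikely() are just branch-prediction hints for
the compiler; in include/linux/compiler.h they boil down to:

	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)

so wrapping the common !is_migrate_isolate() path in unlikely() asks the
compiler to optimize the branch layout for the wrong case.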

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8dac0f0..0d4cf7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -735,7 +735,7 @@ static void free_one_page(struct zone *zone,
 	zone->pages_scanned = 0;
 
 	__free_one_page(page, pfn, zone, order, migratetype);
-	if (unlikely(!is_migrate_isolate(migratetype)))
+	if (!is_migrate_isolate(migratetype))
 		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 	spin_unlock(&zone->lock);
 }
-- 
1.7.9.5


* [PATCH 02/10] mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
  2014-07-04  7:57 ` Joonsoo Kim
@ 2014-07-04  7:57   ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

In __free_one_page(), we check whether the buddy page is a guard page
and, if so, clear the guard attribute on it. But currently we clear the
original page's order rather than the buddy's. This causes no actual
problem, because resetting the buddy's order has no effect and the
original page's order is re-assigned soon anyway, but it is better to
make the code correct.

Additionally, I rename (set/clear)_page_guard_flag() to
(set/clear)_page_guard() and make these functions do all the work needed
for a guard page. This should make the code easier to understand.

One more thing this patch does is fix the freepage accounting: if we
clear a guard page and link it onto the isolate buddy list, we should
not increase the freepage count.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/page_alloc.c |   29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d4cf7a..aeb51d1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -441,18 +441,28 @@ static int __init debug_guardpage_minorder_setup(char *buf)
 }
 __setup("debug_guardpage_minorder=", debug_guardpage_minorder_setup);
 
-static inline void set_page_guard_flag(struct page *page)
+static inline void set_page_guard(struct zone *zone, struct page *page,
+				unsigned int order, int migratetype)
 {
 	__set_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
+	set_page_private(page, order);
+	/* Guard pages are not available for any usage */
+	__mod_zone_freepage_state(zone, -(1 << order), migratetype);
 }
 
-static inline void clear_page_guard_flag(struct page *page)
+static inline void clear_page_guard(struct zone *zone, struct page *page,
+				unsigned int order, int migratetype)
 {
 	__clear_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
+	set_page_private(page, 0);
+	if (!is_migrate_isolate(migratetype))
+		__mod_zone_freepage_state(zone, (1 << order), migratetype);
 }
 #else
-static inline void set_page_guard_flag(struct page *page) { }
-static inline void clear_page_guard_flag(struct page *page) { }
+static inline void set_page_guard(struct zone *zone, struct page *page,
+				unsigned int order, int migratetype) {}
+static inline void clear_page_guard(struct zone *zone, struct page *page,
+				unsigned int order, int migratetype) {}
 #endif
 
 static inline void set_page_order(struct page *page, unsigned int order)
@@ -594,10 +604,7 @@ static inline void __free_one_page(struct page *page,
 		 * merge with it and move up one order.
 		 */
 		if (page_is_guard(buddy)) {
-			clear_page_guard_flag(buddy);
-			set_page_private(page, 0);
-			__mod_zone_freepage_state(zone, 1 << order,
-						  migratetype);
+			clear_page_guard(zone, buddy, order, migratetype);
 		} else {
 			list_del(&buddy->lru);
 			zone->free_area[order].nr_free--;
@@ -919,11 +926,7 @@ static inline void expand(struct zone *zone, struct page *page,
 			 * pages will stay not present in virtual address space
 			 */
 			INIT_LIST_HEAD(&page[size].lru);
-			set_page_guard_flag(&page[size]);
-			set_page_private(&page[size], high);
-			/* Guard pages are not available for any usage */
-			__mod_zone_freepage_state(zone, -(1 << high),
-						  migratetype);
+			set_page_guard(zone, &page[size], high, migratetype);
 			continue;
 		}
 #endif
-- 
1.7.9.5


* [PATCH 03/10] mm/page_alloc: handle page on pcp correctly if its pageblock is isolated
  2014-07-04  7:57 ` Joonsoo Kim
@ 2014-07-04  7:57   ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

If the pageblock of a page on the pcp list has been isolated, we should
free the page to the isolate buddy list to prevent future allocation
from it. But the current code doesn't do this.

Moreover, the current code has a freepage counting problem. Even though
the pageblock of a page on the pcp list is isolated now, the page can
still go onto a normal buddy list, because get_freepage_migratetype()
returns a non-isolate migratetype. In this case, we should either add to
the freepage count or change the migratetype to MIGRATE_ISOLATE, but the
current code does neither.

This patch fixes both problems by handling the pageblock migratetype
before calling __free_one_page(): if the page is found to be on an
isolated pageblock, its migratetype is changed to MIGRATE_ISOLATE, which
prevents both future allocation of the page and the freepage counting
problem.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/page_alloc.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aeb51d1..99c05f7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -719,15 +719,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
-			mt = get_freepage_migratetype(page);
+
+			if (unlikely(is_migrate_isolate_page(page))) {
+				mt = MIGRATE_ISOLATE;
+			} else {
+				mt = get_freepage_migratetype(page);
+				__mod_zone_freepage_state(zone, 1, mt);
+			}
+
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
-			if (likely(!is_migrate_isolate_page(page))) {
-				__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
-				if (is_migrate_cma(mt))
-					__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
-			}
 		} while (--to_free && --batch_free && !list_empty(list));
 	}
 	spin_unlock(&zone->lock);
-- 
1.7.9.5


* [PATCH 04/10] mm/page_alloc: carefully free the page on isolate pageblock
  2014-07-04  7:57 ` Joonsoo Kim
@ 2014-07-04  7:57   ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

We get the migratetype without holding the zone lock, so it can be
racy. If a page lands on the isolate migratetype buddy list through
this race, we cannot allocate it again until the next isolation
attempt on that pageblock. Below is a possible scenario for this race.

pageblock 1 has the isolate migratetype.

CPU1					CPU2
- get_pfnblock_migratetype(pageblock 1),
so MIGRATE_ISOLATE is returned
- call free_one_page() with MIGRATE_ISOLATE
					- grab the zone lock
					- unisolate pageblock 1
					- release the zone lock
- grab the zone lock
- call __free_one_page() with MIGRATE_ISOLATE
- the freed page goes onto the isolate buddy list
and cannot be used anymore

To prevent this, re-check the migratetype while holding the lock.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/page_alloc.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 99c05f7..d8feedc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -743,6 +743,17 @@ static void free_one_page(struct zone *zone,
 	spin_lock(&zone->lock);
 	zone->pages_scanned = 0;
 
+	if (unlikely(is_migrate_isolate(migratetype))) {
+		/*
+		 * We got migratetype without holding the lock so it could be
+		 * racy. If some pages go on the isolate migratetype buddy list
+		 * by this race, we can't allocate this page anymore until next
+		 * isolation attempt on this pageblock. To prevent this
+		 * possibility, re-check migratetype with holding the lock.
+		 */
+		migratetype = get_pfnblock_migratetype(page, pfn);
+	}
+
 	__free_one_page(page, pfn, zone, order, migratetype);
 	if (!is_migrate_isolate(migratetype))
 		__mod_zone_freepage_state(zone, 1 << order, migratetype);
-- 
1.7.9.5


* [PATCH 05/10] mm/page_alloc: optimize and unify pageblock migratetype check in free path
  2014-07-04  7:57 ` Joonsoo Kim
@ 2014-07-04  7:57   ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

Currently, when we free a page from the pcp list to the buddy allocator,
we check the page's pageblock so that a page on an isolated pageblock is
freed to the isolate buddy list. Although this rarely matters, checking
the pageblock migratetype is somewhat expensive and we do it on the free
fast path, which is undesirable. To avoid this, I introduce a new field,
nr_isolate_pageblock, on struct zone and use it to decide whether we
need to check the pageblock migratetype at all. Pageblock isolation
happens rarely, so we can skip the check most of the time.

Additionally, unify the freepage counting code, since it can be done in
the common path, __free_one_page(). This unification provides the extra
guarantee that a page on an isolate pageblock never goes onto a
non-isolate buddy list. The situation is similar to the one described in
the previous patch, so refer to it if you need more explanation.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/mmzone.h         |    5 +++++
 include/linux/page-isolation.h |    8 ++++++++
 mm/page_alloc.c                |   40 +++++++++++++++++++---------------------
 mm/page_isolation.c            |    2 ++
 4 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fd48890..e9c194f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -374,6 +374,11 @@ struct zone {
 	/* pfn where async and sync compaction migration scanner should start */
 	unsigned long		compact_cached_migrate_pfn[2];
 #endif
+
+#ifdef CONFIG_MEMORY_ISOLATION
+	unsigned long		nr_isolate_pageblock;
+#endif
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 	/* see spanned/present_pages for more description */
 	seqlock_t		span_seqlock;
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 3fff8e7..2dc1e16 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -2,6 +2,10 @@
 #define __LINUX_PAGEISOLATION_H
 
 #ifdef CONFIG_MEMORY_ISOLATION
+static inline bool has_isolate_pageblock(struct zone *zone)
+{
+	return zone->nr_isolate_pageblock;
+}
 static inline bool is_migrate_isolate_page(struct page *page)
 {
 	return get_pageblock_migratetype(page) == MIGRATE_ISOLATE;
@@ -11,6 +15,10 @@ static inline bool is_migrate_isolate(int migratetype)
 	return migratetype == MIGRATE_ISOLATE;
 }
 #else
+static inline bool has_isolate_pageblock(struct zone *zone)
+{
+	return false;
+}
 static inline bool is_migrate_isolate_page(struct page *page)
 {
 	return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d8feedc..dcc2f08 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -594,6 +594,24 @@ static inline void __free_one_page(struct page *page,
 	VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
+	/*
+	 * pcp pages could go normal buddy list due to stale pageblock
+	 * migratetype so re-check it if there is isolate pageblock.
+	 *
+	 * And, we got migratetype without holding the lock so it could be
+	 * racy. If some pages go on the isolate migratetype buddy list
+	 * by this race, we can't allocate this page anymore until next
+	 * isolation attempt on this pageblock. To prevent this
+	 * possibility, re-check migratetype with holding the lock.
+	 */
+	if (unlikely(has_isolate_pageblock(zone) ||
+		is_migrate_isolate(migratetype))) {
+		migratetype = get_pfnblock_migratetype(page, pfn);
+	}
+
+	if (!is_migrate_isolate(migratetype))
+		__mod_zone_freepage_state(zone, 1 << order, migratetype);
+
 	while (order < MAX_ORDER-1) {
 		buddy_idx = __find_buddy_index(page_idx, order);
 		buddy = page + (buddy_idx - page_idx);
@@ -719,13 +737,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
-
-			if (unlikely(is_migrate_isolate_page(page))) {
-				mt = MIGRATE_ISOLATE;
-			} else {
-				mt = get_freepage_migratetype(page);
-				__mod_zone_freepage_state(zone, 1, mt);
-			}
+			mt = get_freepage_migratetype(page);
 
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
@@ -742,21 +754,7 @@ static void free_one_page(struct zone *zone,
 {
 	spin_lock(&zone->lock);
 	zone->pages_scanned = 0;
-
-	if (unlikely(is_migrate_isolate(migratetype))) {
-		/*
-		 * We got migratetype without holding the lock so it could be
-		 * racy. If some pages go on the isolate migratetype buddy list
-		 * by this race, we can't allocate this page anymore until next
-		 * isolation attempt on this pageblock. To prevent this
-		 * possibility, re-check migratetype with holding the lock.
-		 */
-		migratetype = get_pfnblock_migratetype(page, pfn);
-	}
-
 	__free_one_page(page, pfn, zone, order, migratetype);
-	if (!is_migrate_isolate(migratetype))
-		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 	spin_unlock(&zone->lock);
 }
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index d1473b2..1fa4a4d 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -60,6 +60,7 @@ out:
 		int migratetype = get_pageblock_migratetype(page);
 
 		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
+		zone->nr_isolate_pageblock++;
 		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
 
 		__mod_zone_freepage_state(zone, -nr_pages, migratetype);
@@ -83,6 +84,7 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 	nr_pages = move_freepages_block(zone, page, migratetype);
 	__mod_zone_freepage_state(zone, nr_pages, migratetype);
 	set_pageblock_migratetype(page, migratetype);
+	zone->nr_isolate_pageblock--;
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
-- 
1.7.9.5


* [PATCH 06/10] mm/page_alloc: separate freepage migratetype interface
  2014-07-04  7:57 ` Joonsoo Kim
@ 2014-07-04  7:57   ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

Currently, we use (set/get)_freepage_migratetype for two purposes. One
is to know the migratetype of the buddy list where a page will be linked
later. The other is to know the migratetype of the buddy list where a
page is linked now.

The latter case is handled incompletely, but this has caused no serious
problem so far, because it is only used in limited contexts such as
memory isolation. However, I am now preparing to fix many freepage
counting bugs, which requires accurate information about the migratetype
of the buddy list a page is currently linked on, so this incompleteness
becomes a problem.

Before fixing the incompleteness, the interface needs to be separated,
because this information is only needed when CONFIG_MEMORY_ISOLATION is
enabled and maintaining it has some overhead.

So this patch only separates the interface. There is no functional
change; the following patch describes what we are missing and fixes it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/mm.h  |   24 ++++++++++++++++++++----
 mm/page_alloc.c     |   18 +++++++++---------
 mm/page_isolation.c |    4 ++--
 3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e03dd29..278ecfd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -281,14 +281,30 @@ struct inode;
 #define page_private(page)		((page)->private)
 #define set_page_private(page, v)	((page)->private = (v))
 
-/* It's valid only if the page is free path or free_list */
-static inline void set_freepage_migratetype(struct page *page, int migratetype)
+static inline void set_onbuddy_migratetype(struct page *page, int migratetype)
 {
 	page->index = migratetype;
 }
 
-/* It's valid only if the page is free path or free_list */
-static inline int get_freepage_migratetype(struct page *page)
+/*
+ * It's valid only if the page is on buddy list. It represents
+ * migratetype of the buddy list where page is linked now.
+ */
+static inline int get_onbuddy_migratetype(struct page *page)
+{
+	return page->index;
+}
+
+static inline void set_onpcp_migratetype(struct page *page, int migratetype)
+{
+	page->index = migratetype;
+}
+
+/*
+ * It's valid only if the page is on pcp. It represents migratetype of
+ * the buddy list where page will be linked later when going buddy.
+ */
+static inline int get_onpcp_migratetype(struct page *page)
 {
 	return page->index;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dcc2f08..9d8ba2d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -737,7 +737,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
-			mt = get_freepage_migratetype(page);
+			mt = get_onpcp_migratetype(page);
 
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
@@ -797,7 +797,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	set_freepage_migratetype(page, migratetype);
+	set_onbuddy_migratetype(page, migratetype);
 	free_one_page(page_zone(page), page, pfn, order, migratetype);
 	local_irq_restore(flags);
 }
@@ -1023,7 +1023,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
-		set_freepage_migratetype(page, migratetype);
+		set_onpcp_migratetype(page, migratetype);
 		return page;
 	}
 
@@ -1091,7 +1091,7 @@ int move_freepages(struct zone *zone,
 		order = page_order(page);
 		list_move(&page->lru,
 			  &zone->free_area[order].free_list[migratetype]);
-		set_freepage_migratetype(page, migratetype);
+		set_onbuddy_migratetype(page, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -1221,12 +1221,12 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 
 			expand(zone, page, order, current_order, area,
 			       new_type);
-			/* The freepage_migratetype may differ from pageblock's
+			/* The onpcp_migratetype may differ from pageblock's
 			 * migratetype depending on the decisions in
 			 * try_to_steal_freepages. This is OK as long as it does
 			 * not differ for MIGRATE_CMA type.
 			 */
-			set_freepage_migratetype(page, new_type);
+			set_onpcp_migratetype(page, new_type);
 
 			trace_mm_page_alloc_extfrag(page, order, current_order,
 				start_migratetype, migratetype, new_type);
@@ -1344,7 +1344,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		else
 			list_add_tail(&page->lru, list);
 		list = &page->lru;
-		if (is_migrate_cma(get_freepage_migratetype(page)))
+		if (is_migrate_cma(get_onpcp_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
 	}
@@ -1510,7 +1510,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
-	set_freepage_migratetype(page, migratetype);
+	set_onpcp_migratetype(page, migratetype);
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
 
@@ -1710,7 +1710,7 @@ again:
 		if (!page)
 			goto failed;
 		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_freepage_migratetype(page));
+					  get_onpcp_migratetype(page));
 	}
 
 	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 1fa4a4d..6e4e86b 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -192,7 +192,7 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
 			 * is MIGRATE_ISOLATE. Catch it and move the page into
 			 * MIGRATE_ISOLATE list.
 			 */
-			if (get_freepage_migratetype(page) != MIGRATE_ISOLATE) {
+			if (get_onbuddy_migratetype(page) != MIGRATE_ISOLATE) {
 				struct page *end_page;
 
 				end_page = page + (1 << page_order(page)) - 1;
@@ -202,7 +202,7 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
 			pfn += 1 << page_order(page);
 		}
 		else if (page_count(page) == 0 &&
-			get_freepage_migratetype(page) == MIGRATE_ISOLATE)
+			get_onpcp_migratetype(page) == MIGRATE_ISOLATE)
 			pfn += 1;
 		else if (skip_hwpoisoned_pages && PageHWPoison(page)) {
 			/*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 06/10] mm/page_alloc: separate freepage migratetype interface
@ 2014-07-04  7:57   ` Joonsoo Kim
  0 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

Currently, we use (set/get)_freepage_migratetype in two use cases.
One usecase is to know migratetype of buddy list where page will be
linked later. The other one is to know migratetype of buddy list where
page is linked now.

Although there is incompleteness for later case, there is no serious
problem, because it is only used by limited context, such as memory
isolation. But, now I'm preparing to fix many freepage counting bugs, and
accurate information about the migratetype of buddy list where page is
linked now is really needed. So this incompleteness would be problem.

Before fixing this incompleteness, separation of interface is needed,
because it is only used if CONFIG_MEMORY_ISOLATION is enabled and it has
some overhead.

So this patch just do separation of interface. There is no functional
change and following patch will describe what we are missing and fix it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/mm.h  |   24 ++++++++++++++++++++----
 mm/page_alloc.c     |   18 +++++++++---------
 mm/page_isolation.c |    4 ++--
 3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e03dd29..278ecfd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -281,14 +281,30 @@ struct inode;
 #define page_private(page)		((page)->private)
 #define set_page_private(page, v)	((page)->private = (v))
 
-/* It's valid only if the page is free path or free_list */
-static inline void set_freepage_migratetype(struct page *page, int migratetype)
+static inline void set_onbuddy_migratetype(struct page *page, int migratetype)
 {
 	page->index = migratetype;
 }
 
-/* It's valid only if the page is free path or free_list */
-static inline int get_freepage_migratetype(struct page *page)
+/*
+ * It's valid only if the page is on buddy list. It represents
+ * migratetype of the buddy list where page is linked now.
+ */
+static inline int get_onbuddy_migratetype(struct page *page)
+{
+	return page->index;
+}
+
+static inline void set_onpcp_migratetype(struct page *page, int migratetype)
+{
+	page->index = migratetype;
+}
+
+/*
+ * It's valid only if the page is on pcp. It represents migratetype of
+ * the buddy list where page will be linked later when going buddy.
+ */
+static inline int get_onpcp_migratetype(struct page *page)
 {
 	return page->index;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dcc2f08..9d8ba2d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -737,7 +737,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
-			mt = get_freepage_migratetype(page);
+			mt = get_onpcp_migratetype(page);
 
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
@@ -797,7 +797,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	set_freepage_migratetype(page, migratetype);
+	set_onbuddy_migratetype(page, migratetype);
 	free_one_page(page_zone(page), page, pfn, order, migratetype);
 	local_irq_restore(flags);
 }
@@ -1023,7 +1023,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
-		set_freepage_migratetype(page, migratetype);
+		set_onpcp_migratetype(page, migratetype);
 		return page;
 	}
 
@@ -1091,7 +1091,7 @@ int move_freepages(struct zone *zone,
 		order = page_order(page);
 		list_move(&page->lru,
 			  &zone->free_area[order].free_list[migratetype]);
-		set_freepage_migratetype(page, migratetype);
+		set_onbuddy_migratetype(page, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -1221,12 +1221,12 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 
 			expand(zone, page, order, current_order, area,
 			       new_type);
-			/* The freepage_migratetype may differ from pageblock's
+			/* The onpcp_migratetype may differ from pageblock's
 			 * migratetype depending on the decisions in
 			 * try_to_steal_freepages. This is OK as long as it does
 			 * not differ for MIGRATE_CMA type.
 			 */
-			set_freepage_migratetype(page, new_type);
+			set_onpcp_migratetype(page, new_type);
 
 			trace_mm_page_alloc_extfrag(page, order, current_order,
 				start_migratetype, migratetype, new_type);
@@ -1344,7 +1344,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		else
 			list_add_tail(&page->lru, list);
 		list = &page->lru;
-		if (is_migrate_cma(get_freepage_migratetype(page)))
+		if (is_migrate_cma(get_onpcp_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
 	}
@@ -1510,7 +1510,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
-	set_freepage_migratetype(page, migratetype);
+	set_onpcp_migratetype(page, migratetype);
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
 
@@ -1710,7 +1710,7 @@ again:
 		if (!page)
 			goto failed;
 		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_freepage_migratetype(page));
+					  get_onpcp_migratetype(page));
 	}
 
 	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 1fa4a4d..6e4e86b 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -192,7 +192,7 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
 			 * is MIGRATE_ISOLATE. Catch it and move the page into
 			 * MIGRATE_ISOLATE list.
 			 */
-			if (get_freepage_migratetype(page) != MIGRATE_ISOLATE) {
+			if (get_onbuddy_migratetype(page) != MIGRATE_ISOLATE) {
 				struct page *end_page;
 
 				end_page = page + (1 << page_order(page)) - 1;
@@ -202,7 +202,7 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
 			pfn += 1 << page_order(page);
 		}
 		else if (page_count(page) == 0 &&
-			get_freepage_migratetype(page) == MIGRATE_ISOLATE)
+			get_onpcp_migratetype(page) == MIGRATE_ISOLATE)
 			pfn += 1;
 		else if (skip_hwpoisoned_pages && PageHWPoison(page)) {
 			/*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 07/10] mm/page_alloc: store migratetype of the buddy list into freepage correctly
  2014-07-04  7:57 ` Joonsoo Kim
@ 2014-07-04  7:57   ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

Whenever a page is split or merged, we don't set the migratetype on the
new page, so there is no accurate onbuddy_migratetype information. To
maintain it correctly, we should reset it whenever the page order
changes. I think that set_page_order() is the best place to do this,
because it is called whenever a page is merged or split. Hence, this
patch adds a set_onbuddy_migratetype() call to set_page_order().

This patch also makes set/get_onbuddy_migratetype() enabled only if
memory isolation is enabled, because they aren't needed otherwise.
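
To illustrate why the information goes stale, here is a simplified
sketch of the pre-patch expand() (debug checks omitted; not part of
the patch itself). When a high-order page is split, the tail halves
are linked onto a buddy list without their migratetype field ever
being written:

	static inline void expand(struct zone *zone, struct page *page,
				int low, int high, struct free_area *area,
				int migratetype)
	{
		unsigned long size = 1 << high;

		while (high > low) {
			area--;
			high--;
			size >>= 1;
			/* the tail half is linked on the buddy list... */
			list_add(&page[size].lru, &area->free_list[migratetype]);
			area->nr_free++;
			/* ...but only its order is recorded; page[size].index
			 * (the onbuddy migratetype) is never set here, so it
			 * still holds whatever a previous user left behind.
			 */
			set_page_order(&page[size], high);
		}
	}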

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/mm.h |    6 ++++++
 mm/page_alloc.c    |    9 +++++----
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 278ecfd..b35bd3b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -281,6 +281,7 @@ struct inode;
 #define page_private(page)		((page)->private)
 #define set_page_private(page, v)	((page)->private = (v))
 
+#if defined(CONFIG_MEMORY_ISOLATION)
 static inline void set_onbuddy_migratetype(struct page *page, int migratetype)
 {
 	page->index = migratetype;
@@ -294,6 +295,11 @@ static inline int get_onbuddy_migratetype(struct page *page)
 {
 	return page->index;
 }
+#else
+static inline void set_onbuddy_migratetype(struct page *page,
+						int migratetype) {}
+static inline int get_onbuddy_migratetype(struct page *page) { return 0; }
+#endif
 
 static inline void set_onpcp_migratetype(struct page *page, int migratetype)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d8ba2d..e1c4c3e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -465,9 +465,11 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
 				unsigned int order, int migratetype) {}
 #endif
 
-static inline void set_page_order(struct page *page, unsigned int order)
+static inline void set_page_order(struct page *page, unsigned int order,
+							int migratetype)
 {
 	set_page_private(page, order);
+	set_onbuddy_migratetype(page, migratetype);
 	__SetPageBuddy(page);
 }
 
@@ -633,7 +635,7 @@ static inline void __free_one_page(struct page *page,
 		page_idx = combined_idx;
 		order++;
 	}
-	set_page_order(page, order);
+	set_page_order(page, order, migratetype);
 
 	/*
 	 * If this is not the largest possible page, check if the buddy
@@ -797,7 +799,6 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	set_onbuddy_migratetype(page, migratetype);
 	free_one_page(page_zone(page), page, pfn, order, migratetype);
 	local_irq_restore(flags);
 }
@@ -943,7 +944,7 @@ static inline void expand(struct zone *zone, struct page *page,
 #endif
 		list_add(&page[size].lru, &area->free_list[migratetype]);
 		area->nr_free++;
-		set_page_order(&page[size], high);
+		set_page_order(&page[size], high, migratetype);
 	}
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 08/10] mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
  2014-07-04  7:57 ` Joonsoo Kim
@ 2014-07-04  7:57   ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

When isolating a free page, what we want to know is which list the
page is linked in. If it is linked in the isolate migratetype buddy
list, we can skip the watermark check and freepage counting. And if it
is linked in the CMA migratetype buddy list, we need to fix up the
freepage counting. For this purpose, get_onbuddy_migratetype() is a
better fit, and cheaper, than get_pageblock_migratetype(). So use it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e1c4c3e..d9fb8bb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1597,7 +1597,7 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 	BUG_ON(!PageBuddy(page));
 
 	zone = page_zone(page);
-	mt = get_pageblock_migratetype(page);
+	mt = get_onbuddy_migratetype(page);
 
 	if (!is_migrate_isolate(mt)) {
 		/* Obey watermarks as if the page was being allocated */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 09/10] mm/page_alloc: fix possible wrongly calculated freepage counter
  2014-07-04  7:57 ` Joonsoo Kim
@ 2014-07-04  7:57   ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

When isolating/unisolating a pageblock, we add/subtract the number of
pages returned by move_freepages_block() to calculate the number of
freepages. But this value is invalid for that calculation, because it
is the number of pages linked on the buddy list of that migratetype,
rather than the number of *moved* pages. For example, if some free
pages in the range were already on the target migratetype's buddy list
due to a race, they are counted again even though nothing moved. So
the freepage count can drift further from reality on every call to
these functions.

And there is one more counting problem, in
__test_page_isolated_in_pageblock(): move_freepages() is called, but
the freepage count is not fixed up. I think that counting should be
done in move_freepages() itself; otherwise, a future user of this
function would miss the fixup again.

Now we have proper infrastructure, get_onbuddy_migratetype(), which
can be used to get the current migratetype of the buddy list a page is
linked on. So fix this situation.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/page_alloc.c     |   37 ++++++++++++++++++++++++++++++++++---
 mm/page_isolation.c |   12 +++---------
 2 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d9fb8bb..80c9bd8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -465,6 +465,33 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
 				unsigned int order, int migratetype) {}
 #endif
 
+static inline void count_freepage_nr(struct page *page, int order,
+				int *nr_isolate, int *nr_cma, int *nr_others)
+{
+	int nr = 1 << order;
+	int migratetype = get_onbuddy_migratetype(page);
+
+	if (is_migrate_isolate(migratetype))
+		*nr_isolate += nr;
+	else if (is_migrate_cma(migratetype))
+		*nr_cma += nr;
+	else
+		*nr_others += nr;
+}
+
+static void fixup_freepage_nr(struct zone *zone, int migratetype,
+				int nr_isolate, int nr_cma, int nr_others)
+{
+	int nr_free = nr_cma + nr_others;
+
+	if (is_migrate_isolate(migratetype)) {
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -nr_free);
+		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, -nr_cma);
+	} else {
+		__mod_zone_freepage_state(zone, nr_isolate, migratetype);
+	}
+}
+
 static inline void set_page_order(struct page *page, unsigned int order,
 							int migratetype)
 {
@@ -619,6 +646,7 @@ static inline void __free_one_page(struct page *page,
 		buddy = page + (buddy_idx - page_idx);
 		if (!page_is_buddy(page, buddy, order))
 			break;
+
 		/*
 		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
 		 * merge with it and move up one order.
@@ -1062,7 +1090,7 @@ int move_freepages(struct zone *zone,
 {
 	struct page *page;
 	unsigned long order;
-	int pages_moved = 0;
+	int nr_pages = 0, nr_isolate = 0, nr_cma = 0, nr_others = 0;
 
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -1090,14 +1118,17 @@ int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
+		count_freepage_nr(page, order,
+			&nr_isolate, &nr_cma, &nr_others);
 		list_move(&page->lru,
 			  &zone->free_area[order].free_list[migratetype]);
 		set_onbuddy_migratetype(page, migratetype);
 		page += 1 << order;
-		pages_moved += 1 << order;
+		nr_pages += 1 << order;
 	}
 
-	return pages_moved;
+	fixup_freepage_nr(zone, migratetype, nr_isolate, nr_cma, nr_others);
+	return nr_pages;
 }
 
 int move_freepages_block(struct zone *zone, struct page *page,
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 6e4e86b..62676de 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -56,14 +56,9 @@ int set_migratetype_isolate(struct page *page, bool skip_hwpoisoned_pages)
 
 out:
 	if (!ret) {
-		unsigned long nr_pages;
-		int migratetype = get_pageblock_migratetype(page);
-
 		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 		zone->nr_isolate_pageblock++;
-		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
-
-		__mod_zone_freepage_state(zone, -nr_pages, migratetype);
+		move_freepages_block(zone, page, MIGRATE_ISOLATE);
 	}
 
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -75,14 +70,13 @@ out:
 void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 {
 	struct zone *zone;
-	unsigned long flags, nr_pages;
+	unsigned long flags;
 
 	zone = page_zone(page);
 	spin_lock_irqsave(&zone->lock, flags);
 	if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
 		goto out;
-	nr_pages = move_freepages_block(zone, page, migratetype);
-	__mod_zone_freepage_state(zone, nr_pages, migratetype);
+	move_freepages_block(zone, page, migratetype);
 	set_pageblock_migratetype(page, migratetype);
 	zone->nr_isolate_pageblock--;
 out:
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 10/10] mm/page_alloc: Stop merging pages on non-isolate and isolate buddy list
  2014-07-04  7:57 ` Joonsoo Kim
@ 2014-07-04  7:57   ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-04  7:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel,
	Joonsoo Kim

If we merge a page on a non-isolate buddy list with one on an isolate
buddy list, we should fix up the freepage count, because we don't
regard pages on an isolate buddy list as freepages. But that would
impose overhead on __free_one_page(), which is a core function of the
page free path, so this overhead looks undesirable to me. Instead, we
can stop merging in this case. With this approach, we can skip the
freepage count fixup at low overhead.

The side effect of this change is that buddies equal to or larger than
pageblock order aren't merged if one of them is on an isolate
pageblock. But I think this is no problem, because isolation means we
will use the pages on the isolate pageblock specially, so they would
be split again soon in any case.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/page_alloc.c |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80c9bd8..da4da66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -648,6 +648,24 @@ static inline void __free_one_page(struct page *page,
 			break;
 
 		/*
+		 * Stop merging between page on non-isolate buddy list and
+		 * isolate buddy list, respectively. This case is only possible
+		 * for pages equal or larger than pageblock_order, because
+		 * pageblock migratetype can be changed in this granularity.
+		 */
+		if (unlikely(order >= pageblock_order &&
+			has_isolate_pageblock(zone))) {
+			int buddy_mt = get_onbuddy_migratetype(buddy);
+
+			if (migratetype != buddy_mt) {
+				if (is_migrate_isolate(migratetype))
+					break;
+				else if (is_migrate_isolate(buddy_mt))
+					break;
+			}
+		}
+
+		/*
 		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
 		 * merge with it and move up one order.
 		 */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH 01/10] mm/page_alloc: remove unlikely macro on free_one_page()
  2014-07-04  7:57   ` Joonsoo Kim
@ 2014-07-04 12:03     ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-04 12:03 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> Isolation is really rare case so !is_migrate_isolate() is
> likely case. Remove unlikely macro.

Good catch. Why not replace it with likely then? Any difference in the assembly?

> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  mm/page_alloc.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8dac0f0..0d4cf7a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -735,7 +735,7 @@ static void free_one_page(struct zone *zone,
>  	zone->pages_scanned = 0;
>  
>  	__free_one_page(page, pfn, zone, order, migratetype);
> -	if (unlikely(!is_migrate_isolate(migratetype)))
> +	if (!is_migrate_isolate(migratetype))
>  		__mod_zone_freepage_state(zone, 1 << order, migratetype);
>  	spin_unlock(&zone->lock);
>  }
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 01/10] mm/page_alloc: remove unlikely macro on free_one_page()
  2014-07-04  7:57   ` Joonsoo Kim
@ 2014-07-04 12:52     ` Michal Nazarewicz
  -1 siblings, 0 replies; 79+ messages in thread
From: Michal Nazarewicz @ 2014-07-04 12:52 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Laura Abbott, Heesub Shin, Aneesh Kumar K.V, Ritesh Harjani,
	t.stanislaws, Gioh Kim, linux-mm, linux-kernel, Joonsoo Kim

On Fri, Jul 04 2014, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> Isolation is really rare case so !is_migrate_isolate() is
> likely case. Remove unlikely macro.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>



> ---
>  mm/page_alloc.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8dac0f0..0d4cf7a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -735,7 +735,7 @@ static void free_one_page(struct zone *zone,
>  	zone->pages_scanned = 0;
>  
>  	__free_one_page(page, pfn, zone, order, migratetype);
> -	if (unlikely(!is_migrate_isolate(migratetype)))
> +	if (!is_migrate_isolate(migratetype))
>  		__mod_zone_freepage_state(zone, 1 << order, migratetype);
>  	spin_unlock(&zone->lock);
>  }
> -- 
> 1.7.9.5
>

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +--<mpn@google.com>--<xmpp:mina86@jabber.org>--ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 01/10] mm/page_alloc: remove unlikely macro on free_one_page()
  2014-07-04  7:57   ` Joonsoo Kim
@ 2014-07-04 12:53     ` Michal Nazarewicz
  -1 siblings, 0 replies; 79+ messages in thread
From: Michal Nazarewicz @ 2014-07-04 12:53 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Laura Abbott, Heesub Shin, Aneesh Kumar K.V, Ritesh Harjani,
	t.stanislaws, Gioh Kim, linux-mm, linux-kernel, Joonsoo Kim

On Fri, Jul 04 2014, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> Isolation is really rare case so !is_migrate_isolate() is
> likely case. Remove unlikely macro.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Acked-by: Michal Nazarewicz <mina86@mina86.com>

> ---
>  mm/page_alloc.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8dac0f0..0d4cf7a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -735,7 +735,7 @@ static void free_one_page(struct zone *zone,
>  	zone->pages_scanned = 0;
>  
>  	__free_one_page(page, pfn, zone, order, migratetype);
> -	if (unlikely(!is_migrate_isolate(migratetype)))
> +	if (!is_migrate_isolate(migratetype))
>  		__mod_zone_freepage_state(zone, 1 << order, migratetype);
>  	spin_unlock(&zone->lock);
>  }

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +--<mpn@google.com>--<xmpp:mina86@jabber.org>--ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-04  7:57 ` Joonsoo Kim
@ 2014-07-04 15:33   ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-04 15:33 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> Hello,

Hi Joonsoo,

please CC me on further updates; this is relevant to me.

> This patchset aims at fixing problems due to memory isolation found by
> testing my patchset [1].
> 
> These are really subtle problems so I can be wrong. If you find what I am
> missing, please let me know.
> 
> Before describing bugs itself, I first explain definition of freepage.
> 
> 1. pages on buddy list are counted as freepage.
> 2. pages on isolate migratetype buddy list are *not* counted as freepage.

I think the second point is causing us a lot of trouble. And I wonder if it's really
justified! We already have some is_migrate_isolate() checks in the fast path and now you
would add more, mostly just to keep this accounting correct.

So the question is, does it have to be correct? And (admittedly not after a completely
exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)

Well, I of course don't mean that the freepage accounts could go completely random, but
what if we allowed them to drift a bit, limiting both the maximum error and the timeframe
where errors are possible? After all, watermark checking is already racy, so I don't think
it would hurt that much.

Now if we look at what both CMA and memory hot-remove does is:

1. Mark a MAX_ORDER-aligned bunch of pageblocks as MIGRATE_ISOLATE through
start_isolate_page_range(). As part of this, all free pages in that area are
moved onto the isolate freelist through move_freepages_block().

2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
caches.

3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
undo pageblock isolation if not.

So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
allocator, which would therefore account free pages on the isolate freelist as normal free
pages.
The accounting of isolated pages would instead be done only at the top level of CMA/hot
remove, in the three steps outlined above, which would be modified as follows:

1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
as usual. Calculate X as the number of pages that move_freepages_block() has moved.
Subtract X from freepages (this is the same as is done now), but also *remember the
value of X*.

2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
the isolate freelist, or not. We don't care; they will be accounted as freepages either way.
This is where some inaccuracy in accounted freepages would build up.

3. If the test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
that we have an isolated range of N pages that nobody can steal now, as everything is on
the isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
X, then N - X). So the accounting matches reality.

If we have to undo, we undo the isolation, and as part of this we use
move_freepages_block() to move pages from the isolate freelist back to the normal ones. But
we don't care how many pages were moved. We simply add the remembered value of X to the
number of freepages, undoing the change from step 1. Again, the accounting matches reality.


The final point is that if we do this per MAX_ORDER block, the error in accounting can
never be larger than 4MB and will be visible only while a single MAX_ORDER block is being
handled.

As a possible improvement, we can assume during phase 2 that every page freed by migration
will end up correctly on the isolate free list. So we create M free pages by migration, and
subtract M from the freepage account. Then in phase 3 we either subtract (N - X - M), or add
X + M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
pages on pcplists and the possible races it will be less.) I think with this improvement,
any error would be negligible.
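
A rough sketch of the above per MAX_ORDER block, just to make the bookkeeping concrete;
set_pageblocks_isolate(), migrate_away_and_drain() and undo_pageblocks_isolate() are
invented names for this sketch, not existing kernel functions:

	long N = MAX_ORDER_NR_PAGES;	/* pages we want to isolate */
	long X, M;

	/* 1. isolate; X = free pages moved onto the isolate freelist */
	set_pageblocks_isolate(zone, page);
	X = move_freepages_block(zone, page, MIGRATE_ISOLATE);
	__mod_zone_page_state(zone, NR_FREE_PAGES, -X);

	/* 2. migrate non-free pages away, drain pcplists and lru_add
	 * caches; M = pages freed by migration, assumed to land on
	 * the isolate freelist */
	M = migrate_away_and_drain(zone, page);
	__mod_zone_page_state(zone, NR_FREE_PAGES, -M);

	if (!test_pages_isolated(start_pfn, end_pfn, false)) {
		/* 3. success (returns 0): charge the remainder,
		 * so N pages are subtracted in total */
		__mod_zone_page_state(zone, NR_FREE_PAGES, -(N - X - M));
	} else {
		/* undo: move everything back and restore what we took */
		undo_pageblocks_isolate(zone, page);
		__mod_zone_page_state(zone, NR_FREE_PAGES, X + M);
	}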

Thoughts?

> 3. pages on cma buddy list are counted as CMA freepage, too.
> 4. pages for guard are *not* counted as freepage.
> 
> Now, I describe problems and related patch.
> 
> 1. Patch 2: If guard page are cleared and merged into isolate buddy list,
> we should not add freepage count.
> 
> 2. Patch 3: When the page return back from pcp to buddy, we should
> account it to freepage counter. In this case, we should check the
> pageblock migratetype of the page and should insert the page into
> appropriate buddy list. Although we checked it in current code, we
> didn't insert the page into appropriate buddy list so that freepage
> counting can be wrong.
> 
> 3. Patch 4: There is race condition so that some freepages could be
> on isolate buddy list. If so, we can't use this page until next isolation
> attempt on this pageblock.
> 
> 4. Patch 5: There is race condition that page on isolate pageblock
> can go into non-isolate buddy list. If so, buddy allocator would
> merge pages on non-isolate buddy list and isolate buddy list, respectively,
> and freepage count will be wrong.
> 
> 5. Patch 9: move_freepages(_block) returns *not* number of moved pages.
> Instead, it returns number of pages linked in that migratetype buddy list.
> So accouting with this return value makes freepage count wrong.
> 
> 6. Patch 10: buddy allocator would merge pages on non-isolate buddy list
> and isolate buddy list, respectively. This leads to freepage counting
> problem so fix it by stopping merging in this case.
> 
> Without patchset [1], above problem doesn't happens on my CMA allocation
> test, because CMA reserved pages aren't used at all. So there is no
> chance for above race.
> 
> With patchset [1], I did simple CMA allocation test and get below result.
> 
> - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
> - run kernel build (make -j16) on background
> - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
> - Result: more than 5000 freepage count are missed
> 
> With patchset [1] and this patchset, I found that no freepage count are
> missed so that I conclude that problems are solved.
> 
> These problems can be possible on memory hot remove users, although
> I didn't check it further.
> 
> Other patches are either for the base to fix these problems or for
> simple clean-up. Please see individual patches for more information.
> 
> This patchset is based on linux-next-20140703.
> 
> Thanks.
> 
> [1]: Aggressively allocate the pages on cma reserved memory
>      https://lkml.org/lkml/2014/5/30/291
> 
> 
> Joonsoo Kim (10):
>   mm/page_alloc: remove unlikely macro on free_one_page()
>   mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
>   mm/page_alloc: handle page on pcp correctly if it's pageblock is
>     isolated
>   mm/page_alloc: carefully free the page on isolate pageblock
>   mm/page_alloc: optimize and unify pageblock migratetype check in free
>     path
>   mm/page_alloc: separate freepage migratetype interface
>   mm/page_alloc: store migratetype of the buddy list into freepage
>     correctly
>   mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
>   mm/page_alloc: fix possible wrongly calculated freepage counter
>   mm/page_alloc: Stop merging pages on non-isolate and isolate buddy
>     list
> 
>  include/linux/mm.h             |   30 +++++++--
>  include/linux/mmzone.h         |    5 ++
>  include/linux/page-isolation.h |    8 +++
>  mm/page_alloc.c                |  138 +++++++++++++++++++++++++++++-----------
>  mm/page_isolation.c            |   18 ++----
>  5 files changed, 147 insertions(+), 52 deletions(-)
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-04 15:33   ` Vlastimil Babka
@ 2014-07-07  4:49     ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-07  4:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, Lisa Du,
	linux-kernel

Ccing Lisa, because there was a bug report last Saturday that may be
related to this topic.

http://www.spinics.net/lists/linux-mm/msg75741.html

On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> > Hello,
> 
> Hi Joonsoo,
> 
> please CC me on further updates, this is relevant to me.

Hello, Vlastimil.

Sorry for missing you. :)

> 
> > This patchset aims at fixing problems due to memory isolation found by
> > testing my patchset [1].
> > 
> > These are really subtle problems so I can be wrong. If you find what I am
> > missing, please let me know.
> > 
> > Before describing bugs itself, I first explain definition of freepage.
> > 
> > 1. pages on buddy list are counted as freepage.
> > 2. pages on isolate migratetype buddy list are *not* counted as freepage.
> 
> I think the second point is causing us a lot of trouble. And I wonder if it's really
> justified! We already have some is_migrate_isolate() checks in the fast path and now you
> would add more, mostly just to keep this accounting correct.

It's not just for keeping the accounting correct. There is another
purpose for is_migrate_isolate(): it forces freed pages to go onto the
isolate buddy list. Without it, freed pages would go onto some other
buddy list and be used again soon. So memory isolation can't work well
without the is_migrate_isolate() checks, and the success rate could
decrease.

And I just added three more is_migrate_isolate() checks in the fast
path, but two of them are in the same *unlikely* branch and I can
easily remove another one, so I don't think it is much of a problem.
(They even compile away to no-ops if MEMORY_ISOLATION is disabled.)
Moreover, I removed one unconditional get_pageblock_migratetype() in
free_pcppages_bulk(), so, from a performance point of view, the
freepath would be improved.
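
To show what I mean, here is a minimal sketch of the check in the
free path (not the exact mainline code, just the shape of it):

static void free_one_page(struct zone *zone, struct page *page,
			unsigned long pfn, unsigned int order,
			int migratetype)
{
	spin_lock(&zone->lock);
	if (unlikely(is_migrate_isolate(migratetype))) {
		/* go to the isolate buddy list, skip freepage accounting */
		__free_one_page(page, pfn, zone, order, MIGRATE_ISOLATE);
	} else {
		__free_one_page(page, pfn, zone, order, migratetype);
		__mod_zone_freepage_state(zone, 1 << order, migratetype);
	}
	spin_unlock(&zone->lock);
}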

> 
> So the question is, does it have to be correct? And (admittedly not after a completely
> exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
> 
> Well I of course don't mean that the freepage accounts could go random completely, but
> what if we allowed them to drift a bit, limiting both the max error and the timeframe
> where errors are possible? After all, watermarks checking is already racy so I don't think
> it would hurt that much.

I understand your suggestion. I once thought the same way, but gave up
on that idea. Watermark checking is already racy, but it's the *only*
protection we have against over-allocating memory. After passing that
check, there is no means left to prevent us from allocating memory, so
it should be as accurate as possible. If we allow someone to get
memory without considering memory isolation, free memory could drop
really low and the system could occasionally break.
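
To illustrate with made-up numbers why the counter must be accurate,
a simplified model of the watermark check (the real zone_watermark_ok()
also subtracts lowmem reserves and per-order amounts):

/* hypothetical numbers: 200 isolated pages wrongly counted as free */
static bool can_allocate_example(void)
{
	long counted_free = 1100;	/* 900 usable + 200 isolated */
	long min_wmark = 1024;

	/*
	 * The check passes, but only 900 pages are really usable, so
	 * the allocation silently takes us below the min watermark.
	 */
	return counted_free > min_wmark;
}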

> 
> Now if we look at what both CMA and memory hot-remove does is:
> 
> 1. Mark a MAX_ORDER-aligned bunch of pageblocks as MIGRATE_ISOLATE through
> start_isolate_page_range(). As part of this, all free pages in that area are
> moved on the isolate freelist through move_freepages_block().
> 
> 2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
> caches.
> 
> 3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
> undo pageblock isolation if not.
> 
> So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
> allocator, which would therefore account free pages on the isolate freelist as normal free
> pages.
> The accounting of isolated pages would be instead done only on the top level of CMA/hot
> remove in the three steps outlined above, which would be modified as follows:
> 
> 1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
> as usual. Calculate X as the number of pages that move_freepages_block() has moved.
> Subtract X from freepages (this is the same as it is done now), but also *remember the
> value of X*
> 
> 2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
> isolate freelist, or not. We don't care, they will be accounted as freepages either way.
> This is where some inaccuracy in accounted freepages would build up.
> 
> 3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
> that we have an isolated range of N pages that nobody can steal now as everything is on
> isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
> X, then N-X). So the accounting matches reality.
> 
> If we have to undo, we undo the isolation and as part of this, we use
> move_freepages_block() to move pages from isolate freelist to the normal ones. But we
> don't care how many pages were moved. We simply add the remembered value of X to the
> number of freepages, undoing the change from step 1. Again, the accounting matches reality.
> 
> 
> The final point is that if we do this per MAX_ORDER blocks, the error in accounting cannot
> be ever larger than 4MB and will be visible only during the time a single MAX_ORDER block is
> handled.

The 4MB error in accounting looks serious to me. min_free_kbytes is
4MB on a 1GB system, so this 4MB error could break everything on such
systems. Moreover, some ARCHs have a larger pageblock_order than
MAX_ORDER; in that case the error would be even larger than 4MB.
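
For reference, a userspace sketch of how min_free_kbytes is sized
(init_per_zone_wmark_min() uses int_sqrt(lowmem_kbytes * 16); the
clamp bounds below are from my memory of the code):

#include <math.h>

long min_free_kbytes_for(long lowmem_kbytes)
{
	long min = (long)sqrt((double)lowmem_kbytes * 16);

	if (min < 128)		/* floor */
		min = 128;
	if (min > 65536)	/* ceiling */
		min = 65536;
	return min;
}

/*
 * 1 GB of lowmem: sqrt(1048576 * 16) = 4096 kB = 4 MB, i.e. a 4 MB
 * accounting error can swallow the entire min watermark.
 */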

In addition, I have a plan to extend CMA to work in parallel. That
means there could be several parallel memory isolation users rather
than just one at a time, so we cannot easily bound the error to any
particular degree.

> 
> As a possible improvement, we can assume during phase 2 that every page freed by migration
> will end up correctly on isolate free list. So we create M free pages by migration, and
> subtract M from freepage account. Then in phase 3 we either subtract (N - X - M), or add X
> + M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
> pages on pcplists and the possible races it will be less). I think with this improvement,
> any error would be negligible.
> 
> Thoughts?

Thanks for the suggestion. :)
It is really a good topic to think about deeply.

For now, I'd like to fix these problems without side-effects, rather
than with the approach you suggested. Your suggestion changes the
meaning of freepage so that isolated pages are included in
nr_freepage, and there could be a regression in the success rate of
memory hotplug and CMA. Possibly it is the way we have to go, but,
IMHO, it isn't the time to go there. Before going that way, we should
fix the current implementation first so that the fixes can be
backported to old kernels if someone needs them. Interestingly, last
Saturday, Lisa Du reported CMA accounting bugs.

http://www.spinics.net/lists/linux-mm/msg75741.html

I haven't looked at it in detail, but maybe it is related to these
problems, and we should fix it without side-effects.

So, in conclusion, I think your suggestion is beyond the scope of
this patchset for the following two reasons.

1. I'd like to fix these problems without side-effects (a possible
regression in the success rate of memory hotplug and CMA, and a change
in the meaning of nr_freepage) so that the fixes can be backported.
2. nr_freepage that does not consider memory isolation is somewhat
dangerous and not suitable for some systems.

If you have any objection, please let me know. However, I will be on
vacation for a week, so I can't answer further comments until I am
back; I will reply to them next week. :)

Thanks.

> > 3. pages on cma buddy list are counted as CMA freepage, too.
> > 4. pages for guard are *not* counted as freepage.
> > 
> > Now, I describe problems and related patch.
> > 
> > 1. Patch 2: If guard page are cleared and merged into isolate buddy list,
> > we should not add freepage count.
> > 
> > 2. Patch 3: When the page return back from pcp to buddy, we should
> > account it to freepage counter. In this case, we should check the
> > pageblock migratetype of the page and should insert the page into
> > appropriate buddy list. Although we checked it in current code, we
> > didn't insert the page into appropriate buddy list so that freepage
> > counting can be wrong.
> > 
> > 3. Patch 4: There is race condition so that some freepages could be
> > on isolate buddy list. If so, we can't use this page until next isolation
> > attempt on this pageblock.
> > 
> > 4. Patch 5: There is race condition that page on isolate pageblock
> > can go into non-isolate buddy list. If so, buddy allocator would
> > merge pages on non-isolate buddy list and isolate buddy list, respectively,
> > and freepage count will be wrong.
> > 
> > 5. Patch 9: move_freepages(_block) returns *not* number of moved pages.
> > Instead, it returns number of pages linked in that migratetype buddy list.
> > So accouting with this return value makes freepage count wrong.
> > 
> > 6. Patch 10: buddy allocator would merge pages on non-isolate buddy list
> > and isolate buddy list, respectively. This leads to freepage counting
> > problem so fix it by stopping merging in this case.
> > 
> > Without patchset [1], above problem doesn't happens on my CMA allocation
> > test, because CMA reserved pages aren't used at all. So there is no
> > chance for above race.
> > 
> > With patchset [1], I did simple CMA allocation test and get below result.
> > 
> > - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
> > - run kernel build (make -j16) on background
> > - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
> > - Result: more than 5000 freepage count are missed
> > 
> > With patchset [1] and this patchset, I found that no freepage count are
> > missed so that I conclude that problems are solved.
> > 
> > These problems can be possible on memory hot remove users, although
> > I didn't check it further.
> > 
> > Other patches are either for the base to fix these problems or for
> > simple clean-up. Please see individual patches for more information.
> > 
> > This patchset is based on linux-next-20140703.
> > 
> > Thanks.
> > 
> > [1]: Aggressively allocate the pages on cma reserved memory
> >      https://lkml.org/lkml/2014/5/30/291
> > 
> > 
> > Joonsoo Kim (10):
> >   mm/page_alloc: remove unlikely macro on free_one_page()
> >   mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
> >   mm/page_alloc: handle page on pcp correctly if it's pageblock is
> >     isolated
> >   mm/page_alloc: carefully free the page on isolate pageblock
> >   mm/page_alloc: optimize and unify pageblock migratetype check in free
> >     path
> >   mm/page_alloc: separate freepage migratetype interface
> >   mm/page_alloc: store migratetype of the buddy list into freepage
> >     correctly
> >   mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
> >   mm/page_alloc: fix possible wrongly calculated freepage counter
> >   mm/page_alloc: Stop merging pages on non-isolate and isolate buddy
> >     list
> > 
> >  include/linux/mm.h             |   30 +++++++--
> >  include/linux/mmzone.h         |    5 ++
> >  include/linux/page-isolation.h |    8 +++
> >  mm/page_alloc.c                |  138 +++++++++++++++++++++++++++++-----------
> >  mm/page_isolation.c            |   18 ++----
> >  5 files changed, 147 insertions(+), 52 deletions(-)
> > 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 01/10] mm/page_alloc: remove unlikely macro on free_one_page()
  2014-07-04 12:03     ` Vlastimil Babka
@ 2014-07-07  4:58       ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-07  4:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel

On Fri, Jul 04, 2014 at 02:03:10PM +0200, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> > Isolation is a really rare case, so !is_migrate_isolate() is the
> > likely case. Remove the unlikely macro.
> 
> Good catch. Why not replace it with likely then? Any difference in the assembly?
> 

I didn't investigate it. I will check it in the next spin.
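
For reference, the change in question is a one-liner, and the open
question is whether the annotated form changes the assembly (sketch
of the relevant lines in free_one_page()):

	/* after patch 1: no annotation */
	if (!is_migrate_isolate(migratetype))
		__mod_zone_freepage_state(zone, 1 << order, migratetype);

	/* your suggestion: mark the common case explicitly */
	if (likely(!is_migrate_isolate(migratetype)))
		__mod_zone_freepage_state(zone, 1 << order, migratetype);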

Thanks.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 03/10] mm/page_alloc: handle page on pcp correctly if it's pageblock is isolated
  2014-07-04  7:57   ` Joonsoo Kim
@ 2014-07-07  8:25     ` Gioh Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Gioh Kim @ 2014-07-07  8:25 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, linux-mm, linux-kernel



On 2014-07-04 4:57 PM, Joonsoo Kim wrote:
> If the pageblock of a page on the pcp is isolated now, we should free
> the page to the isolate buddy list to prevent future allocation from
> it. But the current code doesn't do this.

I think it is strange that the pcp can hold an isolated page.
I remember that the purpose of the pcp is to hold frequently used
pages (hot for the cache), but an isolated page is not frequently
allocated and freed.

(The above is my guess, and it can be wrong. I know the CMA and
hot-plug features use isolated pages, so I guess isolated pages are
not meant for frequently allocated and freed memory.)

Anyway, if an isolated page is not good for the pcp, what about
preventing isolated pages from entering the pcp in the first place?
I think fixing free_hot_cold_page() can be one way to do that.
free_hot_cold_page() checks the migratetype and inserts CMA pages
into the pcp. Am I correct?

If I am correct, I think inserting CMA pages into the pcp is harmful
for both CMA and the pcp. If a CMA page is on the pcp it will be used
frequently, which prevents making contiguous memory. What about
keeping CMA pages off the pcp, like the following?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8c9eeec..1cbb816 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1413,13 +1413,17 @@ void free_hot_cold_page(struct page *page, bool cold)
         * areas back if necessary. Otherwise, we may have to free
         * excessively into the page allocator
         */
-       if (migratetype >= MIGRATE_PCPTYPES) {
-               if (unlikely(is_migrate_isolate(migratetype))) {
-                       free_one_page(zone, page, pfn, 0, migratetype);
-                       goto out;
-               }
-               migratetype = MIGRATE_MOVABLE;
-       }
+       if (migratetype >= MIGRATE_PCPTYPES) {
+               /* keep CMA and isolated pages off the pcp lists */
+               if (is_migrate_cma(migratetype) ||
+                   is_migrate_isolate(migratetype)) {
+                       /* free straight to buddy instead of leaking the page */
+                       free_one_page(zone, page, pfn, 0, migratetype);
+                       goto out;
+               }
+               /* MIGRATE_RESERVE pages still use the movable pcp list */
+               migratetype = MIGRATE_MOVABLE;
+       }

        pcp = &this_cpu_ptr(zone->pageset)->pcp;
        if (!cold)


> 
> Moreover, there is a freepage counting problem in the current code.
> Although the pageblock of a page on the pcp is isolated now, the page
> could go to a normal buddy list, because get_onpcp_migratetype() will
> return a non-isolate migratetype.

I cannot find get_onpcp_migratetype() in v3.16.0-rc3, and I also
cannot google it. Is it a typo?

> In this case, we should either add to the freepage count or change the
> migratetype to MIGRATE_ISOLATE, but the current code does neither.
> 
> This patch fixes these two problems by handling the pageblock
> migratetype before calling __free_one_page(). And, if we find the page
> on an isolated pageblock, we change the migratetype to MIGRATE_ISOLATE
> to prevent future allocation of this page and the freepage counting
> problem.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>   mm/page_alloc.c |   14 ++++++++------
>   1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index aeb51d1..99c05f7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -719,15 +719,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>   			page = list_entry(list->prev, struct page, lru);
>   			/* must delete as __free_one_page list manipulates */
>   			list_del(&page->lru);
> -			mt = get_freepage_migratetype(page);
> +
> +			if (unlikely(is_migrate_isolate_page(page))) {
> +				mt = MIGRATE_ISOLATE;
> +			} else {
> +				mt = get_freepage_migratetype(page);
> +				__mod_zone_freepage_state(zone, 1, mt);
> +			}
> +
>   			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
>   			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
>   			trace_mm_page_pcpu_drain(page, 0, mt);
> -			if (likely(!is_migrate_isolate_page(page))) {
> -				__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
> -				if (is_migrate_cma(mt))
> -					__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
> -			}
>   		} while (--to_free && --batch_free && !list_empty(list));
>   	}
>   	spin_unlock(&zone->lock);
> 

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-07  4:49     ` Joonsoo Kim
@ 2014-07-07 14:33       ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-07 14:33 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, Lisa Du,
	linux-kernel

On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
> Ccing Lisa, because there was a bug report last Saturday that may be
> related to this topic.
>
> http://www.spinics.net/lists/linux-mm/msg75741.html
>
> On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
>> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>>> Hello,
>>
>> Hi Joonsoo,
>>
>> please CC me on further updates, this is relevant to me.
>
> Hello, Vlastimil.
>
> Sorry for missing you. :)
>
>>
>>> This patchset aims at fixing problems due to memory isolation found by
>>> testing my patchset [1].
>>>
>>> These are really subtle problems so I can be wrong. If you find what I am
>>> missing, please let me know.
>>>
>>> Before describing bugs itself, I first explain definition of freepage.
>>>
>>> 1. pages on buddy list are counted as freepage.
>>> 2. pages on isolate migratetype buddy list are *not* counted as freepage.
>>
>> I think the second point is causing us a lot of trouble. And I wonder if it's really
>> justified! We already have some is_migrate_isolate() checks in the fast path and now you
>> would add more, mostly just to keep this accounting correct.
>
> It's not just for keeping the accounting correct. There is another
> purpose for is_migrate_isolate(): it forces freed pages to go onto the
> isolate buddy list. Without it, freed pages would go onto some other
> buddy list and be used again soon. So memory isolation can't work well
> without the is_migrate_isolate() checks, and the success rate could
> decrease.

Well, I think that could also be solved by doing an lru/pcplists drain 
right after marking the pageblock as MIGRATE_ISOLATE. After that 
moment, anything newly put on the pcplists should observe the new 
migratetype and go to the correct pcplist. As putting stuff on the 
pcplists is done with interrupts disabled, and draining is done by 
IPI, I think it should work correctly if we move the migratetype 
determination under the irq-disabled part of free_hot_cold_page().
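
Roughly something like this (untested sketch, just to show the
ordering; the isolation side first):

set_pageblock_migratetype(page, MIGRATE_ISOLATE);
move_freepages_block(zone, page, MIGRATE_ISOLATE);
/*
 * IPI-based drain: after this, no pcplist holds a page from the
 * isolated block, and every later free will observe the new
 * migratetype ...
 */
drain_all_pages();

/*
 * ... provided the free side reads the migratetype only after
 * interrupts are off, so it cannot race with the drain above:
 */
void free_hot_cold_page(struct page *page, bool cold)
{
	unsigned long flags;
	int migratetype;

	local_irq_save(flags);
	/* moved under the irq-disabled section on purpose */
	migratetype = get_pageblock_migratetype(page);
	/* ... put on a pcplist or free_one_page() as before ... */
	local_irq_restore(flags);
}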

> And I just added three more is_migrate_isolate() checks in the fast
> path, but two of them are in the same *unlikely* branch and I can
> easily remove another one, so I don't think it is much of a problem.
> (They even compile away to no-ops if MEMORY_ISOLATION is disabled.)
> Moreover, I removed one unconditional get_pageblock_migratetype() in
> free_pcppages_bulk(), so, from a performance point of view, the
> freepath would be improved.

I haven't checked the individual patches yet, but I'll try.

>>
>> So the question is, does it have to be correct? And (admittedly not after a completely
>> exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
>>
>> Well I of course don't mean that the freepage accounts could go random completely, but
>> what if we allowed them to drift a bit, limiting both the max error and the timeframe
>> where errors are possible? After all, watermarks checking is already racy so I don't think
>> it would hurt that much.
>
> I understand your suggestion. I once thought the same way, but gave up
> on that idea. Watermark checking is already racy, but it's the *only*
> protection we have against over-allocating memory. After passing that
> check, there is no means left to prevent us from allocating memory, so
> it should be as accurate as possible. If we allow someone to get
> memory without considering memory isolation, free memory could drop
> really low and the system could occasionally break.

It would definitely require a more thorough review, but I think that 
with the pcplists draining changes outlined above, there shouldn't be 
any more inaccuracy.
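
To make the bookkeeping of my proposal concrete, a worked example
with made-up numbers (4 KB pages):

- Target range: N = 2048 pages (8 MB).
- Step 1: move_freepages_block() moves X = 1500 already-free pages to
  the isolate freelist; subtract 1500 from freepages and remember X.
- Step 2: migration frees the remaining 548 pages; they stay counted
  as freepages wherever they land, so the count runs temporarily high.
- Step 3, success: subtract N - X = 548, so in total we subtracted
  exactly N = 2048 pages and the accounting matches reality.
- Step 3, undo: add the remembered X = 1500 back, restoring the
  accounting from before step 1.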

>>
>> Now if we look at what both CMA and memory hot-remove does is:
>>
>> 1. Mark a MAX_ORDER-aligned bunch of pageblocks as MIGRATE_ISOLATE through
>> start_isolate_page_range(). As part of this, all free pages in that area are
>> moved on the isolate freelist through move_freepages_block().
>>
>> 2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
>> caches.
>>
>> 3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
>> undo pageblock isolation if not.
>>
>> So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
>> allocator, which would therefore account free pages on the isolate freelist as normal free
>> pages.
>> The accounting of isolated pages would be instead done only on the top level of CMA/hot
>> remove in the three steps outlined above, which would be modified as follows:
>>
>> 1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
>> as usual. Calculate X as the number of pages that move_freepages_block() has moved.
>> Subtract X from freepages (this is the same as it is done now), but also *remember the
>> value of X*
>>
>> 2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
>> isolate freelist, or not. We don't care, they will be accounted as freepages either way.
>> This is where some inaccuracy in accounted freepages would build up.
>>
>> 3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
>> that we have an isolated range of N pages that nobody can steal now as everything is on
>> isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
>> X, then N-X). So the accounting matches reality.
>>
>> If we have to undo, we undo the isolation and as part of this, we use
>> move_freepages_block() to move pages from isolate freelist to the normal ones. But we
>> don't care how many pages were moved. We simply add the remembered value of X to the
>> number of freepages, undoing the change from step 1. Again, the accounting matches reality.
>>
>>
>> The final point is that if we do this per MAX_ORDER blocks, the error in accounting cannot
>> be ever larger than 4MB and will be visible only during the time a single MAX_ORDER block is
>> handled.
>
> The 4MB error in accounting looks serious to me. min_free_kbytes is
> 4MB on a 1GB system, so this 4MB error could break everything on such
> systems. Moreover, some ARCHs have a larger pageblock_order than
> MAX_ORDER; in that case the error would be even larger than 4MB.
>
> In addition, I have a plan to extend CMA to work in parallel. That
> means there could be several parallel memory isolation users rather
> than just one at a time, so we cannot easily bound the error to any
> particular degree.

OK. I had thought that the error would be much smaller in practice, 
but, due to how buddy merging works, it would be enough for just the 
very last freed page in the MAX_ORDER block to be misplaced: that 
could trigger a cascading merge that ends with a single 
MAX_ORDER-sized page becoming misplaced. So let's probably forget this 
approach.

>> As a possible improvement, we can assume during phase 2 that every page freed by migration
>> will end up correctly on isolate free list. So we create M free pages by migration, and
>> subtract M from freepage account. Then in phase 3 we either subtract (N - X - M), or add X
>> + M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
>> pages on pcplists and the possible races it will be less). I think with this improvement,
>> any error would be negligible.
>>
>> Thoughts?
>
> Thanks for the suggestion. :)
> It is really a good topic to think about deeply.
>
> For now, I'd like to fix these problems without side-effects, rather
> than with the approach you suggested. Your suggestion changes the
> meaning of freepage so that isolated pages are included in
> nr_freepage, and there could be a regression in the success rate of
> memory hotplug and CMA. Possibly it is the way we have to go, but,
> IMHO, it isn't the time to go there. Before going that way, we should
> fix the current implementation first so that the fixes can be
> backported to old kernels if someone needs them.

Adding the extra pcplist drains and reordering 
get_pageblock_migratetype() under disabled IRQs should be small enough 
to be backportable, and it should not bring any regressions to success 
rates. There will be some cost for the extra drains, but it is paid 
only when memory isolation actually happens. Extending the scope of 
disabled IRQs would affect the fast path somewhat, but I guess less 
than the extra checks would?

> Interestingly, last Saturday, Lisa Du reported CMA accounting bugs.
>
> http://www.spinics.net/lists/linux-mm/msg75741.html
>
> I haven't looked at it in detail, but maybe it is related to these
> problems, and we should fix it without side-effects.
>
> So, in conclusion, I think your suggestion is beyond the scope of
> this patchset for the following two reasons.
>
> 1. I'd like to fix these problems without side-effects (a possible
> regression in the success rate of memory hotplug and CMA, and a change
> in the meaning of nr_freepage) so that the fixes can be backported.
> 2. nr_freepage that does not consider memory isolation is somewhat
> dangerous and not suitable for some systems.
>
> If you have any objection, please let me know. However, I will be on
> vacation for a week, so I can't answer further comments until I am
> back; I will reply to them next week. :)
>
> Thanks.
>
>>> 3. pages on cma buddy list are counted as CMA freepage, too.
>>> 4. pages for guard are *not* counted as freepage.
>>>
>>> Now, I describe problems and related patch.
>>>
>>> 1. Patch 2: If guard page are cleared and merged into isolate buddy list,
>>> we should not add freepage count.
>>>
>>> 2. Patch 3: When the page return back from pcp to buddy, we should
>>> account it to freepage counter. In this case, we should check the
>>> pageblock migratetype of the page and should insert the page into
>>> appropriate buddy list. Although we checked it in current code, we
>>> didn't insert the page into appropriate buddy list so that freepage
>>> counting can be wrong.
>>>
>>> 3. Patch 4: There is race condition so that some freepages could be
>>> on isolate buddy list. If so, we can't use this page until next isolation
>>> attempt on this pageblock.
>>>
>>> 4. Patch 5: There is race condition that page on isolate pageblock
>>> can go into non-isolate buddy list. If so, buddy allocator would
>>> merge pages on non-isolate buddy list and isolate buddy list, respectively,
>>> and freepage count will be wrong.
>>>
>>> 5. Patch 9: move_freepages(_block) returns *not* number of moved pages.
>>> Instead, it returns number of pages linked in that migratetype buddy list.
>>> So accouting with this return value makes freepage count wrong.
>>>
>>> 6. Patch 10: buddy allocator would merge pages on non-isolate buddy list
>>> and isolate buddy list, respectively. This leads to freepage counting
>>> problem so fix it by stopping merging in this case.
>>>
>>> Without patchset [1], above problem doesn't happens on my CMA allocation
>>> test, because CMA reserved pages aren't used at all. So there is no
>>> chance for above race.
>>>
>>> With patchset [1], I did simple CMA allocation test and get below result.
>>>
>>> - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
>>> - run kernel build (make -j16) on background
>>> - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
>>> - Result: more than 5000 freepage count are missed
>>>
>>> With patchset [1] and this patchset, I found that no freepage count are
>>> missed so that I conclude that problems are solved.
>>>
>>> These problems can be possible on memory hot remove users, although
>>> I didn't check it further.
>>>
>>> Other patches are either for the base to fix these problems or for
>>> simple clean-up. Please see individual patches for more information.
>>>
>>> This patchset is based on linux-next-20140703.
>>>
>>> Thanks.
>>>
>>> [1]: Aggressively allocate the pages on cma reserved memory
>>>       https://lkml.org/lkml/2014/5/30/291
>>>
>>>
>>> Joonsoo Kim (10):
>>>    mm/page_alloc: remove unlikely macro on free_one_page()
>>>    mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
>>>    mm/page_alloc: handle page on pcp correctly if it's pageblock is
>>>      isolated
>>>    mm/page_alloc: carefully free the page on isolate pageblock
>>>    mm/page_alloc: optimize and unify pageblock migratetype check in free
>>>      path
>>>    mm/page_alloc: separate freepage migratetype interface
>>>    mm/page_alloc: store migratetype of the buddy list into freepage
>>>      correctly
>>>    mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
>>>    mm/page_alloc: fix possible wrongly calculated freepage counter
>>>    mm/page_alloc: Stop merging pages on non-isolate and isolate buddy
>>>      list
>>>
>>>   include/linux/mm.h             |   30 +++++++--
>>>   include/linux/mmzone.h         |    5 ++
>>>   include/linux/page-isolation.h |    8 +++
>>>   mm/page_alloc.c                |  138 +++++++++++++++++++++++++++++-----------
>>>   mm/page_isolation.c            |   18 ++----
>>>   5 files changed, 147 insertions(+), 52 deletions(-)
>>>
>>
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
@ 2014-07-07 14:33       ` Vlastimil Babka
  0 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-07 14:33 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, Lisa Du,
	linux-kernel

On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
> Ccing Lisa, because there was bug report it may be related this
> topic last Saturday.
>
> http://www.spinics.net/lists/linux-mm/msg75741.html
>
> On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
>> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>>> Hello,
>>
>> Hi Joonsoo,
>>
>> please CC me on further updates, this is relevant to me.
>
> Hello, Vlastimil.
>
> Sorry for missing you. :)
>
>>
>>> This patchset aims at fixing problems due to memory isolation found by
>>> testing my patchset [1].
>>>
>>> These are really subtle problems so I can be wrong. If you find what I am
>>> missing, please let me know.
>>>
>>> Before describing bugs itself, I first explain definition of freepage.
>>>
>>> 1. pages on buddy list are counted as freepage.
>>> 2. pages on isolate migratetype buddy list are *not* counted as freepage.
>>
>> I think the second point is causing us a lot of trouble. And I wonder if it's really
>> justified! We already have some is_migrate_isolate() checks in the fast path and now you
>> would add more, mostly just to keep this accounting correct.
>
> It's not just for keeping accouting correct. There is another
> purpose for is_migrate_isolate(). It forces freed pages to go into
> isolate buddy list. Without it, freed pages would go into other
> buddy list and will be used soon. So memory isolation can't work well
> without is_migrate_isolate() checks and success rate could decrease.

Well I think that could be solved also by doing a lru/pcplists drain 
right after marking pageblock as MIGRATETYPE_ISOLATE. After that moment, 
anything newly put on pcplists should observe the new migratetype and go 
to the correct pcplists. As putting stuff on pcplists is done with 
disabled interrupts, and draining is done by IPI, I think it should work 
correctly if we put the migratetype determination under the disabled irq 
part in free_hot_cold_page().

> And, I just added three more is_migrate_isolate() in the fast
> path, but, two checks are in same *unlikely* branch and I can remove
> another one easily. Therefore it's not quite problem I guess. (It even
> does no-op if MEMORY_ISOLATION is disabled.)
> Moreover, I removed one unconditional get_pageblock_migratetype() in
> free_pcppages_bulk() so, in performance point or view, freepath would
> be improved.

I haven't checked the individual patches yet, but I'll try.

>>
>> So the question is, does it have to be correct? And (admiteddly not after a completely
>> exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
>>
>> Well I of course don't mean that the freepage accounts could go random completely, but
>> what if we allowed them to drift a bit, limiting both the max error and the timeframe
>> where errors are possible? After all, watermarks checking is already racy so I don't think
>> it would be hurt that much.
>
> I understand your suggestion. I once thought like as you, but give up
> that idea. Watermark checking is already racy, but, it's *only*
> protection to prevent memory allocation. After passing that check,
> there is no mean to prevent us from allocating memory. So it should
> be accurate as much as possible. If we allow someone to get the
> memory without considering memory isolation, free memory could be in
> really low state and system could be broken occasionally.

It would definitely require more thorough review, but I think with the 
pcplists draining changes outlined above, there shouldn't be any more 
inaccuracy happening.

>>
>> Now if we look at what both CMA and memory hot-remove does is:
>>
>> 1. Mark a MAX_ORDER-aligned buch of pageblocks as MIGRATE_ISOLATE through
>> start_isolate_page_range(). As part of this, all free pages in that area are
>> moved on the isolate freelist through move_freepages_block().
>>
>> 2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
>> caches.
>>
>> 3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
>> undo pageblock isolation if not.
>>
>> So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
>> allocator, which would therefore account free pages on the isolate freelist as normal free
>> pages.
>> The accounting of isolated pages would be instead done only on the top level of CMA/hot
>> remove in the three steps outlined above, which would be modified as follows:
>>
>> 1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
>> as usual. Calculate X as the number of pages that move_freepages_block() has moved.
>> Subtract X from freepages (this is the same as it is done now), but also *remember the
>> value of X*
>>
>> 2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
>> isolate freelist, or not. We don't care, they will be accounted as freepages either way.
>> This is where some inaccuracy in accounted freepages would build up.
>>
>> 3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
>> that we have a isolated range of N pages that nobody can steal now as everything is on
>> isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
>> X, then N-X). So the accounting matches reality.
>>
>> If we have to undo, we undo the isolation and as part of this, we use
>> move_freepages_block() to move pages from isolate freelist to the normal ones. But we
>> don't care how many pages were moved. We simply add the remembered value of X to the
>> number of freepages, undoing the change from step 1. Again, the accounting matches reality.
>>
>>
>> The final point is that if we do this per MAX_ORDER blocks, the error in accounting cannot
>> be ever larger than 4MB and will be visible only during time a single MAX_ORDER block is
>> handled.
>
> The 4MB error in accounting looks serious to me. min_free_kbytes is
> 4MB on a 1GB system, so this 4MB error could break everything on such
> systems. Moreover, some architectures have a pageblock_order larger
> than MAX_ORDER; in that case, the error will be larger than 4MB.
>
> In addition, I have a plan to extend CMA to work in parallel. That
> means there could be several memory isolation users at the same time
> rather than just one, so we cannot easily bound the error.
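For reference, the 4MB figure can be checked against how the kernel sizes
min_free_kbytes at boot; init_per_zone_wmark_min() computes roughly:

	lowmem_kbytes   = 1 GB = 1048576 kB
	min_free_kbytes = int_sqrt(lowmem_kbytes * 16)
	                = int_sqrt(16777216) = 4096 kB = 4 MB

(Roughly, because lowmem_kbytes is derived from nr_free_buffer_pages()
rather than total RAM, and the result is clamped to 128..65536 kB.)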

OK. I had thought that the error would be much smaller in practice, but 
due to how buddy merging works, it would probably be enough for just the 
very last freed page in the MAX_ORDER block to be misplaced: that would 
trigger a cascading merge ending with a single MAX_ORDER-sized page 
becoming misplaced. So let's probably forget this approach.
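Spelling the cascade out, assuming 4kB pages and MAX_ORDER-1 = 10 (a 4MB
block): suppose every page of an isolated MAX_ORDER block is already free
on the isolate freelist except one last order-0 page.

	free the last page with a stale non-isolate migratetype
	  -> __free_one_page() merges order 0 -> 1 -> ... -> 10
	     (merging only looks at PageBuddy and the order, not at which
	      freelist the buddy currently sits on)
	  -> the merged 2^10-page (4MB) buddy lands on the freelist of the
	     stale migratetype, although the whole block should be isolated

One misplaced 4kB free thus becomes a misplaced, mis-accounted 4MB block.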

>> As a possible improvement, we can assume during phase 2 that every page freed by migration
>> will end up correctly on isolate free list. So we create M free pages by migration, and
>> subtract M from freepage account. Then in phase 3 we either subtract (N - X - M), or add X
>> + M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
>> pages on pcplists and the possible races it will be less). I think with this improvement,
>> any error would be negligible.
>>
>> Thoughts?
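A rough sketch of the bookkeeping proposed above, per isolated range
(move_freepages_block_range() and migrate_range_and_drain() are
hypothetical shorthand for steps 1 and 2, not existing kernel functions):

	/* Step 1: isolate pageblocks, remember how many free pages moved. */
	X = move_freepages_block_range(zone, start, end);
	__mod_zone_page_state(zone, NR_FREE_PAGES, -X);

	/* Step 2: migrate non-free pages; M pages get freed by migration. */
	M = migrate_range_and_drain(start, end);
	__mod_zone_page_state(zone, NR_FREE_PAGES, -M);

	/* Step 3: settle the account against the target N. */
	if (test_pages_isolated(start, end, false))
		__mod_zone_page_state(zone, NR_FREE_PAGES, -(N - X - M));
	else	/* undo: restore exactly what steps 1 and 2 subtracted */
		__mod_zone_page_state(zone, NR_FREE_PAGES, X + M);

Either way the net change matches reality: -N on success, 0 on undo.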
>
> Thanks for the suggestion. :)
> It is a really good topic to think about deeply.
>
> For now, I'd like to fix these problems without side-effects, unlike
> your suggestion. Your suggestion changes the meaning of freepage so
> that isolated pages are included in nr_freepage, and there could be a
> regression in the success rate of memory hotplug and CMA. Possibly it
> is the way we have to go, but, IMHO, now isn't the time. Before going
> that way, we should fix the current implementation first, so that the
> fixes can be backported to old kernels if someone needs them.

Adding the extra pcplist drains and reordering 
get_pageblock_migratetype() under disabled IRQs should be small enough 
to be backportable and not bring any regressions to success rates. There 
will be some cost for the extra drains, but it is paid only when memory 
isolation actually happens. Extending the scope of disabled IRQs would 
affect the fast path somewhat, but I guess less than extra checks?
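A minimal sketch of that reordering in free_hot_cold_page(), assuming the
current layout where the migratetype is read before IRQs are disabled (the
elided parts are the existing pcplist handling):

	void free_hot_cold_page(struct page *page, bool cold)
	{
		struct zone *zone = page_zone(page);
		unsigned long pfn = page_to_pfn(page);
		unsigned long flags;
		int migratetype;

		...
		local_irq_save(flags);
		/*
		 * Read the pageblock migratetype only after IRQs are off:
		 * a drain_all_pages() IPI can no longer slip in between
		 * this read and the pcplist insertion, so the value cannot
		 * go stale while we use it.
		 */
		migratetype = get_pfnblock_migratetype(page, pfn);
		set_freepage_migratetype(page, migratetype);
		...
		local_irq_restore(flags);
	}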

> Interestingly,
> last Saturday Lisa Du reported CMA accounting bugs.
>
> http://www.spinics.net/lists/linux-mm/msg75741.html
>
> I haven't looked at it in detail, but it may be related to these
> problems, and we should fix it without side-effects.
>
> So, in conclusion, I think that your suggestion is beyond the scope of
> this patchset for the following two reasons.
>
> 1. I'd like to fix these problems without side-effects (a possible
> regression in the success rate of memory hotplug and CMA, and the
> change in the meaning of nr_freepage), to keep backports possible.
> 2. nr_freepage that doesn't consider memory isolation is somewhat
> dangerous and not suitable for some systems.
>
> If you have any objections, please let me know. But I am going on
> vacation for a week, so I can't answer further comments until then.
> I will reply to them next week. :)
>
> Thanks.
>
>>> 3. pages on cma buddy list are counted as CMA freepage, too.
>>> 4. pages for guard are *not* counted as freepage.
>>>
>>> Now, I describe problems and related patch.
>>>
>>> 1. Patch 2: If guard page are cleared and merged into isolate buddy list,
>>> we should not add freepage count.
>>>
>>> 2. Patch 3: When the page return back from pcp to buddy, we should
>>> account it to freepage counter. In this case, we should check the
>>> pageblock migratetype of the page and should insert the page into
>>> appropriate buddy list. Although we checked it in current code, we
>>> didn't insert the page into appropriate buddy list so that freepage
>>> counting can be wrong.
>>>
>>> 3. Patch 4: There is race condition so that some freepages could be
>>> on isolate buddy list. If so, we can't use this page until next isolation
>>> attempt on this pageblock.
>>>
>>> 4. Patch 5: There is race condition that page on isolate pageblock
>>> can go into non-isolate buddy list. If so, buddy allocator would
>>> merge pages on non-isolate buddy list and isolate buddy list, respectively,
>>> and freepage count will be wrong.
>>>
>>> 5. Patch 9: move_freepages(_block) returns *not* number of moved pages.
>>> Instead, it returns number of pages linked in that migratetype buddy list.
>>> So accouting with this return value makes freepage count wrong.
>>>
>>> 6. Patch 10: buddy allocator would merge pages on non-isolate buddy list
>>> and isolate buddy list, respectively. This leads to freepage counting
>>> problem so fix it by stopping merging in this case.
>>>
>>> Without patchset [1], above problem doesn't happens on my CMA allocation
>>> test, because CMA reserved pages aren't used at all. So there is no
>>> chance for above race.
>>>
>>> With patchset [1], I did simple CMA allocation test and get below result.
>>>
>>> - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
>>> - run kernel build (make -j16) on background
>>> - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
>>> - Result: more than 5000 freepage count are missed
>>>
>>> With patchset [1] and this patchset, I found that no freepage count are
>>> missed so that I conclude that problems are solved.
>>>
>>> These problems can be possible on memory hot remove users, although
>>> I didn't check it further.
>>>
>>> Other patches are either for the base to fix these problems or for
>>> simple clean-up. Please see individual patches for more information.
>>>
>>> This patchset is based on linux-next-20140703.
>>>
>>> Thanks.
>>>
>>> [1]: Aggressively allocate the pages on cma reserved memory
>>>       https://lkml.org/lkml/2014/5/30/291
>>>
>>>
>>> Joonsoo Kim (10):
>>>    mm/page_alloc: remove unlikely macro on free_one_page()
>>>    mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
>>>    mm/page_alloc: handle page on pcp correctly if it's pageblock is
>>>      isolated
>>>    mm/page_alloc: carefully free the page on isolate pageblock
>>>    mm/page_alloc: optimize and unify pageblock migratetype check in free
>>>      path
>>>    mm/page_alloc: separate freepage migratetype interface
>>>    mm/page_alloc: store migratetype of the buddy list into freepage
>>>      correctly
>>>    mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
>>>    mm/page_alloc: fix possible wrongly calculated freepage counter
>>>    mm/page_alloc: Stop merging pages on non-isolate and isolate buddy
>>>      list
>>>
>>>   include/linux/mm.h             |   30 +++++++--
>>>   include/linux/mmzone.h         |    5 ++
>>>   include/linux/page-isolation.h |    8 +++
>>>   mm/page_alloc.c                |  138 +++++++++++++++++++++++++++++-----------
>>>   mm/page_isolation.c            |   18 ++----
>>>   5 files changed, 147 insertions(+), 52 deletions(-)
>>>
>>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 02/10] mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
  2014-07-04  7:57   ` Joonsoo Kim
@ 2014-07-07 14:50     ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-07 14:50 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> In __free_one_page(), we check whether the buddy page is a guard page,
> and, if so, we should clear the guard attribute on the buddy page. But,
> currently, we clear the original page's order rather than the buddy's.
> This doesn't cause any problem, because resetting the buddy's order
> is useless

Well, it might theoretically confuse someone examining a crash dump to 
see an unexpected page->private value. So I agree it should be corrected.

> and the original page's order is re-assigned soon.
> But it is better to correct the code.
>
> Additionally, I rename (set/clear)_page_guard_flag() to
> (set/clear)_page_guard() and make these functions do all the work
> needed for guard pages. This may make the code more understandable.
>
> One more thing I did in this patch is fixing the freepage accounting.
> If we clear a guard page and link it onto the isolate buddy list, we
> should not increase the freepage count.

OK, since this is under CONFIG_DEBUG_PAGEALLOC, and I don't see another 
solution in this case, I agree.

> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   mm/page_alloc.c |   29 ++++++++++++++++-------------
>   1 file changed, 16 insertions(+), 13 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0d4cf7a..aeb51d1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -441,18 +441,28 @@ static int __init debug_guardpage_minorder_setup(char *buf)
>   }
>   __setup("debug_guardpage_minorder=", debug_guardpage_minorder_setup);
>
> -static inline void set_page_guard_flag(struct page *page)
> +static inline void set_page_guard(struct zone *zone, struct page *page,
> +				unsigned int order, int migratetype)
>   {
>   	__set_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
> +	set_page_private(page, order);
> +	/* Guard pages are not available for any usage */
> +	__mod_zone_freepage_state(zone, -(1 << order), migratetype);
>   }
>
> -static inline void clear_page_guard_flag(struct page *page)
> +static inline void clear_page_guard(struct zone *zone, struct page *page,
> +				unsigned int order, int migratetype)
>   {
>   	__clear_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
> +	set_page_private(page, 0);
> +	if (!is_migrate_isolate(migratetype))
> +		__mod_zone_freepage_state(zone, (1 << order), migratetype);
>   }
>   #else
> -static inline void set_page_guard_flag(struct page *page) { }
> -static inline void clear_page_guard_flag(struct page *page) { }
> +static inline void set_page_guard(struct zone *zone, struct page *page,
> +				unsigned int order, int migratetype) {}
> +static inline void clear_page_guard(struct zone *zone, struct page *page,
> +				unsigned int order, int migratetype) {}
>   #endif
>
>   static inline void set_page_order(struct page *page, unsigned int order)
> @@ -594,10 +604,7 @@ static inline void __free_one_page(struct page *page,
>   		 * merge with it and move up one order.
>   		 */
>   		if (page_is_guard(buddy)) {
> -			clear_page_guard_flag(buddy);
> -			set_page_private(page, 0);
> -			__mod_zone_freepage_state(zone, 1 << order,
> -						  migratetype);
> +			clear_page_guard(zone, buddy, order, migratetype);
>   		} else {
>   			list_del(&buddy->lru);
>   			zone->free_area[order].nr_free--;
> @@ -919,11 +926,7 @@ static inline void expand(struct zone *zone, struct page *page,
>   			 * pages will stay not present in virtual address space
>   			 */
>   			INIT_LIST_HEAD(&page[size].lru);
> -			set_page_guard_flag(&page[size]);
> -			set_page_private(&page[size], high);
> -			/* Guard pages are not available for any usage */
> -			__mod_zone_freepage_state(zone, -(1 << high),
> -						  migratetype);
> +			set_page_guard(zone, &page[size], high, migratetype);
>   			continue;
>   		}
>   #endif
>


^ permalink raw reply	[flat|nested] 79+ messages in thread


* Re: [PATCH 03/10] mm/page_alloc: handle page on pcp correctly if it's pageblock is isolated
  2014-07-04  7:57   ` Joonsoo Kim
@ 2014-07-07 15:19     ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-07 15:19 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> If the pageblock of a page on the pcp list is isolated now, we should free
> the page to the isolate buddy list to prevent future allocation of it. But
> the current code doesn't do this.
>
> Moreover, there is a freepage counting problem in the current code. Although
> the pageblock of a page on the pcp list is isolated now, the page could go to
> a normal buddy list, because get_onpcp_migratetype() will return a
> non-isolate migratetype.

get_onpcp_migratetype() is only introduced in a later patch.

> In this case, we should either add to the freepage count or change the
> migratetype to MIGRATE_ISOLATE, but the current code does neither.

I wouldn't say it "does neither". It already limits the freepage counting 
to the !MIGRATE_ISOLATE case (and it's not converted to 
__mod_zone_freepage_state for some reason). So there's an accounting 
mismatch in addition to the buddy list misplacement.

> This patch fixes these two problems by handling the pageblock migratetype
> before calling __free_one_page(). And, if we find a page on an isolated
> pageblock, we change its migratetype to MIGRATE_ISOLATE to prevent future
> allocation of the page and the freepage counting problem.

So although this is not an addition of a new pageblock migratetype check 
to the fast path (the check is already there), I would prefer removing 
the check :) With the approach of pcplists draining outlined in my reply 
to 00/10, we would allow a misplacement to happen (and the page to be 
accounted as a freepage), immediately followed by move_freepages_block() 
which would place the page onto the isolate freelist with the rest. 
Anything newly freed will get the isolate migratetype determined in 
free_hot_cold_page or __free_pages_ok (where it would need moving the 
migratetype check under the disabled-IRQ part) and be placed and 
buddy-merged properly.
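The isolation-side ordering that makes this safe might look like the
following sketch of the set_migratetype_isolate() flow (note that in this
kernel drain_all_pages() takes no arguments):

	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
	/*
	 * Drain only after the pageblock is marked: anything freed from
	 * now on reads MIGRATE_ISOLATE (under disabled IRQs) and goes to
	 * the right list, while anything already sitting on a pcplist is
	 * flushed to the buddy lists here...
	 */
	drain_all_pages();
	/*
	 * ...so this single sweep now catches every free page in the
	 * block, including any that were briefly misplaced.
	 */
	move_freepages_block(zone, page, MIGRATE_ISOLATE);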

> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>   mm/page_alloc.c |   14 ++++++++------
>   1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index aeb51d1..99c05f7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -719,15 +719,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>   			page = list_entry(list->prev, struct page, lru);
>   			/* must delete as __free_one_page list manipulates */
>   			list_del(&page->lru);
> -			mt = get_freepage_migratetype(page);
> +
> +			if (unlikely(is_migrate_isolate_page(page))) {
> +				mt = MIGRATE_ISOLATE;
> +			} else {
> +				mt = get_freepage_migratetype(page);
> +				__mod_zone_freepage_state(zone, 1, mt);
> +			}
> +
>   			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
>   			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
>   			trace_mm_page_pcpu_drain(page, 0, mt);
> -			if (likely(!is_migrate_isolate_page(page))) {
> -				__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
> -				if (is_migrate_cma(mt))
> -					__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
> -			}
>   		} while (--to_free && --batch_free && !list_empty(list));
>   	}
>   	spin_unlock(&zone->lock);
>


^ permalink raw reply	[flat|nested] 79+ messages in thread


* Re: [PATCH 04/10] mm/page_alloc: carefully free the page on isolate pageblock
  2014-07-04  7:57   ` Joonsoo Kim
@ 2014-07-07 15:43     ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-07 15:43 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> We get the migratetype without holding the lock, so it could be
> racy. If some pages go onto the isolate migratetype buddy list
> because of this race, we can't allocate those pages anymore until
> the next isolation attempt on this pageblock. Below is a possible
> scenario of this race.
>
> pageblock 1 has the isolate migratetype.
>
> CPU1					CPU2
> - get_pfnblock_migratetype(pageblock 1),
> so MIGRATE_ISOLATE is returned
> - call free_one_page() with MIGRATE_ISOLATE
> 					- grab the zone lock
> 					- unisolate pageblock 1
> 					- release the zone lock
> - grab the zone lock
> - call __free_one_page() with MIGRATE_ISOLATE
> - the freed page goes onto the isolate buddy list
> and we can't use it anymore
>
> To prevent this possibility, re-check the migratetype while holding the lock.

This could also be solved similarly to the other races, if during 
unisolation CPU2 sent a drain_all_pages() IPI and only then used 
move_freepages_block(). Again, get_pfnblock_migratetype() on CPU1 would 
need to be moved under disabled IRQs.
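In sketch form, the unisolation side would then become (a hypothetical
reworking of unset_migratetype_isolate(), not the current code):

	set_pageblock_migratetype(page, migratetype);
	/*
	 * The drain IPI cannot be serviced by a CPU that is freeing a
	 * page with IRQs disabled, so once drain_all_pages() returns,
	 * every in-flight free that read the old MIGRATE_ISOLATE value
	 * has already hit a buddy list...
	 */
	drain_all_pages();
	/* ...and this sweep moves all of them to the right freelist. */
	move_freepages_block(zone, page, migratetype);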

> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>   mm/page_alloc.c |   11 +++++++++++
>   1 file changed, 11 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 99c05f7..d8feedc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -743,6 +743,17 @@ static void free_one_page(struct zone *zone,
>   	spin_lock(&zone->lock);
>   	zone->pages_scanned = 0;
>
> +	if (unlikely(is_migrate_isolate(migratetype))) {
> +		/*
> +		 * We got migratetype without holding the lock so it could be
> +		 * racy. If some pages go on the isolate migratetype buddy list
> +		 * by this race, we can't allocate this page anymore until next
> +		 * isolation attempt on this pageblock. To prevent this
> +		 * possibility, re-check migratetype with holding the lock.
> +		 */
> +		migratetype = get_pfnblock_migratetype(page, pfn);
> +	}
> +
>   	__free_one_page(page, pfn, zone, order, migratetype);
>   	if (!is_migrate_isolate(migratetype))
>   		__mod_zone_freepage_state(zone, 1 << order, migratetype);
>


^ permalink raw reply	[flat|nested] 79+ messages in thread


* Re: [PATCH 05/10] mm/page_alloc: optimize and unify pageblock migratetype check in free path
  2014-07-04  7:57   ` Joonsoo Kim
@ 2014-07-07 15:50     ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-07 15:50 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> Currently, when we free a page from the pcp list to the buddy allocator,
> we check the pageblock of the page in order to isolate the page if it is
> on an isolated pageblock. Although this rarely happens, and checking the
> pageblock migratetype is somewhat expensive, we do it on the free fast
> path. I think this is undesirable. To prevent this situation, I introduce
> a new variable, nr_isolate_pageblock, in struct zone and use it to
> determine whether we should check the pageblock migratetype. Pageblock
> isolation rarely happens, so we can mostly avoid this migratetype check.

Better, but there's still a zone flag check and its maintenance. So if it 
could be avoided, it would be better.

> Additionally, unify the freepage counting code, because it can be done in
> the common part, __free_one_page(). This unification provides an extra
> guarantee that pages on an isolate pageblock don't go into a non-isolate
> buddy list. This is a situation similar to the one described in the
> previous patch, so refer to it if you need more explanation.

You should make it clearer that you are solving misplacement of the type 
"page should be placed on the isolate freelist but it's not" through 
free_one_page(), which was solved only for free_pcppages_bulk() in patch 
03/10. Mentioning patch 04/10 here, which solves the opposite problem, 
"page shouldn't be placed on the isolate freelist, but it is", only 
confuses the situation. Also, this patch undoes everything from 04/10 and 
moves it elsewhere, which would make it harder to git blame etc. I 
would reorder 04 and 05.

> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>   include/linux/mmzone.h         |    5 +++++
>   include/linux/page-isolation.h |    8 ++++++++
>   mm/page_alloc.c                |   40 +++++++++++++++++++---------------------
>   mm/page_isolation.c            |    2 ++
>   4 files changed, 34 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fd48890..e9c194f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -374,6 +374,11 @@ struct zone {
>   	/* pfn where async and sync compaction migration scanner should start */
>   	unsigned long		compact_cached_migrate_pfn[2];
>   #endif
> +
> +#ifdef CONFIG_MEMORY_ISOLATION
> +	unsigned long		nr_isolate_pageblock;
> +#endif
> +
>   #ifdef CONFIG_MEMORY_HOTPLUG
>   	/* see spanned/present_pages for more description */
>   	seqlock_t		span_seqlock;
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 3fff8e7..2dc1e16 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -2,6 +2,10 @@
>   #define __LINUX_PAGEISOLATION_H
>
>   #ifdef CONFIG_MEMORY_ISOLATION
> +static inline bool has_isolate_pageblock(struct zone *zone)
> +{
> +	return zone->nr_isolate_pageblock;
> +}
>   static inline bool is_migrate_isolate_page(struct page *page)
>   {
>   	return get_pageblock_migratetype(page) == MIGRATE_ISOLATE;
> @@ -11,6 +15,10 @@ static inline bool is_migrate_isolate(int migratetype)
>   	return migratetype == MIGRATE_ISOLATE;
>   }
>   #else
> +static inline bool has_isolate_pageblock(struct zone *zone)
> +{
> +	return false;
> +}
>   static inline bool is_migrate_isolate_page(struct page *page)
>   {
>   	return false;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d8feedc..dcc2f08 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -594,6 +594,24 @@ static inline void __free_one_page(struct page *page,
>   	VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
>   	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>
> +	/*
> +	 * pcp pages could go normal buddy list due to stale pageblock
> +	 * migratetype so re-check it if there is isolate pageblock.
> +	 *
> +	 * And, we got migratetype without holding the lock so it could be
> +	 * racy. If some pages go on the isolate migratetype buddy list
> +	 * by this race, we can't allocate this page anymore until next
> +	 * isolation attempt on this pageblock. To prevent this
> +	 * possibility, re-check migratetype with holding the lock.
> +	 */
> +	if (unlikely(has_isolate_pageblock(zone) ||
> +		is_migrate_isolate(migratetype))) {
> +		migratetype = get_pfnblock_migratetype(page, pfn);
> +	}
> +
> +	if (!is_migrate_isolate(migratetype))
> +		__mod_zone_freepage_state(zone, 1 << order, migratetype);
> +
>   	while (order < MAX_ORDER-1) {
>   		buddy_idx = __find_buddy_index(page_idx, order);
>   		buddy = page + (buddy_idx - page_idx);
> @@ -719,13 +737,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>   			page = list_entry(list->prev, struct page, lru);
>   			/* must delete as __free_one_page list manipulates */
>   			list_del(&page->lru);
> -
> -			if (unlikely(is_migrate_isolate_page(page))) {
> -				mt = MIGRATE_ISOLATE;
> -			} else {
> -				mt = get_freepage_migratetype(page);
> -				__mod_zone_freepage_state(zone, 1, mt);
> -			}
> +			mt = get_freepage_migratetype(page);
>
>   			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
>   			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> @@ -742,21 +754,7 @@ static void free_one_page(struct zone *zone,
>   {
>   	spin_lock(&zone->lock);
>   	zone->pages_scanned = 0;
> -
> -	if (unlikely(is_migrate_isolate(migratetype))) {
> -		/*
> -		 * We got migratetype without holding the lock so it could be
> -		 * racy. If some pages go on the isolate migratetype buddy list
> -		 * by this race, we can't allocate this page anymore until next
> -		 * isolation attempt on this pageblock. To prevent this
> -		 * possibility, re-check migratetype with holding the lock.
> -		 */
> -		migratetype = get_pfnblock_migratetype(page, pfn);
> -	}
> -
>   	__free_one_page(page, pfn, zone, order, migratetype);
> -	if (!is_migrate_isolate(migratetype))
> -		__mod_zone_freepage_state(zone, 1 << order, migratetype);
>   	spin_unlock(&zone->lock);
>   }
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index d1473b2..1fa4a4d 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -60,6 +60,7 @@ out:
>   		int migratetype = get_pageblock_migratetype(page);
>
>   		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> +		zone->nr_isolate_pageblock++;
>   		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
>
>   		__mod_zone_freepage_state(zone, -nr_pages, migratetype);
> @@ -83,6 +84,7 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>   	nr_pages = move_freepages_block(zone, page, migratetype);
>   	__mod_zone_freepage_state(zone, nr_pages, migratetype);
>   	set_pageblock_migratetype(page, migratetype);
> +	zone->nr_isolate_pageblock--;
>   out:
>   	spin_unlock_irqrestore(&zone->lock, flags);
>   }
>


^ permalink raw reply	[flat|nested] 79+ messages in thread


* Re: [PATCH 08/10] mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
  2014-07-04  7:57   ` Joonsoo Kim
@ 2014-07-07 15:57     ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-07 15:57 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> When isolating a free page, what we want to know is which list the page
> is linked on. If it is linked on the isolate migratetype buddy list,
> we can skip the watermark check and freepage counting. And if it is
> linked on the CMA migratetype buddy list, we need to fix up the freepage
> counting. For this purpose, get_onbuddy_migratetype() is a better fit,
> and cheaper, than get_pageblock_migratetype(). So use it.

Hm but you made get_onbuddy_migratetype() work only with 
CONFIG_MEMORY_ISOLATION. And __isolate_free_page is (despite the name) 
not at all limited to CONFIG_MEMORY_ISOLATION.
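The shape of the hazard, in sketch form (the real definition comes from
patch 06, which is not quoted in this sub-thread, so treat the bodies
below as placeholders):

	#ifdef CONFIG_MEMORY_ISOLATION
	/* reads the migratetype of the buddy list the page sits on */
	static inline int get_onbuddy_migratetype(struct page *page);
	#else
	/*
	 * No useful definition here, yet __isolate_free_page() is built
	 * unconditionally and calls it even when CONFIG_MEMORY_ISOLATION
	 * is off.
	 */
	#endif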

> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>   mm/page_alloc.c |    2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e1c4c3e..d9fb8bb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1597,7 +1597,7 @@ static int __isolate_free_page(struct page *page, unsigned int order)
>   	BUG_ON(!PageBuddy(page));
>
>   	zone = page_zone(page);
> -	mt = get_pageblock_migratetype(page);
> +	mt = get_onbuddy_migratetype(page);
>
>   	if (!is_migrate_isolate(mt)) {
>   		/* Obey watermarks as if the page was being allocated */
>


^ permalink raw reply	[flat|nested] 79+ messages in thread


* Re: [PATCH 08/10] mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
  2014-07-07 15:57     ` Vlastimil Babka
@ 2014-07-08  1:01       ` Gioh Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Gioh Kim @ 2014-07-08  1:01 UTC (permalink / raw)
  To: Vlastimil Babka, Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, linux-mm, linux-kernel



On 2014-07-08 12:57 AM, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>> When isolating free page, what we want to know is which list
>> the page is linked. If it is linked in isolate migratetype buddy list,
>> we can skip watermark check and freepage counting. And if it is linked
>> in CMA migratetype buddy list, we need to fixup freepage counting. For
>> this purpose, get_onbuddy_migratetype() is more fit and cheap than
>> get_pageblock_migratetype(). So use it.
>
> Hm but you made get_onbuddy_migratetype() work only with CONFIG_MEMORY_ISOLATION. And __isolate_free_page is (despite the name) not at all limited to CONFIG_MEMORY_ISOLATION.
>

Currently __isolate_free_page is called only by split_free_page, and split_free_page only by isolate_freepages_block.
split_free_page is called only for isolated pages now, but that could change someday.
I think get_onbuddy_migratetype should work in any situation.

And I think the name get_onbuddy_migratetype is confusing.
Because of _onbuddy_, it might suggest that the pages themselves are buddy pages.
I think the original _freepage_ name is the proper one.


>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> ---
>>   mm/page_alloc.c |    2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e1c4c3e..d9fb8bb 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1597,7 +1597,7 @@ static int __isolate_free_page(struct page *page, unsigned int order)
>>       BUG_ON(!PageBuddy(page));
>>
>>       zone = page_zone(page);
>> -    mt = get_pageblock_migratetype(page);
>> +    mt = get_onbuddy_migratetype(page);
>>
>>       if (!is_migrate_isolate(mt)) {
>>           /* Obey watermarks as if the page was being allocated */
>>
>
>

^ permalink raw reply	[flat|nested] 79+ messages in thread


* Re: [PATCH 03/10] mm/page_alloc: handle page on pcp correctly if it's pageblock is isolated
  2014-07-07  8:25     ` Gioh Kim
@ 2014-07-08  7:18       ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-08  7:18 UTC (permalink / raw)
  To: Gioh Kim, Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, linux-mm, linux-kernel

On 07/07/2014 10:25 AM, Gioh Kim wrote:
> 
> 
> 2014-07-04 오후 4:57, Joonsoo Kim 쓴 글:
>> If the pageblock of a page on the pcp list is isolated now, we should free
>> the page to the isolate buddy list to prevent future allocation of it. But
>> the current code doesn't do this.
> 
> I think it is strange that the pcp can hold an isolated page.
> I remember that the purpose of the pcp is to hold frequently used pages (hot in cache),
> but an isolated page is not one that is frequently allocated and freed.

It can happen that a page was placed on the pcplist *before* the pageblock
was isolated.
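A timeline of that situation, in the same style as the race described in
patch 04:

	CPU1					CPU2
	- free page P
	  (its pageblock is still MOVABLE)
	- P goes onto the pcplist
						- start_isolate_page_range()
						- pageblock -> MIGRATE_ISOLATE
						- move_freepages_block()
						  (misses P: it sits on the
						  pcplist, not a buddy list)
	- later, the pcplist is drained and
	  P is bulk-freed with stale info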

> (The above is my guess; it could be wrong. I know the CMA and hot-plug features use isolated pages,
> so I guess isolated pages are not for frequently allocated and freed memory.)
> 
> Anyway, if an isolated page is not good for the pcp, what about preventing isolated pages from entering the pcp?
> I think fixing free_hot_cold_page() could be one way to prevent it.
> free_hot_cold_page() checks the migratetype and inserts CMA pages into the pcp.
> Am I correct?
> 
> If I am correct, I think inserting CMA pages into the pcp is harmful for both CMA and the pcp.
> If a CMA page is on the pcp, it will be used frequently and prevent making contiguous memory.
> What about keeping CMA pages off the pcp, like the following?

CMA is not the only user of memory isolation; there's also memory
hot-remove. Also, CMA tries to make the CMA-marked pages usable as normal
MOVABLE pages until a CMA allocation is needed. Here you would restrict
their usability by keeping them off the pcplists.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8c9eeec..1cbb816 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1413,13 +1413,8 @@ void free_hot_cold_page(struct page *page, bool cold)
>           * areas back if necessary. Otherwise, we may have to free
>           * excessively into the page allocator
>           */
> -       if (migratetype >= MIGRATE_PCPTYPES) {
> -               if (unlikely(is_migrate_isolate(migratetype))) {
> -                       free_one_page(zone, page, pfn, 0, migratetype);
> -                       goto out;
> -               }
> -               migratetype = MIGRATE_MOVABLE;
> -       }
> +       if (migratetype > MIGRATE_PCPTYPES)
> +               goto out;

Certainly not, for the reasons above and also:
- by changing >= to > you stopped handling RESERVE pages specially
- by removing the free_one_page() call you are now leaking the CMA and
ISOLATE pages.
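
For reference, here is a sketch of what the removed branch guarantees.
The code is the v3.16-era free_hot_cold_page() logic from the '-' lines
above; the comments are added:

	if (migratetype >= MIGRATE_PCPTYPES) {
		if (unlikely(is_migrate_isolate(migratetype))) {
			/* isolated pages bypass the pcplists and go
			 * straight to the isolate buddy list */
			free_one_page(zone, page, pfn, 0, migratetype);
			goto out;
		}
		/* CMA/RESERVE pages are still cached, but on the
		 * MOVABLE pcplist */
		migratetype = MIGRATE_MOVABLE;
	}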

>          pcp = &this_cpu_ptr(zone->pageset)->pcp;
>          if (!cold)
> 
> 
>>
>> Moreover, there is a freepage counting problem in the current code. Although
>> the pageblock of a page on the pcp list is isolated now, the page could go to a normal buddy list,
>> because get_onpcp_migratetype() will return a non-isolate migratetype.
> 
> I cannot find get_onpcp_migratetype() in v3.16.0-rc3 and also cannot google it.
> Is it a typo?
> 
>> In this case, we should either add to the freepage count or change the
>> migratetype to MIGRATE_ISOLATE, but the current code does neither.
>>
>> This patch fixes these two problems by handling the pageblock migratetype
>> before calling __free_one_page(). And, if we find the page on an isolated
>> pageblock, we change its migratetype to MIGRATE_ISOLATE to prevent future
>> allocation of this page and the freepage counting problem.
>>
>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> ---
>>    mm/page_alloc.c |   14 ++++++++------
>>    1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index aeb51d1..99c05f7 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -719,15 +719,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>    			page = list_entry(list->prev, struct page, lru);
>>    			/* must delete as __free_one_page list manipulates */
>>    			list_del(&page->lru);
>> -			mt = get_freepage_migratetype(page);
>> +
>> +			if (unlikely(is_migrate_isolate_page(page))) {
>> +				mt = MIGRATE_ISOLATE;
>> +			} else {
>> +				mt = get_freepage_migratetype(page);
>> +				__mod_zone_freepage_state(zone, 1, mt);
>> +			}
>> +
>>    			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
>>    			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
>>    			trace_mm_page_pcpu_drain(page, 0, mt);
>> -			if (likely(!is_migrate_isolate_page(page))) {
>> -				__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
>> -				if (is_migrate_cma(mt))
>> -					__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
>> -			}
>>    		} while (--to_free && --batch_free && !list_empty(list));
>>    	}
>>    	spin_unlock(&zone->lock);
>>
> 




* Re: [PATCH 08/10] mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
  2014-07-08  1:01       ` Gioh Kim
@ 2014-07-08  7:23         ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-08  7:23 UTC (permalink / raw)
  To: Gioh Kim, Joonsoo Kim, Andrew Morton
  Cc: Kirill A. Shutemov, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei,
	Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, linux-mm, linux-kernel

On 07/08/2014 03:01 AM, Gioh Kim wrote:
>
>
> On 2014-07-08 12:57 AM, Vlastimil Babka wrote:
>> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>>> When isolating a free page, what we want to know is which list
>>> the page is linked on. If it is linked on the isolate migratetype buddy list,
>>> we can skip the watermark check and freepage counting. And if it is linked
>>> on the CMA migratetype buddy list, we need to fix up the freepage counting. For
>>> this purpose, get_onbuddy_migratetype() is a better fit and cheaper than
>>> get_pageblock_migratetype(). So use it.
>>
>> Hm but you made get_onbuddy_migratetype() work only with CONFIG_MEMORY_ISOLATION. And __isolate_free_page is (despite the name) not at all limited to CONFIG_MEMORY_ISOLATION.
>>
>
> Currently, __isolate_free_page is called only by split_free_page, and split_free_page only by isolate_freepages_block.
> split_free_page is called only for isolated pages now, but that could change someday.

Yeah, but isolate_freepages_block is also not part of
CONFIG_MEMORY_ISOLATION. Unfortunately, there are two distinct concepts,
memory isolation and LRU isolation, that are not distinguished in
function names.

> I think get_onbuddy_migratetype should work in any situation.
>
> And I think the name get_onbuddy_migratetype is confusing.
> Because of _onbuddy_, it might suggest that the pages are buddy pages.
> I think the original name, _freepage_, is the proper one.
>
>
>>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>> ---
>>>    mm/page_alloc.c |    2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index e1c4c3e..d9fb8bb 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1597,7 +1597,7 @@ static int __isolate_free_page(struct page *page, unsigned int order)
>>>        BUG_ON(!PageBuddy(page));
>>>
>>>        zone = page_zone(page);
>>> -    mt = get_pageblock_migratetype(page);
>>> +    mt = get_onbuddy_migratetype(page);
>>>
>>>        if (!is_migrate_isolate(mt)) {
>>>            /* Obey watermarks as if the page was being allocated */
>>>
>>
>>
>




* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-07 14:33       ` Vlastimil Babka
@ 2014-07-14  6:22         ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-14  6:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, Lisa Du,
	linux-kernel

On Mon, Jul 07, 2014 at 04:33:09PM +0200, Vlastimil Babka wrote:
> On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
> >Ccing Lisa, because there was a bug report last Saturday that may be
> >related to this topic.
> >
> >http://www.spinics.net/lists/linux-mm/msg75741.html
> >
> >On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
> >>On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> >>>Hello,
> >>
> >>Hi Joonsoo,
> >>
> >>please CC me on further updates, this is relevant to me.
> >
> >Hello, Vlastimil.
> >
> >Sorry for missing you. :)
> >
> >>
> >>>This patchset aims at fixing problems due to memory isolation found by
> >>>testing my patchset [1].
> >>>
> >>>These are really subtle problems so I can be wrong. If you find what I am
> >>>missing, please let me know.
> >>>
> >>>Before describing bugs itself, I first explain definition of freepage.
> >>>
> >>>1. pages on buddy list are counted as freepage.
> >>>2. pages on isolate migratetype buddy list are *not* counted as freepage.
> >>
> >>I think the second point is causing us a lot of trouble. And I wonder if it's really
> >>justified! We already have some is_migrate_isolate() checks in the fast path and now you
> >>would add more, mostly just to keep this accounting correct.
> >
> >It's not just for keeping the accounting correct. There is another
> >purpose for is_migrate_isolate(). It forces freed pages to go onto the
> >isolate buddy list. Without it, freed pages would go onto other
> >buddy lists and be used soon. So memory isolation can't work well
> >without the is_migrate_isolate() checks, and the success rate could decrease.
> 
> Well I think that could be solved also by doing a lru/pcplists drain
> right after marking pageblock as MIGRATETYPE_ISOLATE. After that
> moment, anything newly put on pcplists should observe the new
> migratetype and go to the correct pcplists. As putting stuff on
> pcplists is done with disabled interrupts, and draining is done by
> IPI, I think it should work correctly if we put the migratetype
> determination under the disabled irq part in free_hot_cold_page().
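
For concreteness, here is a rough sketch of the ordering proposed above,
using 3.16-era function names; this is an illustration of the idea, not
a tested patch:

set_migratetype_isolate() side:

	spin_lock_irqsave(&zone->lock, flags);
	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
	move_freepages_block(zone, page, MIGRATE_ISOLATE);
	spin_unlock_irqrestore(&zone->lock, flags);
	drain_all_pages();	/* flush stale pcplist entries via IPI */

free_hot_cold_page() side:

	local_irq_save(flags);
	/* determine the migratetype only after irqs are disabled, so the
	 * drain issued after set_pageblock_migratetype() cannot miss us */
	migratetype = get_pfnblock_migratetype(page, pfn);
	...
	local_irq_restore(flags);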

Hello, sorry for the late reply.

Yes, this can close the race on the migratetype buddy list, but there is
a problem. See below.

> 
> >And, I just added three more is_migrate_isolate() checks in the fast
> >path, but two of them are in the same *unlikely* branch and I can remove
> >another one easily. Therefore it's not quite a problem, I guess. (It is even
> >a no-op if MEMORY_ISOLATION is disabled.)
> >Moreover, I removed one unconditional get_pageblock_migratetype() in
> >free_pcppages_bulk(), so from a performance point of view the free path
> >would be improved.
> 
> I haven't checked the individual patches yet, but I'll try.
> 
> >>
> >>So the question is, does it have to be correct? And (admittedly not after a completely
> >>exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
> >>
> >>Well I of course don't mean that the freepage accounts could go completely random, but
> >>what if we allowed them to drift a bit, limiting both the max error and the timeframe
> >>where errors are possible? After all, watermark checking is already racy, so I don't think
> >>it would hurt that much.
> >
> >I understand your suggestion. I once thought like you, but gave up on
> >that idea. Watermark checking is already racy, but it's the *only*
> >protection to prevent memory allocation. After passing that check,
> >there is no means to prevent us from allocating memory. So it should
> >be as accurate as possible. If we allow someone to get
> >memory without considering memory isolation, free memory could be in
> >a really low state and the system could break occasionally.
> 
> It would definitely require more thorough review, but I think with
> the pcplists draining changes outlined above, there shouldn't be any
> more inaccuracy happening.
> 
> >>
> >>Now if we look at what both CMA and memory hot-remove does is:
> >>
> >>1. Mark a MAX_ORDER-aligned bunch of pageblocks as MIGRATE_ISOLATE through
> >>start_isolate_page_range(). As part of this, all free pages in that area are
> >>moved on the isolate freelist through move_freepages_block().
> >>
> >>2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
> >>caches.
> >>
> >>3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
> >>undo pageblock isolation if not.
> >>
> >>So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
> >>allocator, which would therefore account free pages on the isolate freelist as normal free
> >>pages.
> >>The accounting of isolated pages would be instead done only on the top level of CMA/hot
> >>remove in the three steps outlined above, which would be modified as follows:
> >>
> >>1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
> >>as usual. Calculate X as the number of pages that move_freepages_block() has moved.
> >>Subtract X from freepages (this is the same as it is done now), but also *remember the
> >>value of X*
> >>
> >>2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
> >>isolate freelist, or not. We don't care, they will be accounted as freepages either way.
> >>This is where some inaccuracy in accounted freepages would build up.
> >>
> >>3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
> >>that we have an isolated range of N pages that nobody can steal now as everything is on
> >>isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
> >>X, then N-X). So the accounting matches reality.
> >>
> >>If we have to undo, we undo the isolation and as part of this, we use
> >>move_freepages_block() to move pages from isolate freelist to the normal ones. But we
> >>don't care how many pages were moved. We simply add the remembered value of X to the
> >>number of freepages, undoing the change from step 1. Again, the accounting matches reality.
> >>
> >>
> >>The final point is that if we do this per MAX_ORDER blocks, the error in accounting cannot
> >>be ever larger than 4MB and will be visible only during time a single MAX_ORDER block is
> >>handled.
> >
> >The 4MB error in accounting looks serious to me. min_free_kbytes is
> >4MB on a 1GB system, so this 4MB error could break everything on
> >such systems. Moreover, there are some ARCHs having a larger
> >pageblock_order than MAX_ORDER. In those cases, the error will be larger
> >than 4MB.
> >
> >In addition, I have a plan to extend CMA to work in parallel. It means
> >that there could be parallel memory isolation users rather than just
> >one user at the same time, so we cannot easily bound the error to
> >any given level.
> 
> OK. I had thought that the error would be much smaller in practice,
> but probably due to how buddy merging works, it would be enough if
> just the very last freed page in the MAX_ORDER block was misplaced,
> and that would trigger a cascading merge that would end with a single
> page of MAX_ORDER size becoming misplaced. So let's probably forget
> this approach.
> 
> >>As a possible improvement, we can assume during phase 2 that every page freed by migration
> >>will end up correctly on isolate free list. So we create M free pages by migration, and
> >>subtract M from freepage account. Then in phase 3 we either subtract (N - X - M), or add X
> >>+ M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
> >>pages on pcplists and the possible races it will be less). I think with this improvement,
> >>any error would be negligible.
> >>
> >>Thoughts?
> >
> >Thanks for the suggestion. :)
> >It is a really good topic to think about deeply.
> >
> >For now, I'd like to fix these problems without side-effects as you
> >suggested. Your suggestion changes the meaning of freepage so that
> >isolated pages are included in nr_freepage, and there could be a possible
> >regression in the success rate of memory hotplug and CMA. Possibly, it
> >is the way we have to go, but, IMHO, now isn't the time. Before
> >going that way, we should fix the current implementation first so that
> >the fixes can be backported to older kernels if someone needs them.
> 
> Adding the extra pcplist drains and reordering
> get_pageblock_migratetype() under disabled IRQs should be small
> enough to be backportable and not bring any regressions to success
> rates. There will be some cost for the extra drains, but it is paid only
> when memory isolation actually happens. Extending the scope of
> disabled irqs would affect the fast path somewhat, but I guess less
> than the extra checks?

Your approach would close the race on the migratetype buddy list. But
I'm not sure how we can count the number of freepages correctly. IIUC,
unlike my approach, this approach allows merging between the isolate buddy
list and the normal buddy list, and that would result in miscounting the
number of freepages. You probably think that move_freepages_block() could
fix up the whole situation, but I don't think so.

After applying your approach,

CPU1                                    CPU2
set_migratetype_isolate()               - free_one_page()
  - lock
  - set_pageblock_migratetype()
  - unlock
  - drain...
                                        - lock
                                        - __free_one_page() with MIGRATE_ISOLATE
                                        - merged with normal page and linked with 
                                        isolate buddy list
                                        - skip to count freepage, because of
                                        MIGRATE_ISOLATE
                                        - unlock
  - lock
  - move_freepages_block()    
    try to fixup freepages count
  - unlock

We want to fix up the counting in move_freepages_block(), but we can't,
because we don't have enough information about whether this page has
already been accounted in the number of freepages or not. Could you help
me understand the whole mechanism of your approach?
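
To show where the information is missing, here is a condensed sketch of
the 3.16-era move_freepages() loop; the trailing comment is mine:

	for (page = start_page; page <= end_page;) {
		if (!PageBuddy(page)) {
			page++;
			continue;
		}
		order = page_order(page);
		list_move(&page->lru,
			  &zone->free_area[order].free_list[migratetype]);
		set_freepage_migratetype(page, migratetype);
		page += 1 << order;
		pages_moved += 1 << order;
	}
	/* pages_moved only counts pages now on the target list; nothing
	 * here records whether a page was counted in NR_FREE_PAGES when
	 * it was freed, so the caller cannot compute the right fixup */
	return pages_moved;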

And, IIUC, your approach can remove only one get_pageblock_migratetype()
in free_pcppages_bulk(). free_hot_cold_page() and free_one_page()
would still need is_migrate_isolate() branches to free the page to the
correct list and count the number of freepages correctly. Additionally,
it extends the scope of disabled irqs. So it doesn't look much better
than mine, because my approach also removes the
get_pageblock_migratetype() in free_pcppages_bulk().

Please let me know what I am missing. :)

Thanks.

> 
> >Interestingly,
> >last Saturday Lisa Du reported CMA accounting bugs.
> >
> >http://www.spinics.net/lists/linux-mm/msg75741.html
> >
> >I haven't looked at it in detail, but maybe it is related to these
> >problems, and we should fix it without side-effects.
> >
> >So, in conclusion, I think that your suggestion is beyond the scope of
> >this patchset for the following two reasons.
> >
> >1. I'd like to fix these problems without side-effects (a possible
> >regression in the success rate of memory hotplug and CMA, and a change
> >in the meaning of nr_freepage), to keep backporting possible.
> >2. nr_freepage without considering memory isolation is somewhat dangerous
> >and not suitable for some systems.
> >
> >If you have any objection, please let me know. But I will be on
> >vacation for a week, so I can't answer your further comments
> >until then. I will reply to them next week. :)
> >
> >Thanks.
> >
> >>>3. pages on cma buddy list are counted as CMA freepage, too.
> >>>4. pages for guard are *not* counted as freepage.
> >>>
> >>>Now, I describe problems and related patch.
> >>>
> >>>1. Patch 2: If guard page are cleared and merged into isolate buddy list,
> >>>we should not add freepage count.
> >>>
> >>>2. Patch 3: When the page return back from pcp to buddy, we should
> >>>account it to freepage counter. In this case, we should check the
> >>>pageblock migratetype of the page and should insert the page into
> >>>appropriate buddy list. Although we checked it in current code, we
> >>>didn't insert the page into appropriate buddy list so that freepage
> >>>counting can be wrong.
> >>>
> >>>3. Patch 4: There is race condition so that some freepages could be
> >>>on isolate buddy list. If so, we can't use this page until next isolation
> >>>attempt on this pageblock.
> >>>
> >>>4. Patch 5: There is race condition that page on isolate pageblock
> >>>can go into non-isolate buddy list. If so, buddy allocator would
> >>>merge pages on non-isolate buddy list and isolate buddy list, respectively,
> >>>and freepage count will be wrong.
> >>>
> >>>5. Patch 9: move_freepages(_block) returns *not* number of moved pages.
> >>>Instead, it returns number of pages linked in that migratetype buddy list.
> >>>So accouting with this return value makes freepage count wrong.
> >>>
> >>>6. Patch 10: buddy allocator would merge pages on non-isolate buddy list
> >>>and isolate buddy list, respectively. This leads to freepage counting
> >>>problem so fix it by stopping merging in this case.
> >>>
> >>>Without patchset [1], above problem doesn't happens on my CMA allocation
> >>>test, because CMA reserved pages aren't used at all. So there is no
> >>>chance for above race.
> >>>
> >>>With patchset [1], I did simple CMA allocation test and get below result.
> >>>
> >>>- Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
> >>>- run kernel build (make -j16) on background
> >>>- 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
> >>>- Result: more than 5000 freepage count are missed
> >>>
> >>>With patchset [1] and this patchset, I found that no freepage count are
> >>>missed so that I conclude that problems are solved.
> >>>
> >>>These problems can be possible on memory hot remove users, although
> >>>I didn't check it further.
> >>>
> >>>Other patches are either for the base to fix these problems or for
> >>>simple clean-up. Please see individual patches for more information.
> >>>
> >>>This patchset is based on linux-next-20140703.
> >>>
> >>>Thanks.
> >>>
> >>>[1]: Aggressively allocate the pages on cma reserved memory
> >>>      https://lkml.org/lkml/2014/5/30/291
> >>>
> >>>
> >>>Joonsoo Kim (10):
> >>>   mm/page_alloc: remove unlikely macro on free_one_page()
> >>>   mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
> >>>   mm/page_alloc: handle page on pcp correctly if it's pageblock is
> >>>     isolated
> >>>   mm/page_alloc: carefully free the page on isolate pageblock
> >>>   mm/page_alloc: optimize and unify pageblock migratetype check in free
> >>>     path
> >>>   mm/page_alloc: separate freepage migratetype interface
> >>>   mm/page_alloc: store migratetype of the buddy list into freepage
> >>>     correctly
> >>>   mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
> >>>   mm/page_alloc: fix possible wrongly calculated freepage counter
> >>>   mm/page_alloc: Stop merging pages on non-isolate and isolate buddy
> >>>     list
> >>>
> >>>  include/linux/mm.h             |   30 +++++++--
> >>>  include/linux/mmzone.h         |    5 ++
> >>>  include/linux/page-isolation.h |    8 +++
> >>>  mm/page_alloc.c                |  138 +++++++++++++++++++++++++++++-----------
> >>>  mm/page_isolation.c            |   18 ++----
> >>>  5 files changed, 147 insertions(+), 52 deletions(-)
> >>>
> >>


* Re: [PATCH 03/10] mm/page_alloc: handle page on pcp correctly if it's pageblock is isolated
  2014-07-07 15:19     ` Vlastimil Babka
@ 2014-07-14  6:24       ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-14  6:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel

On Mon, Jul 07, 2014 at 05:19:48PM +0200, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> >If the pageblock of a page on the pcp list is isolated now, we should free the page to
> >the isolate buddy list to prevent future allocation from it. But the current code doesn't
> >do this.
> >
> >Moreover, there is a freepage counting problem in the current code. Although
> >the pageblock of a page on the pcp list is isolated now, the page could go to a normal buddy list,
> >because get_onpcp_migratetype() will return a non-isolate migratetype.
> 
> get_onpcp_migratetype() is only introduced in later patch.

Yes, I will fix it.

> 
> >In this case, we should either add to the freepage count or change the
> >migratetype to MIGRATE_ISOLATE, but the current code does neither.
> 
> I wouldn't say it "does neither". It already limits the freepage
> counting to the !MIGRATE_ISOLATE case (and it's not converted to
> __mod_zone_freepage_state for some reason). So there's an accounting
> mismatch in addition to the buddy list misplacement.

Okay.

> 
> >This patch fixes these two problems by handling the pageblock migratetype
> >before calling __free_one_page(). And, if we find the page on an isolated
> >pageblock, we change its migratetype to MIGRATE_ISOLATE to prevent future
> >allocation of this page and the freepage counting problem.
> 
> So although this is not an addition of a new pageblock migratetype
> check to the fast path (the check is already there), I would prefer
> removing the check :) 

Yes, I want to do it if possible. :)

> With the approach of pcplists draining
> outlined in my reply to 00/10, we would allow a misplacement to
> happen (and the page accounted as freepage) immediately followed by
> move_freepages_block which would place the page onto the isolate freelist
> with the rest. Anything newly freed will get isolate_migratetype
> determined in free_hot_cold_page or __free_pages_ok (where it would
> need moving the migratepage check under the disabled irq part) and
> be placed and buddy-merged properly.

I explained the problem of this approach in 00/10.

Thanks.




* Re: [PATCH 05/10] mm/page_alloc: optimize and unify pageblock migratetype check in free path
  2014-07-07 15:50     ` Vlastimil Babka
@ 2014-07-14  6:28       ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-14  6:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel

On Mon, Jul 07, 2014 at 05:50:09PM +0200, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> >Currently, when we free a page from the pcp list to the buddy allocator, we check
> >the pageblock of the page in order to isolate the page on an isolated
> >pageblock. Although this rarely happens and checking the migratetype of
> >a pageblock is somewhat expensive, we check it on the free fast path. I think
> >that this is undesirable. To prevent this situation, I introduce a new
> >variable, nr_isolate_pageblock, on struct zone and use it to determine
> >if we should check the pageblock migratetype. Isolation of a pageblock rarely
> >happens, so we can mostly avoid this pageblock migratetype check.
> 
> Better, but still there's a zone flag check and maintenance. So if
> it could be avoided, it would be better.
> 
> >Additionally, unify the freepage counting code, because it can be done in
> >the common part, __free_one_page(). This unification provides an extra
> >guarantee that a page on an isolate pageblock doesn't go onto a non-isolate
> >buddy list. This is a situation similar to the one described in the previous
> >patch, so refer to it if you need more explanation.
> 
> You should make it clearer that you are solving misplacement of the
> type "page should be placed on isolated freelist but it's not"
> through free_one_page(), which was solved only for
> free_pcppages_bulk() in patch 03/10. Mentioning patch 04/10 here,
> which solves the opposite problem "page shouldn't be placed on
> isolated freelist, but it is", only confuses the situation. Also
> this patch undoes everything of 04/10 and moves it elsewhere, so
> that would make it harder to git blame etc. I would reorder 04 and
> 05.

Okay. I will clarify what I am solving in the commit description and
reorder the patches appropriately.

Thanks.
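
For reference, a minimal sketch of the nr_isolate_pageblock idea from the
quoted commit message (the zone field is as described above; the helper
name and the call site below are illustrative assumptions, not the actual
patch):

    /* incremented/decremented when a pageblock gains/loses
       MIGRATE_ISOLATE, i.e. in set/unset_migratetype_isolate() */
    struct zone {
            /* ... */
            unsigned long nr_isolate_pageblock;
    };

    static inline bool has_isolate_pageblock(struct zone *zone)
    {
            return zone->nr_isolate_pageblock > 0;
    }

    /* free path: pay for the pageblock lookup only while some
       pageblock in this zone is actually isolated */
    int mt = get_freepage_migratetype(page);
    if (unlikely(has_isolate_pageblock(zone)))
            mt = get_pageblock_migratetype(page);
    /* ... free the page onto the 'mt' buddy list and, unless mt is
       MIGRATE_ISOLATE, count it as a freepage ... */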


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 08/10] mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
  2014-07-07 15:57     ` Vlastimil Babka
@ 2014-07-14  6:34       ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-14  6:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, linux-kernel

On Mon, Jul 07, 2014 at 05:57:49PM +0200, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> >When isolating free page, what we want to know is which list
> >the page is linked. If it is linked in isolate migratetype buddy list,
> >we can skip watermark check and freepage counting. And if it is linked
> >in CMA migratetype buddy list, we need to fixup freepage counting. For
> >this purpose, get_onbuddy_migratetype() is more fit and cheap than
> >get_pageblock_migratetype(). So use it.
> 
> Hm but you made get_onbuddy_migratetype() work only with
> CONFIG_MEMORY_ISOLATION. And __isolate_free_page is (despite the
> name) not at all limited to CONFIG_MEMORY_ISOLATION.

get_onbuddy_migratetype() is only used to determine whether the page is
on the isolate buddy list or not. So, with !CONFIG_MEMORY_ISOLATION, the
default value returned by get_onbuddy_migratetype() keeps things correct.

But I should add a code comment explaining this.
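
Such a comment, together with the !CONFIG_MEMORY_ISOLATION stub, might
look roughly like this (a hypothetical sketch; the actual default value
in the patch may differ):

    #ifdef CONFIG_MEMORY_ISOLATION
    /* returns the migratetype of the buddy list the page sits on */
    int get_onbuddy_migratetype(struct page *page);
    #else
    static inline int get_onbuddy_migratetype(struct page *page)
    {
            /*
             * Without CONFIG_MEMORY_ISOLATION no page is ever on the
             * isolate buddy list, so any value that is never
             * MIGRATE_ISOLATE keeps the callers correct.
             */
            return MIGRATE_UNMOVABLE;
    }
    #endif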

Thanks.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-14  6:22         ` Joonsoo Kim
@ 2014-07-14  9:49           ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-14  9:49 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, Lisa Du,
	linux-kernel

On 07/14/2014 08:22 AM, Joonsoo Kim wrote:
> On Mon, Jul 07, 2014 at 04:33:09PM +0200, Vlastimil Babka wrote:
>> On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
>>> Ccing Lisa, because there was bug report it may be related this
>>> topic last Saturday.
>>>
>>> http://www.spinics.net/lists/linux-mm/msg75741.html
>>>
>>> On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
>>>> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>>>>> Hello,
>>>>
>>>> Hi Joonsoo,
>>>>
>>>> please CC me on further updates, this is relevant to me.
>>>
>>> Hello, Vlastimil.
>>>
>>> Sorry for missing you. :)
>>>
>>>>
>>>>> This patchset aims at fixing problems due to memory isolation found by
>>>>> testing my patchset [1].
>>>>>
>>>>> These are really subtle problems so I can be wrong. If you find what I am
>>>>> missing, please let me know.
>>>>>
>>>>> Before describing bugs itself, I first explain definition of freepage.
>>>>>
>>>>> 1. pages on buddy list are counted as freepage.
>>>>> 2. pages on isolate migratetype buddy list are *not* counted as freepage.
>>>>
>>>> I think the second point is causing us a lot of trouble. And I wonder if it's really
>>>> justified! We already have some is_migrate_isolate() checks in the fast path and now you
>>>> would add more, mostly just to keep this accounting correct.
>>>
>>> It's not just for keeping accouting correct. There is another
>>> purpose for is_migrate_isolate(). It forces freed pages to go into
>>> isolate buddy list. Without it, freed pages would go into other
>>> buddy list and will be used soon. So memory isolation can't work well
>>> without is_migrate_isolate() checks and success rate could decrease.
>>
>> Well I think that could be solved also by doing a lru/pcplists drain
>> right after marking pageblock as MIGRATETYPE_ISOLATE. After that
>> moment, anything newly put on pcplists should observe the new
>> migratetype and go to the correct pcplists. As putting stuff on
>> pcplists is done with disabled interrupts, and draining is done by
>> IPI, I think it should work correctly if we put the migratetype
>> determination under the disabled irq part in free_hot_cold_page().
>
> Hello, sorry for late.
>
> Yes, this can close the race on migratetype buddy list, but, there is
> a problem. See the below.
>
>>
>>> And, I just added three more is_migrate_isolate() in the fast
>>> path, but, two checks are in same *unlikely* branch and I can remove
>>> another one easily. Therefore it's not quite problem I guess. (It even
>>> does no-op if MEMORY_ISOLATION is disabled.)
>>> Moreover, I removed one unconditional get_pageblock_migratetype() in
>>> free_pcppages_bulk() so, in performance point or view, freepath would
>>> be improved.
>>
>> I haven't checked the individual patches yet, but I'll try.
>>
>>>>
>>>> So the question is, does it have to be correct? And (admiteddly not after a completely
>>>> exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
>>>>
>>>> Well I of course don't mean that the freepage accounts could go random completely, but
>>>> what if we allowed them to drift a bit, limiting both the max error and the timeframe
>>>> where errors are possible? After all, watermarks checking is already racy so I don't think
>>>> it would be hurt that much.
>>>
>>> I understand your suggestion. I once thought like as you, but give up
>>> that idea. Watermark checking is already racy, but, it's *only*
>>> protection to prevent memory allocation. After passing that check,
>>> there is no mean to prevent us from allocating memory. So it should
>>> be accurate as much as possible. If we allow someone to get the
>>> memory without considering memory isolation, free memory could be in
>>> really low state and system could be broken occasionally.
>>
>> It would definitely require more thorough review, but I think with
>> the pcplists draining changes outlined above, there shouldn't be any
>> more inaccuracy happening.
>>
>>>>
>>>> Now if we look at what both CMA and memory hot-remove does is:
>>>>
>>>> 1. Mark a MAX_ORDER-aligned buch of pageblocks as MIGRATE_ISOLATE through
>>>> start_isolate_page_range(). As part of this, all free pages in that area are
>>>> moved on the isolate freelist through move_freepages_block().
>>>>
>>>> 2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
>>>> caches.
>>>>
>>>> 3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
>>>> undo pageblock isolation if not.
>>>>
>>>> So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
>>>> allocator, which would therefore account free pages on the isolate freelist as normal free
>>>> pages.
>>>> The accounting of isolated pages would be instead done only on the top level of CMA/hot
>>>> remove in the three steps outlined above, which would be modified as follows:
>>>>
>>>> 1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
>>>> as usual. Calculate X as the number of pages that move_freepages_block() has moved.
>>>> Subtract X from freepages (this is the same as it is done now), but also *remember the
>>>> value of X*
>>>>
>>>> 2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
>>>> isolate freelist, or not. We don't care, they will be accounted as freepages either way.
>>>> This is where some inaccuracy in accounted freepages would build up.
>>>>
>>>> 3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
>>>> that we have a isolated range of N pages that nobody can steal now as everything is on
>>>> isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
>>>> X, then N-X). So the accounting matches reality.
>>>>
>>>> If we have to undo, we undo the isolation and as part of this, we use
>>>> move_freepages_block() to move pages from isolate freelist to the normal ones. But we
>>>> don't care how many pages were moved. We simply add the remembered value of X to the
>>>> number of freepages, undoing the change from step 1. Again, the accounting matches reality.
>>>>
>>>>
>>>> The final point is that if we do this per MAX_ORDER blocks, the error in accounting cannot
>>>> be ever larger than 4MB and will be visible only during time a single MAX_ORDER block is
>>>> handled.
>>>
>>> The 4MB error in accounting looks serious for me. min_free_kbytes is
>>> 4MB in 1GB system. So this 4MB error would makes all things broken in
>>> such systems. Moreover, there are some ARCH having larger
>>> pageblock_order than MAX_ORDER. In this case, the error will be larger
>>> than 4MB.
>>>
>>> In addition, I have a plan to extend CMA to work in parallel. It means
>>> that there could be parallel memory isolation users rather than just
>>> one user at the same time, so, we cannot easily bound the error under
>>> some degree.
>>
>> OK. I had thought that the error would be much smaller in practice,
>> but probably due to how buddy merging works, it would be enough if
>> just the very last freed page in the MAX_ORDER block was misplaced,
>> and that would trigger a cascading merge that will end with single
>> page at MAX_ORDER size becoming misplaced. So let's probably forget
>> this approach.
>>
>>>> As a possible improvement, we can assume during phase 2 that every page freed by migration
>>>> will end up correctly on isolate free list. So we create M free pages by migration, and
>>>> subtract M from freepage account. Then in phase 3 we either subtract (N - X - M), or add X
>>>> + M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
>>>> pages on pcplists and the possible races it will be less). I think with this improvement,
>>>> any error would be negligible.
>>>>
>>>> Thoughts?
>>>
>>> Thanks for suggestion. :)
>>> It is really good topic to think deeply.
>>>
>>> For now, I'd like to fix these problems without side-effect as you
>>> suggested. Your suggestion changes the meaning of freepage that
>>> isolated pages are included in nr_freepage and there could be possible
>>> regression in success rate of memory hotplug and CMA. Possibly, it
>>> is the way we have to go, but, IMHO, it isn't the time to go. Before
>>> going that way, we should fix current implementation first so that
>>> fixes can be backported to old kernel if someone needs.
>>
>> Adding the extra pcplist drains and reordering
>> get_pageblock_migratetype() under disabled IRQ's should be small
>> enough to be backportable and not bring any regressions to success
>> rates. There will be some cost for the extra drains, but paid only
>> when the memory isolation actually happens. Extending the scope of
>> disabled irq's would affect fast path somewhat, but I guess less
>> than extra checks?
>
> Your approach would close the race on migratetype buddy list. But,
> I'm not sure how we can count number of freepages correctly. IIUC,
> unlike my approach, this approach allows merging between isolate buddy
> list and normal buddy list and it would results in miscounting number of
> freepages. You probably think that move_freepages_block() could fixup
> all the situation, but, I don't think so.

No, I didn't think that; I simply overlooked this scenario. Good catch.

> After applying your approach,
>
> CPU1                                    CPU2
> set_migratetype_isolate()               - free_one_page()
>    - lock
>    - set_pageblock_migratetype()
>    - unlock
>    - drain...
>                                          - lock
>                                          - __free_one_page() with MIGRATE_ISOLATE
>                                          - merged with normal page and linked with
>                                          isolate buddy list
>                                          - skip to count freepage, because of
>                                          MIGRATE_ISOLATE
>                                          - unlock
>    - lock
>    - move_freepages_block()
>      try to fixup freepages count
>    - unlock
>
> We want to fixup counting in move_freepages_block(), but, we can't
> because we don't have enough information whether this page is already
> accounted on number of freepage or not. Could you help me to find out
> the whole mechanism of your approach?

I can't think of an easy fix to this situation.

A non-trivial fix that comes to mind (though I might have overlooked 
something) goes roughly like this; a sketch in code follows the list:

- distinguish MIGRATETYPE_ISOLATING and MIGRATETYPE_ISOLATED
- CPU1 first sets MIGRATETYPE_ISOLATING before the drain
- when CPU2 sees MIGRATETYPE_ISOLATING, it just puts the page on a 
special unbounded pcplist and that's it
- CPU1 does the drain as usual, potentially misplacing some pages that 
move_freepages_block() will then fix. But no wrong merging can occur.
- after move_freepages_block(), CPU1 changes MIGRATETYPE_ISOLATING to 
MIGRATETYPE_ISOLATED
- CPU2 can then start freeing directly onto the isolate buddy list. 
There might be some pages still on the special pcplist of CPU2/CPUx, 
but that only means they won't merge yet.
- CPU1 executes on all CPUs a new operation that flushes the special 
pcplist onto the isolate buddy list and merges as needed.
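
In illustrative pseudo-C (MIGRATETYPE_ISOLATING/ISOLATED,
this_cpu_isolating_list(), free_one_page_isolated() and
flush_isolating_pcplists() are hypothetical names for the proposal
above; the other calls are existing kernel functions with simplified
signatures):

    /* CPU1: isolating a pageblock */
    set_pageblock_migratetype(page, MIGRATETYPE_ISOLATING);
    drain_all_pages();          /* misplacement possible, wrong merging not */
    move_freepages_block(zone, page, MIGRATE_ISOLATE);
    set_pageblock_migratetype(page, MIGRATETYPE_ISOLATED);
    flush_isolating_pcplists(); /* new op, run on all CPUs: move the
                                   special pcplist onto the isolate
                                   buddy list, merging as needed */

    /* CPU2 (any CPU): free path */
    switch (get_pageblock_migratetype(page)) {
    case MIGRATETYPE_ISOLATING:
            /* park on the special unbounded pcplist; no buddy merging,
               so no wrong merging and no miscounting */
            list_add(&page->lru, this_cpu_isolating_list());
            break;
    case MIGRATETYPE_ISOLATED:
            /* safe now: free straight onto the isolate buddy list,
               without counting the page as a freepage */
            free_one_page_isolated(page);
            break;
    default:
            /* normal free path, counted as a freepage */
            break;
    }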


> And, IIUC, your approach can remove only one get_pageblock_migratetype()
> in free_pcppages_bulk(). free_hot_cold_page() and free_one_page()
> should have is_migrate_isolate() branches to free the page to correct
> list and count number of freepage correctly. Additionally, it extends
> scope of disabled irq. So it doesn't look much better than mine,
> because my approach also removes get_pageblock_migratetype() in
> free_pcppages_bulk().
>
> Please let me know what I am missing. :)
>
> Thanks.
>
>>
>>> Interestingly,
>>> on last Saturday, Lisa Du reported CMA accounting bugs.
>>>
>>> http://www.spinics.net/lists/linux-mm/msg75741.html
>>>
>>> I don't look at it in detail, but, maybe it is related to these
>>> problems and we should fix it without side-effect.
>>>
>>> So, in conclusion, I think that your suggestion is beyond the scope of
>>> this patchset because of following two reasons.
>>>
>>> 1. I'd like to fix these problems without side-effect(possible
>>> regression in success rate of memory hotplug and CMA, and nr_freepage
>>> meanging change) due to backport possibility.
>>> 2. nr_freepage without considering memory isolation is somewhat dangerous
>>> and not suitable for some systems.
>>>
>>> If you have any objection, please let me know. But, I will go on
>>> a vacation for a week so I can't answer your further comments
>>> for a week. I will reply them next week. :)
>>>
>>> Thanks.
>>>
>>>>> 3. pages on cma buddy list are counted as CMA freepage, too.
>>>>> 4. pages for guard are *not* counted as freepage.
>>>>>
>>>>> Now, I describe problems and related patch.
>>>>>
>>>>> 1. Patch 2: If guard page are cleared and merged into isolate buddy list,
>>>>> we should not add freepage count.
>>>>>
>>>>> 2. Patch 3: When the page return back from pcp to buddy, we should
>>>>> account it to freepage counter. In this case, we should check the
>>>>> pageblock migratetype of the page and should insert the page into
>>>>> appropriate buddy list. Although we checked it in current code, we
>>>>> didn't insert the page into appropriate buddy list so that freepage
>>>>> counting can be wrong.
>>>>>
>>>>> 3. Patch 4: There is race condition so that some freepages could be
>>>>> on isolate buddy list. If so, we can't use this page until next isolation
>>>>> attempt on this pageblock.
>>>>>
>>>>> 4. Patch 5: There is race condition that page on isolate pageblock
>>>>> can go into non-isolate buddy list. If so, buddy allocator would
>>>>> merge pages on non-isolate buddy list and isolate buddy list, respectively,
>>>>> and freepage count will be wrong.
>>>>>
>>>>> 5. Patch 9: move_freepages(_block) returns *not* number of moved pages.
>>>>> Instead, it returns number of pages linked in that migratetype buddy list.
>>>>> So accouting with this return value makes freepage count wrong.
>>>>>
>>>>> 6. Patch 10: buddy allocator would merge pages on non-isolate buddy list
>>>>> and isolate buddy list, respectively. This leads to freepage counting
>>>>> problem so fix it by stopping merging in this case.
>>>>>
>>>>> Without patchset [1], above problem doesn't happens on my CMA allocation
>>>>> test, because CMA reserved pages aren't used at all. So there is no
>>>>> chance for above race.
>>>>>
>>>>> With patchset [1], I did simple CMA allocation test and get below result.
>>>>>
>>>>> - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
>>>>> - run kernel build (make -j16) on background
>>>>> - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
>>>>> - Result: more than 5000 freepage count are missed
>>>>>
>>>>> With patchset [1] and this patchset, I found that no freepage count are
>>>>> missed so that I conclude that problems are solved.
>>>>>
>>>>> These problems can be possible on memory hot remove users, although
>>>>> I didn't check it further.
>>>>>
>>>>> Other patches are either for the base to fix these problems or for
>>>>> simple clean-up. Please see individual patches for more information.
>>>>>
>>>>> This patchset is based on linux-next-20140703.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> [1]: Aggressively allocate the pages on cma reserved memory
>>>>>       https://lkml.org/lkml/2014/5/30/291
>>>>>
>>>>>
>>>>> Joonsoo Kim (10):
>>>>>    mm/page_alloc: remove unlikely macro on free_one_page()
>>>>>    mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
>>>>>    mm/page_alloc: handle page on pcp correctly if it's pageblock is
>>>>>      isolated
>>>>>    mm/page_alloc: carefully free the page on isolate pageblock
>>>>>    mm/page_alloc: optimize and unify pageblock migratetype check in free
>>>>>      path
>>>>>    mm/page_alloc: separate freepage migratetype interface
>>>>>    mm/page_alloc: store migratetype of the buddy list into freepage
>>>>>      correctly
>>>>>    mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
>>>>>    mm/page_alloc: fix possible wrongly calculated freepage counter
>>>>>    mm/page_alloc: Stop merging pages on non-isolate and isolate buddy
>>>>>      list
>>>>>
>>>>>   include/linux/mm.h             |   30 +++++++--
>>>>>   include/linux/mmzone.h         |    5 ++
>>>>>   include/linux/page-isolation.h |    8 +++
>>>>>   mm/page_alloc.c                |  138 +++++++++++++++++++++++++++++-----------
>>>>>   mm/page_isolation.c            |   18 ++----
>>>>>   5 files changed, 147 insertions(+), 52 deletions(-)
>>>>>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-14  9:49           ` Vlastimil Babka
@ 2014-07-15  8:28             ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-15  8:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, Lisa Du,
	linux-kernel

On Mon, Jul 14, 2014 at 11:49:25AM +0200, Vlastimil Babka wrote:
> On 07/14/2014 08:22 AM, Joonsoo Kim wrote:
> >On Mon, Jul 07, 2014 at 04:33:09PM +0200, Vlastimil Babka wrote:
> >>On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
> >>>Ccing Lisa, because there was bug report it may be related this
> >>>topic last Saturday.
> >>>
> >>>http://www.spinics.net/lists/linux-mm/msg75741.html
> >>>
> >>>On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
> >>>>On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> >>>>>Hello,
> >>>>
> >>>>Hi Joonsoo,
> >>>>
> >>>>please CC me on further updates, this is relevant to me.
> >>>
> >>>Hello, Vlastimil.
> >>>
> >>>Sorry for missing you. :)
> >>>
> >>>>
> >>>>>This patchset aims at fixing problems due to memory isolation found by
> >>>>>testing my patchset [1].
> >>>>>
> >>>>>These are really subtle problems so I can be wrong. If you find what I am
> >>>>>missing, please let me know.
> >>>>>
> >>>>>Before describing bugs itself, I first explain definition of freepage.
> >>>>>
> >>>>>1. pages on buddy list are counted as freepage.
> >>>>>2. pages on isolate migratetype buddy list are *not* counted as freepage.
> >>>>
> >>>>I think the second point is causing us a lot of trouble. And I wonder if it's really
> >>>>justified! We already have some is_migrate_isolate() checks in the fast path and now you
> >>>>would add more, mostly just to keep this accounting correct.
> >>>
> >>>It's not just for keeping accouting correct. There is another
> >>>purpose for is_migrate_isolate(). It forces freed pages to go into
> >>>isolate buddy list. Without it, freed pages would go into other
> >>>buddy list and will be used soon. So memory isolation can't work well
> >>>without is_migrate_isolate() checks and success rate could decrease.
> >>
> >>Well I think that could be solved also by doing a lru/pcplists drain
> >>right after marking pageblock as MIGRATETYPE_ISOLATE. After that
> >>moment, anything newly put on pcplists should observe the new
> >>migratetype and go to the correct pcplists. As putting stuff on
> >>pcplists is done with disabled interrupts, and draining is done by
> >>IPI, I think it should work correctly if we put the migratetype
> >>determination under the disabled irq part in free_hot_cold_page().
> >
> >Hello, sorry for late.
> >
> >Yes, this can close the race on migratetype buddy list, but, there is
> >a problem. See the below.
> >
> >>
> >>>And, I just added three more is_migrate_isolate() in the fast
> >>>path, but, two checks are in same *unlikely* branch and I can remove
> >>>another one easily. Therefore it's not quite problem I guess. (It even
> >>>does no-op if MEMORY_ISOLATION is disabled.)
> >>>Moreover, I removed one unconditional get_pageblock_migratetype() in
> >>>free_pcppages_bulk() so, in performance point or view, freepath would
> >>>be improved.
> >>
> >>I haven't checked the individual patches yet, but I'll try.
> >>
> >>>>
> >>>>So the question is, does it have to be correct? And (admiteddly not after a completely
> >>>>exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
> >>>>
> >>>>Well I of course don't mean that the freepage accounts could go random completely, but
> >>>>what if we allowed them to drift a bit, limiting both the max error and the timeframe
> >>>>where errors are possible? After all, watermarks checking is already racy so I don't think
> >>>>it would be hurt that much.
> >>>
> >>>I understand your suggestion. I once thought like as you, but give up
> >>>that idea. Watermark checking is already racy, but, it's *only*
> >>>protection to prevent memory allocation. After passing that check,
> >>>there is no mean to prevent us from allocating memory. So it should
> >>>be accurate as much as possible. If we allow someone to get the
> >>>memory without considering memory isolation, free memory could be in
> >>>really low state and system could be broken occasionally.
> >>
> >>It would definitely require more thorough review, but I think with
> >>the pcplists draining changes outlined above, there shouldn't be any
> >>more inaccuracy happening.
> >>
> >>>>
> >>>>Now if we look at what both CMA and memory hot-remove does is:
> >>>>
> >>>>1. Mark a MAX_ORDER-aligned buch of pageblocks as MIGRATE_ISOLATE through
> >>>>start_isolate_page_range(). As part of this, all free pages in that area are
> >>>>moved on the isolate freelist through move_freepages_block().
> >>>>
> >>>>2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
> >>>>caches.
> >>>>
> >>>>3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
> >>>>undo pageblock isolation if not.
> >>>>
> >>>>So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
> >>>>allocator, which would therefore account free pages on the isolate freelist as normal free
> >>>>pages.
> >>>>The accounting of isolated pages would be instead done only on the top level of CMA/hot
> >>>>remove in the three steps outlined above, which would be modified as follows:
> >>>>
> >>>>1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
> >>>>as usual. Calculate X as the number of pages that move_freepages_block() has moved.
> >>>>Subtract X from freepages (this is the same as it is done now), but also *remember the
> >>>>value of X*
> >>>>
> >>>>2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
> >>>>isolate freelist, or not. We don't care, they will be accounted as freepages either way.
> >>>>This is where some inaccuracy in accounted freepages would build up.
> >>>>
> >>>>3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
> >>>>that we have a isolated range of N pages that nobody can steal now as everything is on
> >>>>isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
> >>>>X, then N-X). So the accounting matches reality.
> >>>>
> >>>>If we have to undo, we undo the isolation and as part of this, we use
> >>>>move_freepages_block() to move pages from isolate freelist to the normal ones. But we
> >>>>don't care how many pages were moved. We simply add the remembered value of X to the
> >>>>number of freepages, undoing the change from step 1. Again, the accounting matches reality.
> >>>>
> >>>>
> >>>>The final point is that if we do this per MAX_ORDER blocks, the error in accounting cannot
> >>>>be ever larger than 4MB and will be visible only during time a single MAX_ORDER block is
> >>>>handled.
> >>>
> >>>The 4MB error in accounting looks serious for me. min_free_kbytes is
> >>>4MB in 1GB system. So this 4MB error would makes all things broken in
> >>>such systems. Moreover, there are some ARCH having larger
> >>>pageblock_order than MAX_ORDER. In this case, the error will be larger
> >>>than 4MB.
> >>>
> >>>In addition, I have a plan to extend CMA to work in parallel. It means
> >>>that there could be parallel memory isolation users rather than just
> >>>one user at the same time, so, we cannot easily bound the error under
> >>>some degree.
> >>
> >>OK. I had thought that the error would be much smaller in practice,
> >>but probably due to how buddy merging works, it would be enough if
> >>just the very last freed page in the MAX_ORDER block was misplaced,
> >>and that would trigger a cascading merge that will end with single
> >>page at MAX_ORDER size becoming misplaced. So let's probably forget
> >>this approach.
> >>
> >>>>As a possible improvement, we can assume during phase 2 that every page freed by migration
> >>>>will end up correctly on isolate free list. So we create M free pages by migration, and
> >>>>subtract M from freepage account. Then in phase 3 we either subtract (N - X - M), or add X
> >>>>+ M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
> >>>>pages on pcplists and the possible races it will be less). I think with this improvement,
> >>>>any error would be negligible.
> >>>>
> >>>>Thoughts?
> >>>
> >>>Thanks for suggestion. :)
> >>>It is really good topic to think deeply.
> >>>
> >>>For now, I'd like to fix these problems without side-effect as you
> >>>suggested. Your suggestion changes the meaning of freepage that
> >>>isolated pages are included in nr_freepage and there could be possible
> >>>regression in success rate of memory hotplug and CMA. Possibly, it
> >>>is the way we have to go, but, IMHO, it isn't the time to go. Before
> >>>going that way, we should fix current implementation first so that
> >>>fixes can be backported to old kernel if someone needs.
> >>
> >>Adding the extra pcplist drains and reordering
> >>get_pageblock_migratetype() under disabled IRQ's should be small
> >>enough to be backportable and not bring any regressions to success
> >>rates. There will be some cost for the extra drains, but paid only
> >>when the memory isolation actually happens. Extending the scope of
> >>disabled irq's would affect fast path somewhat, but I guess less
> >>than extra checks?
> >
> >Your approach would close the race on migratetype buddy list. But,
> >I'm not sure how we can count number of freepages correctly. IIUC,
> >unlike my approach, this approach allows merging between isolate buddy
> >list and normal buddy list and it would results in miscounting number of
> >freepages. You probably think that move_freepages_block() could fixup
> >all the situation, but, I don't think so.
> 
> No, I didn't think that, I simply overlooked this scenario. Good catch.
> 
> >After applying your approach,
> >
> >CPU1                                    CPU2
> >set_migratetype_isolate()               - free_one_page()
> >   - lock
> >   - set_pageblock_migratetype()
> >   - unlock
> >   - drain...
> >                                         - lock
> >                                         - __free_one_page() with MIGRATE_ISOLATE
> >                                         - merged with normal page and linked with
> >                                         isolate buddy list
> >                                         - skip to count freepage, because of
> >                                         MIGRATE_ISOLATE
> >                                         - unlock
> >   - lock
> >   - move_freepages_block()
> >     try to fixup freepages count
> >   - unlock
> >
> >We want to fixup counting in move_freepages_block(), but, we can't
> >because we don't have enough information whether this page is already
> >accounted on number of freepage or not. Could you help me to find out
> >the whole mechanism of your approach?
> 
> I can't think of an easy fix to this situation.
> 
> A non-trivial fix that comes to mind (and I might have overlooked
> something) is something like:
> 
> - distinguish MIGRATETYPE_ISOLATING and MIGRATETYPE_ISOLATED
> - CPU1 first sets MIGRATETYPE_ISOLATING before the drain
> - when CPU2 sees MIGRATETYPE_ISOLATING, it just puts the page on
> special unbounded pcplist and that's it
> - CPU1 does the drain as usual, potentially misplacing some pages
> that move_freepages_block() will then fix. But no wrong merging can
> occur.
> - after move_freepages_block(), CPU1 changes MIGRATETYPE_ISOLATING
> to MIGRATETYPE_ISOLATED
> - CPU2 can then start freeing directly on isolate buddy list. There
> might be some pages still on the special pcplist of CPU2/CPUx but
> that means they won't merge yet.
> - CPU1 executes on all CPU's a new operation that flushes the
> special pcplist on isolate buddy list and merge as needed.
> 

Thanks a lot for sharing the idea.

It looks feasible, but I guess it needs more branches related to
pageblock isolation. I have a rough idea for preventing the merging,
but I'm not sure it is better than the current patchset. After more
thought, I will post the rough idea here.

Thanks.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-15  8:28             ` Joonsoo Kim
@ 2014-07-15  8:36               ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-15  8:36 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, Lisa Du,
	linux-kernel

On 07/15/2014 10:28 AM, Joonsoo Kim wrote:
> On Mon, Jul 14, 2014 at 11:49:25AM +0200, Vlastimil Babka wrote:
>> On 07/14/2014 08:22 AM, Joonsoo Kim wrote:
>>> On Mon, Jul 07, 2014 at 04:33:09PM +0200, Vlastimil Babka wrote:
>>>> On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
>>>>> Ccing Lisa, because there was bug report it may be related this
>>>>> topic last Saturday.
>>>>>
>>>>> http://www.spinics.net/lists/linux-mm/msg75741.html
>>>>>
>>>>> On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
>>>>>> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>>>>>>> Hello,
>>>>>>
>>>>>> Hi Joonsoo,
>>>>>>
>>>>>> please CC me on further updates, this is relevant to me.
>>>>>
>>>>> Hello, Vlastimil.
>>>>>
>>>>> Sorry for missing you. :)
>>>>>
>>>>>>
>>>>>>> This patchset aims at fixing problems due to memory isolation found by
>>>>>>> testing my patchset [1].
>>>>>>>
>>>>>>> These are really subtle problems so I can be wrong. If you find what I am
>>>>>>> missing, please let me know.
>>>>>>>
>>>>>>> Before describing bugs itself, I first explain definition of freepage.
>>>>>>>
>>>>>>> 1. pages on buddy list are counted as freepage.
>>>>>>> 2. pages on isolate migratetype buddy list are *not* counted as freepage.
>>>>>>
>>>>>> I think the second point is causing us a lot of trouble. And I wonder if it's really
>>>>>> justified! We already have some is_migrate_isolate() checks in the fast path and now you
>>>>>> would add more, mostly just to keep this accounting correct.
>>>>>
>>>>> It's not just for keeping accouting correct. There is another
>>>>> purpose for is_migrate_isolate(). It forces freed pages to go into
>>>>> isolate buddy list. Without it, freed pages would go into other
>>>>> buddy list and will be used soon. So memory isolation can't work well
>>>>> without is_migrate_isolate() checks and success rate could decrease.
>>>>
>>>> Well I think that could be solved also by doing a lru/pcplists drain
>>>> right after marking pageblock as MIGRATETYPE_ISOLATE. After that
>>>> moment, anything newly put on pcplists should observe the new
>>>> migratetype and go to the correct pcplists. As putting stuff on
>>>> pcplists is done with disabled interrupts, and draining is done by
>>>> IPI, I think it should work correctly if we put the migratetype
>>>> determination under the disabled irq part in free_hot_cold_page().
>>>
>>> Hello, sorry for late.
>>>
>>> Yes, this can close the race on migratetype buddy list, but, there is
>>> a problem. See the below.
>>>
>>>>
>>>>> And, I just added three more is_migrate_isolate() in the fast
>>>>> path, but, two checks are in same *unlikely* branch and I can remove
>>>>> another one easily. Therefore it's not quite problem I guess. (It even
>>>>> does no-op if MEMORY_ISOLATION is disabled.)
>>>>> Moreover, I removed one unconditional get_pageblock_migratetype() in
>>>>> free_pcppages_bulk() so, in performance point or view, freepath would
>>>>> be improved.
>>>>
>>>> I haven't checked the individual patches yet, but I'll try.
>>>>
>>>>>>
>>>>>> So the question is, does it have to be correct? And (admiteddly not after a completely
>>>>>> exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
>>>>>>
>>>>>> Well I of course don't mean that the freepage accounts could go random completely, but
>>>>>> what if we allowed them to drift a bit, limiting both the max error and the timeframe
>>>>>> where errors are possible? After all, watermarks checking is already racy so I don't think
>>>>>> it would be hurt that much.
>>>>>
>>>>> I understand your suggestion. I once thought like as you, but give up
>>>>> that idea. Watermark checking is already racy, but, it's *only*
>>>>> protection to prevent memory allocation. After passing that check,
>>>>> there is no mean to prevent us from allocating memory. So it should
>>>>> be accurate as much as possible. If we allow someone to get the
>>>>> memory without considering memory isolation, free memory could be in
>>>>> really low state and system could be broken occasionally.
>>>>
>>>> It would definitely require more thorough review, but I think with
>>>> the pcplists draining changes outlined above, there shouldn't be any
>>>> more inaccuracy happening.
>>>>
>>>>>>
>>>>>> Now if we look at what both CMA and memory hot-remove does is:
>>>>>>
>>>>>> 1. Mark a MAX_ORDER-aligned buch of pageblocks as MIGRATE_ISOLATE through
>>>>>> start_isolate_page_range(). As part of this, all free pages in that area are
>>>>>> moved on the isolate freelist through move_freepages_block().
>>>>>>
>>>>>> 2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
>>>>>> caches.
>>>>>>
>>>>>> 3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
>>>>>> undo pageblock isolation if not.
>>>>>>
>>>>>> So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
>>>>>> allocator, which would therefore account free pages on the isolate freelist as normal free
>>>>>> pages.
>>>>>> The accounting of isolated pages would be instead done only on the top level of CMA/hot
>>>>>> remove in the three steps outlined above, which would be modified as follows:
>>>>>>
>>>>>> 1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
>>>>>> as usual. Calculate X as the number of pages that move_freepages_block() has moved.
>>>>>> Subtract X from freepages (this is the same as it is done now), but also *remember the
>>>>>> value of X*
>>>>>>
>>>>>> 2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
>>>>>> isolate freelist, or not. We don't care, they will be accounted as freepages either way.
>>>>>> This is where some inaccuracy in accounted freepages would build up.
>>>>>>
>>>>>> 3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
>>>>>> that we have a isolated range of N pages that nobody can steal now as everything is on
>>>>>> isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
>>>>>> X, then N-X). So the accounting matches reality.
>>>>>>
>>>>>> If we have to undo, we undo the isolation and as part of this, we use
>>>>>> move_freepages_block() to move pages from isolate freelist to the normal ones. But we
>>>>>> don't care how many pages were moved. We simply add the remembered value of X to the
>>>>>> number of freepages, undoing the change from step 1. Again, the accounting matches reality.
>>>>>>
>>>>>>
>>>>>> The final point is that if we do this per MAX_ORDER blocks, the error in accounting cannot
>>>>>> be ever larger than 4MB and will be visible only during time a single MAX_ORDER block is
>>>>>> handled.
>>>>>
>>>>> The 4MB error in accounting looks serious for me. min_free_kbytes is
>>>>> 4MB in 1GB system. So this 4MB error would makes all things broken in
>>>>> such systems. Moreover, there are some ARCH having larger
>>>>> pageblock_order than MAX_ORDER. In this case, the error will be larger
>>>>> than 4MB.
>>>>>
>>>>> In addition, I have a plan to extend CMA to work in parallel. It means
>>>>> that there could be parallel memory isolation users rather than just
>>>>> one user at the same time, so, we cannot easily bound the error under
>>>>> some degree.
>>>>
>>>> OK. I had thought that the error would be much smaller in practice,
>>>> but probably due to how buddy merging works, it would be enough if
>>>> just the very last freed page in the MAX_ORDER block was misplaced,
>>>> and that would trigger a cascading merge that will end with single
>>>> page at MAX_ORDER size becoming misplaced. So let's probably forget
>>>> this approach.
>>>>
>>>>>> As a possible improvement, we can assume during phase 2 that every page freed by migration
>>>>>> will end up correctly on isolate free list. So we create M free pages by migration, and
>>>>>> subtract M from freepage account. Then in phase 3 we either subtract (N - X - M), or add X
>>>>>> + M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
>>>>>> pages on pcplists and the possible races it will be less). I think with this improvement,
>>>>>> any error would be negligible.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Thanks for suggestion. :)
>>>>> It is really good topic to think deeply.
>>>>>
>>>>> For now, I'd like to fix these problems without side-effect as you
>>>>> suggested. Your suggestion changes the meaning of freepage that
>>>>> isolated pages are included in nr_freepage and there could be possible
>>>>> regression in success rate of memory hotplug and CMA. Possibly, it
>>>>> is the way we have to go, but, IMHO, it isn't the time to go. Before
>>>>> going that way, we should fix current implementation first so that
>>>>> fixes can be backported to old kernel if someone needs.
>>>>
>>>> Adding the extra pcplist drains and reordering
>>>> get_pageblock_migratetype() under disabled IRQ's should be small
>>>> enough to be backportable and not bring any regressions to success
>>>> rates. There will be some cost for the extra drains, but paid only
>>>> when the memory isolation actually happens. Extending the scope of
>>>> disabled irq's would affect fast path somewhat, but I guess less
>>>> than extra checks?
>>>
>>> Your approach would close the race on migratetype buddy list. But,
>>> I'm not sure how we can count number of freepages correctly. IIUC,
>>> unlike my approach, this approach allows merging between isolate buddy
>>> list and normal buddy list and it would results in miscounting number of
>>> freepages. You probably think that move_freepages_block() could fixup
>>> all the situation, but, I don't think so.
>>
>> No, I didn't think that, I simply overlooked this scenario. Good catch.
>>
>>> After applying your approach,
>>>
>>> CPU1                                    CPU2
>>> set_migratetype_isolate()               - free_one_page()
>>>    - lock
>>>    - set_pageblock_migratetype()
>>>    - unlock
>>>    - drain...
>>>                                          - lock
>>>                                          - __free_one_page() with MIGRATE_ISOLATE
>>>                                          - merged with normal page and linked with
>>>                                          isolate buddy list
>>>                                          - skip to count freepage, because of
>>>                                          MIGRATE_ISOLATE
>>>                                          - unlock
>>>    - lock
>>>    - move_freepages_block()
>>>      try to fixup freepages count
>>>    - unlock
>>>
>>> We want to fixup counting in move_freepages_block(), but, we can't
>>> because we don't have enough information whether this page is already
>>> accounted on number of freepage or not. Could you help me to find out
>>> the whole mechanism of your approach?
>>
>> I can't think of an easy fix to this situation.
>>
>> A non-trivial fix that comes to mind (and I might have overlooked
>> something) is something like:
>>
>> - distinguish MIGRATETYPE_ISOLATING and MIGRATETYPE_ISOLATED
>> - CPU1 first sets MIGRATETYPE_ISOLATING before the drain
>> - when CPU2 sees MIGRATETYPE_ISOLATING, it just puts the page on
>> special unbounded pcplist and that's it
>> - CPU1 does the drain as usual, potentially misplacing some pages
>> that move_freepages_block() will then fix. But no wrong merging can
>> occur.
>> - after move_freepages_block(), CPU1 changes MIGRATETYPE_ISOLATING
>> to MIGRATETYPE_ISOLATED
>> - CPU2 can then start freeing directly on isolate buddy list. There
>> might be some pages still on the special pcplist of CPU2/CPUx but
>> that means they won't merge yet.
>> - CPU1 executes on all CPU's a new operation that flushes the
>> special pcplist on isolate buddy list and merge as needed.
>>
>
> Really, thanks for sharing the idea.

Ah, you didn't find a hole yet, good sign :D

> It looks possible, but I guess it needs more branches related to
> pageblock isolation. I have a quick thought on how to prevent the
> merging, but I'm not sure it is better than the current patchset.
> After more thinking, I will post the rough idea here.

I was thinking about it more and maybe it wouldn't need a new
migratetype after all. But it would always need to free isolate pages on
the special pcplist. That means this pcplist would be used not only
during the call to start_isolate_page_range(), but all the way until
undo_isolate_page_range(). I don't think that's a problem, and it
simplifies things. The only way to move pages to the isolate freelist is
through the new isolate pcplist flush operation, initiated by a single
CPU at a well-defined time.
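
Roughly, the free path could then look like this (a minimal sketch
only, completely untested; the isolate_list member of struct
per_cpu_pages is hypothetical):

/*
 * While the pageblock is isolated, park the page on an unbounded
 * per-cpu isolate list instead of the buddy lists. No merging and
 * no freepage accounting happens here; both are deferred to the
 * flush operation.
 */
static void free_isolated_page(struct zone *zone, struct page *page,
			       unsigned int order)
{
	struct per_cpu_pages *pcp;
	unsigned long flags;

	local_irq_save(flags);
	pcp = &this_cpu_ptr(zone->pageset)->pcp;
	set_page_private(page, order);	/* remember order for the flush */
	list_add(&page->lru, &pcp->isolate_list);	/* hypothetical */
	local_irq_restore(flags);
}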

The undo would look like:
- (the migratetype is still set to MIGRATETYPE_ISOLATE, so CPU2 frees
affected pages to the special pcplist)
- CPU1 does move_freepages_block() to put pages back from the isolate
freelist to e.g. MOVABLE or CMA. At this point, nobody will put new
pages on the isolate freelist.
- CPU1 changes the migratetype of the pageblock to e.g. MOVABLE. CPU2
and others start freeing normally. Merging can occur only on the
MOVABLE freelist, as the isolate freelist is empty and nobody puts
pages there.
- CPU1 flushes the isolate pcplists of all CPUs onto the MOVABLE
freelist. Merging is again correct.
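
In code form, the undo sequence could look roughly like this (again
just a sketch; flush_isolate_pcplists() stands for the hypothetical
new flush operation):

static void undo_isolate_pageblock(struct zone *zone, struct page *page,
				   int migratetype)
{
	unsigned long flags;

	spin_lock_irqsave(&zone->lock, flags);
	/* 1) empty the isolate freelist back to e.g. MOVABLE */
	move_freepages_block(zone, page, migratetype);
	/* 2) from now on, other CPUs free to MOVABLE directly */
	set_pageblock_migratetype(page, migratetype);
	spin_unlock_irqrestore(&zone->lock, flags);

	/* 3) merge the parked per-cpu pages onto the MOVABLE freelist */
	flush_isolate_pcplists(zone, migratetype);	/* hypothetical */
}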

I think your plan of multiple parallel CMA allocations (and thus
multiple parallel isolations) is also possible. The isolate pcplists can
be shared by pages coming from multiple parallel isolations. But the
flush operation needs pfn start/end parameters to flush only the pages
belonging to the given isolation. That might mean a bit of inefficient
list traversal, but I don't think it's a problem.
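
A sketch of that flush with the pfn filter might be (in reality it
would have to visit each CPU's list, e.g. via on_each_cpu();
isolate_list and the order stored via page_private() are the
hypothetical bits from the sketch above):

static void flush_isolate_pcplist(struct zone *zone,
				  struct per_cpu_pages *pcp,
				  unsigned long start_pfn,
				  unsigned long end_pfn,
				  int migratetype)
{
	struct page *page, *next;
	unsigned long flags;

	spin_lock_irqsave(&zone->lock, flags);
	list_for_each_entry_safe(page, next, &pcp->isolate_list, lru) {
		unsigned long pfn = page_to_pfn(page);
		unsigned int order = page_private(page);

		if (pfn < start_pfn || pfn >= end_pfn)
			continue;	/* belongs to another isolation */

		list_del(&page->lru);
		/* merge into the target freelist and fix the accounting */
		__free_one_page(page, zone, order, migratetype);
		__mod_zone_freepage_state(zone, 1 << order, migratetype);
	}
	spin_unlock_irqrestore(&zone->lock, flags);
}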

Makes sense?
Vlastimil

> Thanks.
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-15  8:36               ` Vlastimil Babka
  (?)
@ 2014-07-15  9:39               ` Joonsoo Kim
  2014-07-15 10:00                 ` Peter Zijlstra
  -1 siblings, 1 reply; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-15  9:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Andrew Morton, Kirill A. Shutemov, Rik van Riel,
	Peter Zijlstra, Mel Gorman, Johannes Weiner, Minchan Kim,
	Yasuaki Ishimatsu, Zhang Yanfei, Srivatsa S. Bhat, Tang Chen,
	Naoya Horiguchi, Bartlomiej Zolnierkiewicz, Wen Congyang,
	Marek Szyprowski, Michal Nazarewicz, Laura Abbott, Heesub Shin,
	Aneesh Kumar K.V, Ritesh Harjani, t.stanislaws, Gioh Kim,
	Linux Memory Management List, Lisa Du, LKML

2014-07-15 17:36 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
> On 07/15/2014 10:28 AM, Joonsoo Kim wrote:
>>
>> On Mon, Jul 14, 2014 at 11:49:25AM +0200, Vlastimil Babka wrote:
>>>
>>> On 07/14/2014 08:22 AM, Joonsoo Kim wrote:
>>>>
>>>> On Mon, Jul 07, 2014 at 04:33:09PM +0200, Vlastimil Babka wrote:
>>>>>
>>>>> On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
>>>>>>
>>>>>> Ccing Lisa, because there was bug report it may be related this
>>>>>> topic last Saturday.
>>>>>>
>>>>>> http://www.spinics.net/lists/linux-mm/msg75741.html
>>>>>>
>>>>>> On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
>>>>>>>
>>>>>>> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>
>>>>>>>
>>>>>>> Hi Joonsoo,
>>>>>>>
>>>>>>> please CC me on further updates, this is relevant to me.
>>>>>>
>>>>>>
>>>>>> Hello, Vlastimil.
>>>>>>
>>>>>> Sorry for missing you. :)
>>>>>>
>>>>>>>
>>>>>>>> This patchset aims at fixing problems due to memory isolation found
>>>>>>>> by
>>>>>>>> testing my patchset [1].
>>>>>>>>
>>>>>>>> These are really subtle problems so I can be wrong. If you find what
>>>>>>>> I am
>>>>>>>> missing, please let me know.
>>>>>>>>
>>>>>>>> Before describing bugs itself, I first explain definition of
>>>>>>>> freepage.
>>>>>>>>
>>>>>>>> 1. pages on buddy list are counted as freepage.
>>>>>>>> 2. pages on isolate migratetype buddy list are *not* counted as
>>>>>>>> freepage.
>>>>>>>
>>>>>>>
>>>>>>> I think the second point is causing us a lot of trouble. And I wonder
>>>>>>> if it's really
>>>>>>> justified! We already have some is_migrate_isolate() checks in the
>>>>>>> fast path and now you
>>>>>>> would add more, mostly just to keep this accounting correct.
>>>>>>
>>>>>>
>>>>>> It's not just for keeping accouting correct. There is another
>>>>>> purpose for is_migrate_isolate(). It forces freed pages to go into
>>>>>> isolate buddy list. Without it, freed pages would go into other
>>>>>> buddy list and will be used soon. So memory isolation can't work well
>>>>>> without is_migrate_isolate() checks and success rate could decrease.
>>>>>
>>>>>
>>>>> Well I think that could be solved also by doing a lru/pcplists drain
>>>>> right after marking pageblock as MIGRATETYPE_ISOLATE. After that
>>>>> moment, anything newly put on pcplists should observe the new
>>>>> migratetype and go to the correct pcplists. As putting stuff on
>>>>> pcplists is done with disabled interrupts, and draining is done by
>>>>> IPI, I think it should work correctly if we put the migratetype
>>>>> determination under the disabled irq part in free_hot_cold_page().
>>>>
>>>>
>>>> Hello, sorry for late.
>>>>
>>>> Yes, this can close the race on migratetype buddy list, but, there is
>>>> a problem. See the below.
>>>>
>>>>>
>>>>>> And, I just added three more is_migrate_isolate() in the fast
>>>>>> path, but, two checks are in same *unlikely* branch and I can remove
>>>>>> another one easily. Therefore it's not quite problem I guess. (It even
>>>>>> does no-op if MEMORY_ISOLATION is disabled.)
>>>>>> Moreover, I removed one unconditional get_pageblock_migratetype() in
>>>>>> free_pcppages_bulk() so, in performance point or view, freepath would
>>>>>> be improved.
>>>>>
>>>>>
>>>>> I haven't checked the individual patches yet, but I'll try.
>>>>>
>>>>>>>
>>>>>>> So the question is, does it have to be correct? And (admiteddly not
>>>>>>> after a completely
>>>>>>> exhaustive analysis) I think the answer is, surprisingly, that it
>>>>>>> doesn't :)
>>>>>>>
>>>>>>> Well I of course don't mean that the freepage accounts could go
>>>>>>> random completely, but
>>>>>>> what if we allowed them to drift a bit, limiting both the max error
>>>>>>> and the timeframe
>>>>>>> where errors are possible? After all, watermarks checking is already
>>>>>>> racy so I don't think
>>>>>>> it would be hurt that much.
>>>>>>
>>>>>>
>>>>>> I understand your suggestion. I once thought like as you, but give up
>>>>>> that idea. Watermark checking is already racy, but, it's *only*
>>>>>> protection to prevent memory allocation. After passing that check,
>>>>>> there is no mean to prevent us from allocating memory. So it should
>>>>>> be accurate as much as possible. If we allow someone to get the
>>>>>> memory without considering memory isolation, free memory could be in
>>>>>> really low state and system could be broken occasionally.
>>>>>
>>>>>
>>>>> It would definitely require more thorough review, but I think with
>>>>> the pcplists draining changes outlined above, there shouldn't be any
>>>>> more inaccuracy happening.
>>>>>
>>>>>>>
>>>>>>> Now if we look at what both CMA and memory hot-remove does is:
>>>>>>>
>>>>>>> 1. Mark a MAX_ORDER-aligned buch of pageblocks as MIGRATE_ISOLATE
>>>>>>> through
>>>>>>> start_isolate_page_range(). As part of this, all free pages in that
>>>>>>> area are
>>>>>>> moved on the isolate freelist through move_freepages_block().
>>>>>>>
>>>>>>> 2. Try to migrate away all non-free pages in the range. Also drain
>>>>>>> pcplists and lru_add
>>>>>>> caches.
>>>>>>>
>>>>>>> 3. Check if everything was successfully isolated by
>>>>>>> test_pages_isolated(). Restart and/or
>>>>>>> undo pageblock isolation if not.
>>>>>>>
>>>>>>> So my idea is to throw away all special-casing of
>>>>>>> is_migrate_isolate() in the buddy
>>>>>>> allocator, which would therefore account free pages on the isolate
>>>>>>> freelist as normal free
>>>>>>> pages.
>>>>>>> The accounting of isolated pages would be instead done only on the
>>>>>>> top level of CMA/hot
>>>>>>> remove in the three steps outlined above, which would be modified as
>>>>>>> follows:
>>>>>>>
>>>>>>> 1. Calculate N as the target number of pages to be isolated. Perform
>>>>>>> the actions of step 1
>>>>>>> as usual. Calculate X as the number of pages that
>>>>>>> move_freepages_block() has moved.
>>>>>>> Subtract X from freepages (this is the same as it is done now), but
>>>>>>> also *remember the
>>>>>>> value of X*
>>>>>>>
>>>>>>> 2. Migrate and drain pcplists as usual. The new free pages will
>>>>>>> either end up correctly on
>>>>>>> isolate freelist, or not. We don't care, they will be accounted as
>>>>>>> freepages either way.
>>>>>>> This is where some inaccuracy in accounted freepages would build up.
>>>>>>>
>>>>>>> 3. If test_pages_isolated() checks pass, subtract (N - X) from
>>>>>>> freepages. The result is
>>>>>>> that we have a isolated range of N pages that nobody can steal now as
>>>>>>> everything is on
>>>>>>> isolate freelist and is MAX_ORDER aligned. And we have in total
>>>>>>> subtracted N pages (first
>>>>>>> X, then N-X). So the accounting matches reality.
>>>>>>>
>>>>>>> If we have to undo, we undo the isolation and as part of this, we use
>>>>>>> move_freepages_block() to move pages from isolate freelist to the
>>>>>>> normal ones. But we
>>>>>>> don't care how many pages were moved. We simply add the remembered
>>>>>>> value of X to the
>>>>>>> number of freepages, undoing the change from step 1. Again, the
>>>>>>> accounting matches reality.
>>>>>>>
>>>>>>>
>>>>>>> The final point is that if we do this per MAX_ORDER blocks, the error
>>>>>>> in accounting cannot
>>>>>>> be ever larger than 4MB and will be visible only during time a single
>>>>>>> MAX_ORDER block is
>>>>>>> handled.
>>>>>>
>>>>>>
>>>>>> The 4MB error in accounting looks serious for me. min_free_kbytes is
>>>>>> 4MB in 1GB system. So this 4MB error would makes all things broken in
>>>>>> such systems. Moreover, there are some ARCH having larger
>>>>>> pageblock_order than MAX_ORDER. In this case, the error will be larger
>>>>>> than 4MB.
>>>>>>
>>>>>> In addition, I have a plan to extend CMA to work in parallel. It means
>>>>>> that there could be parallel memory isolation users rather than just
>>>>>> one user at the same time, so, we cannot easily bound the error under
>>>>>> some degree.
>>>>>
>>>>>
>>>>> OK. I had thought that the error would be much smaller in practice,
>>>>> but probably due to how buddy merging works, it would be enough if
>>>>> just the very last freed page in the MAX_ORDER block was misplaced,
>>>>> and that would trigger a cascading merge that will end with single
>>>>> page at MAX_ORDER size becoming misplaced. So let's probably forget
>>>>> this approach.
>>>>>
>>>>>>> As a possible improvement, we can assume during phase 2 that every
>>>>>>> page freed by migration
>>>>>>> will end up correctly on isolate free list. So we create M free pages
>>>>>>> by migration, and
>>>>>>> subtract M from freepage account. Then in phase 3 we either subtract
>>>>>>> (N - X - M), or add X
>>>>>>> + M in the undo case. (Ideally, if we succeed, X + M should be equal
>>>>>>> to N, but due to
>>>>>>> pages on pcplists and the possible races it will be less). I think
>>>>>>> with this improvement,
>>>>>>> any error would be negligible.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>>
>>>>>> Thanks for suggestion. :)
>>>>>> It is really good topic to think deeply.
>>>>>>
>>>>>> For now, I'd like to fix these problems without side-effect as you
>>>>>> suggested. Your suggestion changes the meaning of freepage that
>>>>>> isolated pages are included in nr_freepage and there could be possible
>>>>>> regression in success rate of memory hotplug and CMA. Possibly, it
>>>>>> is the way we have to go, but, IMHO, it isn't the time to go. Before
>>>>>> going that way, we should fix current implementation first so that
>>>>>> fixes can be backported to old kernel if someone needs.
>>>>>
>>>>>
>>>>> Adding the extra pcplist drains and reordering
>>>>> get_pageblock_migratetype() under disabled IRQ's should be small
>>>>> enough to be backportable and not bring any regressions to success
>>>>> rates. There will be some cost for the extra drains, but paid only
>>>>> when the memory isolation actually happens. Extending the scope of
>>>>> disabled irq's would affect fast path somewhat, but I guess less
>>>>> than extra checks?
>>>>
>>>>
>>>> Your approach would close the race on migratetype buddy list. But,
>>>> I'm not sure how we can count number of freepages correctly. IIUC,
>>>> unlike my approach, this approach allows merging between isolate buddy
>>>> list and normal buddy list and it would results in miscounting number of
>>>> freepages. You probably think that move_freepages_block() could fixup
>>>> all the situation, but, I don't think so.
>>>
>>>
>>> No, I didn't think that, I simply overlooked this scenario. Good catch.
>>>
>>>> After applying your approach,
>>>>
>>>> CPU1                                    CPU2
>>>> set_migratetype_isolate()               - free_one_page()
>>>>    - lock
>>>>    - set_pageblock_migratetype()
>>>>    - unlock
>>>>    - drain...
>>>>                                          - lock
>>>>                                          - __free_one_page() with
>>>> MIGRATE_ISOLATE
>>>>                                          - merged with normal page and
>>>> linked with
>>>>                                          isolate buddy list
>>>>                                          - skip to count freepage,
>>>> because of
>>>>                                          MIGRATE_ISOLATE
>>>>                                          - unlock
>>>>    - lock
>>>>    - move_freepages_block()
>>>>      try to fixup freepages count
>>>>    - unlock
>>>>
>>>> We want to fixup counting in move_freepages_block(), but, we can't
>>>> because we don't have enough information whether this page is already
>>>> accounted on number of freepage or not. Could you help me to find out
>>>> the whole mechanism of your approach?
>>>
>>>
>>> I can't think of an easy fix to this situation.
>>>
>>> A non-trivial fix that comes to mind (and I might have overlooked
>>> something) is something like:
>>>
>>> - distinguish MIGRATETYPE_ISOLATING and MIGRATETYPE_ISOLATED
>>> - CPU1 first sets MIGRATETYPE_ISOLATING before the drain
>>> - when CPU2 sees MIGRATETYPE_ISOLATING, it just puts the page on
>>> special unbounded pcplist and that's it
>>> - CPU1 does the drain as usual, potentially misplacing some pages
>>> that move_freepages_block() will then fix. But no wrong merging can
>>> occur.
>>> - after move_freepages_block(), CPU1 changes MIGRATETYPE_ISOLATING
>>> to MIGRATETYPE_ISOLATED
>>> - CPU2 can then start freeing directly on isolate buddy list. There
>>> might be some pages still on the special pcplist of CPU2/CPUx but
>>> that means they won't merge yet.
>>> - CPU1 executes on all CPU's a new operation that flushes the
>>> special pcplist on isolate buddy list and merge as needed.
>>>
>>
>> Really, thanks for sharing the idea.
>
>
> Ah, you didn't find a hole yet, good sign :D
>
>
>> It looks possible, but I guess it needs more branches related to
>> pageblock isolation. I have a quick thought on how to prevent the
>> merging, but I'm not sure it is better than the current patchset.
>> After more thinking, I will post the rough idea here.
>
>
> I was thinking about it more and maybe it wouldn't need a new migratetype
> after all. But it would always need to free isolate pages on the special
> pcplist.

Yes, I have also thought of this extended idea. :)

> That means this pcplist would be used not only during the call to
> start_isolate_page_range(), but all the way until undo_isolate_page_range().
> I don't think that's a problem, and it simplifies things. The only way to
> move pages to the isolate freelist is through the new isolate pcplist flush
> operation, initiated by a single CPU at a well-defined time.
>
> The undo would look like:
> - (the migratetype is still set to MIGRATETYPE_ISOLATE, so CPU2 frees
> affected pages to the special pcplist)
> - CPU1 does move_freepages_block() to put pages back from the isolate
> freelist to e.g. MOVABLE or CMA. At this point, nobody will put new pages
> on the isolate freelist.
> - CPU1 changes the migratetype of the pageblock to e.g. MOVABLE. CPU2 and
> others start freeing normally. Merging can occur only on the MOVABLE
> freelist, as the isolate freelist is empty and nobody puts pages there.
> - CPU1 flushes the isolate pcplists of all CPUs onto the MOVABLE freelist.
> Merging is again correct.
>
> I think your plan of multiple parallel CMA allocations (and thus multiple
> parallel isolations) is also possible. The isolate pcplists can be shared by
> pages coming from multiple parallel isolations. But the flush operation
> needs pfn start/end parameters to flush only the pages belonging to the
> given isolation. That might mean a bit of inefficient list traversal, but I
> don't think it's a problem.
>
> Makes sense?

Yes.
Anyway, I will post my idea tomorrow since I'm at home now.

Thanks.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-15  9:39               ` Joonsoo Kim
@ 2014-07-15 10:00                 ` Peter Zijlstra
  2014-07-16  8:44                   ` Joonsoo Kim
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2014-07-15 10:00 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vlastimil Babka, Joonsoo Kim, Andrew Morton, Kirill A. Shutemov,
	Rik van Riel, Mel Gorman, Johannes Weiner, Minchan Kim,
	Yasuaki Ishimatsu, Zhang Yanfei, Srivatsa S. Bhat, Tang Chen,
	Naoya Horiguchi, Bartlomiej Zolnierkiewicz, Wen Congyang,
	Marek Szyprowski, Michal Nazarewicz, Laura Abbott, Heesub Shin,
	Aneesh Kumar K.V, Ritesh Harjani, t.stanislaws, Gioh Kim,
	Linux Memory Management List, Lisa Du, LKML

On Tue, Jul 15, 2014 at 06:39:20PM +0900, Joonsoo Kim wrote:
> >>>>>>>>

tl;dr, if you want me to read your emails, trim them.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-15  8:36               ` Vlastimil Babka
@ 2014-07-16  8:43                 ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-16  8:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, Lisa Du,
	linux-kernel

On Tue, Jul 15, 2014 at 10:36:35AM +0200, Vlastimil Babka wrote:
> >>A non-trivial fix that comes to mind (and I might have overlooked
> >>something) is something like:
> >>
> >>- distinguish MIGRATETYPE_ISOLATING and MIGRATETYPE_ISOLATED
> >>- CPU1 first sets MIGRATETYPE_ISOLATING before the drain
> >>- when CPU2 sees MIGRATETYPE_ISOLATING, it just puts the page on
> >>special unbounded pcplist and that's it
> >>- CPU1 does the drain as usual, potentially misplacing some pages
> >>that move_freepages_block() will then fix. But no wrong merging can
> >>occur.
> >>- after move_freepages_block(), CPU1 changes MIGRATETYPE_ISOLATING
> >>to MIGRATETYPE_ISOLATED
> >>- CPU2 can then start freeing directly on isolate buddy list. There
> >>might be some pages still on the special pcplist of CPU2/CPUx but
> >>that means they won't merge yet.
> >>- CPU1 executes on all CPU's a new operation that flushes the
> >>special pcplist on isolate buddy list and merge as needed.
> >>
> >
> >Really thanks for sharing idea.
> 
> Ah, you didn't find a hole yet, good sign :D
> 
> >It looks possible but I guess that it needs more branches related to
> >pageblock isolation. Now I have a quick thought to prevent merging,
> >but, I'm not sure that it is better than current patchset. After more
> >thinking, I will post rough idea here.
> 
> I was thinking about it more and maybe it wouldn't need a new
> migratetype after all. But it would always need to free isolate
> pages on the special pcplist. That means this pcplist would be used
> not only during the call to start_isolate_page_range, but all the
> way until undo_isolate_page_range(). I don't think it's a problem
> and it simplifies things. The only way to move to isolate freelist
> is through the new isolate pcplist flush operation initiated by a
> single CPU at well defined time.
> 
> The undo would look like:
> - (migratetype is still set to MIGRATETYPE_ISOLATE, CPU2 frees
> affected pages to the special freelist)
> - CPU1 does move_freepages_block() to put pages back from isolate
> freelist to e.g. MOVABLE or CMA. At this point, nobody will put new
> pages on isolate freelist.
> - CPU1 changes migratetype of the pageblock to e.g. MOVABLE. CPU2
> and others start freeing normally. Merging can occur only on the
> MOVABLE freelist, as isolate freelist is empty and nobody puts pages
> there.
> - CPU1 flushes the isolate pcplists of all CPU's on the MOVABLE
> freelist. Merging is again correct.
> 
> I think your plan of multiple parallel CMA allocations (and thus
> multiple parallel isolations) is also possible. The isolate pcplists
> can be shared by pages coming from multiple parallel isolations. But
> the flush operation needs a pfn start/end parameters to only flush
> pages belonging to the given isolation. That might mean a bit of
> inefficient list traversing, but I don't think it's a problem.

I think that the special pcplist would cause a problem if we have to
check a pfn range. If there are too many pages on this pcplist, moving
them to the isolate freelist takes too long in irq context and the
system could break. This operation cannot easily be stopped, because it
is initiated by an IPI on another cpu, and the initiator of the IPI
expects that all pages on the other cpus' pcplists have been moved
properly by the time on_each_cpu() returns.

And, if there are that many pages, serious lock contention would also
happen in this case.
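
To show what I mean, the flush would have to look something like the
sketch below, as if it lived in mm/page_alloc.c (the isolate list in
struct per_cpu_pages and both helpers are invented here, and the
__free_one_page() signature is abbreviated; this is not current code):

struct isolate_flush_arg {
        struct zone *zone;
        unsigned long start_pfn, end_pfn;
};

/* runs on every cpu via IPI, with irqs off */
static void flush_isolate_pcplist(void *data)
{
        struct isolate_flush_arg *arg = data;
        struct per_cpu_pages *pcp;
        struct page *page, *next;
        unsigned long flags;

        local_irq_save(flags);
        pcp = &this_cpu_ptr(arg->zone->pageset)->pcp;
        spin_lock(&arg->zone->lock);
        list_for_each_entry_safe(page, next, &pcp->isolate_list, lru) {
                unsigned long pfn = page_to_pfn(page);

                /* pfn filter: skip pages of other, parallel isolations */
                if (pfn < arg->start_pfn || pfn >= arg->end_pfn)
                        continue;
                list_del(&page->lru);
                /* merge into the isolate buddy list under zone->lock */
                __free_one_page(page, arg->zone, 0, MIGRATE_ISOLATE);
        }
        spin_unlock(&arg->zone->lock);
        local_irq_restore(flags);
}

static void flush_all_isolate_pcplists(struct zone *zone,
                                       unsigned long start_pfn,
                                       unsigned long end_pfn)
{
        struct isolate_flush_arg arg = {
                .zone = zone, .start_pfn = start_pfn, .end_pfn = end_pfn,
        };

        /* cannot return until every cpu has finished the callback */
        on_each_cpu(flush_isolate_pcplist, &arg, 1);
}

With an unbounded isolate pcplist, the loop runs with irqs off and
zone->lock held for an unbounded time, which is exactly my worry.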

Anyway, the key point of my idea is to use PageIsolated() to distinguish
isolated pages, instead of using PageBuddy(). If a page is PageIsolated(),
it isn't handled as a freepage even though it sits in the buddy allocator.
During free, a page with MIGRATETYPE_ISOLATE will be marked as
PageIsolated(), and it won't be merged or counted as a freepage.

When we move pages from a normal buddy list to the isolate buddy list,
we check PageBuddy() and subtract the number of PageBuddy() pages from
the freepage count. We also change the pages from PageBuddy() to
PageIsolated(), since they are handled as isolated pages from this point
on. In this way, the freepage count stays correct.
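
In rough code, the move-and-mark step I have in mind looks like this (a
sketch; PageBuddy(), __ClearPageBuddy() and page_order() exist in mm code
today, while __SetPageIsolated() is the new, hypothetical marker):

/* walk one pageblock by pfn; caller holds zone->lock */
static unsigned long move_and_mark_isolated(struct zone *zone,
                        unsigned long start_pfn, unsigned long end_pfn)
{
        unsigned long pfn = start_pfn, nr_moved = 0;

        while (pfn < end_pfn) {
                struct page *page = pfn_to_page(pfn);
                unsigned int order;

                if (!PageBuddy(page)) {
                        pfn++;
                        continue;
                }

                order = page_order(page);
                list_move(&page->lru,
                          &zone->free_area[order].free_list[MIGRATE_ISOLATE]);
                __ClearPageBuddy(page);         /* order stays in page->private */
                __SetPageIsolated(page);        /* hypothetical new marker */
                nr_moved += 1UL << order;
                pfn += 1UL << order;
        }

        /* isolated pages are no longer accounted as freepages */
        __mod_zone_page_state(zone, NR_FREE_PAGES, -(long)nr_moved);
        return nr_moved;
}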

Unisolation can be done by a similar approach.

I made a prototype of this approach, and it is less intrusive to the
core allocator than my previous patchset.

Makes sense?

Thanks.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-15 10:00                 ` Peter Zijlstra
@ 2014-07-16  8:44                   ` Joonsoo Kim
  0 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-16  8:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vlastimil Babka, Andrew Morton, Kirill A. Shutemov, Rik van Riel,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim,
	Linux Memory Management List, Lisa Du, LKML

On Tue, Jul 15, 2014 at 12:00:13PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 15, 2014 at 06:39:20PM +0900, Joonsoo Kim wrote:
> > >>>>>>>>
> 
> tl;dr, if you want me to read your emails, trim them.

Okay. I will do it. Sorry for bothering you.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-16  8:43                 ` Joonsoo Kim
@ 2014-07-16 11:14                   ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-16 11:14 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, Lisa Du,
	linux-kernel

On 07/16/2014 10:43 AM, Joonsoo Kim wrote:
>> I think your plan of multiple parallel CMA allocations (and thus
>> multiple parallel isolations) is also possible. The isolate pcplists
>> can be shared by pages coming from multiple parallel isolations. But
>> the flush operation needs a pfn start/end parameters to only flush
>> pages belonging to the given isolation. That might mean a bit of
>> inefficient list traversing, but I don't think it's a problem.
> 
> I think that special pcplist would cause a problem if we should check
> pfn range. If there are too many pages on this pcplist, move pages from
> this pcplist to isolate freelist takes too long time in irq context and
> system could be broken. This operation cannot be easily stopped because
> it is initiated by IPI on other cpu and starter of this IPI expect that
> all pages on other cpus' pcplist are moved properly when returning
> from on_each_cpu().
> 
> And, if there are so many pages, serious lock contention would happen
> in this case.

Hm I see. So what if it wasn't a special pcplist, but a special "free list"
where the pages would just be linked together as on a pcplist, regardless
of order, and would not merge until the CPU that drives the memory
isolation process decides it is safe to flush them away. That would remove
the need for IPIs and provide the same guarantees, I think.
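
Sketching it (zone->isolated_pages and both helpers are invented names,
and the __free_one_page() signature is abbreviated):

/* free path, under zone->lock, for a page in an isolated pageblock */
static void free_one_isolated_page(struct zone *zone, struct page *page,
                                   unsigned int order)
{
        set_page_private(page, order);  /* remember order, but never merge */
        list_add(&page->lru, &zone->isolated_pages);
        /* not PageBuddy, not counted as freepage, so merging can't see it */
}

/* run by the isolating CPU once it decides flushing is safe */
static void flush_isolated_pages(struct zone *zone, unsigned long start_pfn,
                                 unsigned long end_pfn)
{
        struct page *page, *next;
        unsigned long flags;

        spin_lock_irqsave(&zone->lock, flags);
        list_for_each_entry_safe(page, next, &zone->isolated_pages, lru) {
                unsigned long pfn = page_to_pfn(page);

                if (pfn < start_pfn || pfn >= end_pfn)
                        continue;
                list_del(&page->lru);
                __free_one_page(page, zone, page_private(page),
                                MIGRATE_ISOLATE);
        }
        spin_unlock_irqrestore(&zone->lock, flags);
}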

> Anyway, my idea's key point is using PageIsolated() to distinguish
> isolated page, instead of using PageBuddy(). If page is PageIsolated(),

Is PageIsolated a completely new page flag? Those are a limited resource,
so I would expect some resistance to such an approach. Or a new special
page->_mapcount value? That could maybe work.
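
For reference, PageBuddy() already works exactly this way in
include/linux/mm.h, so a mirror could look like the sketch below (the
isolated value and helpers are invented):

#define PAGE_BUDDY_MAPCOUNT_VALUE       (-128)  /* exists today */
#define PAGE_ISOLATED_MAPCOUNT_VALUE    (-256)  /* hypothetical */

static inline int PageIsolated(struct page *page)
{
        return atomic_read(&page->_mapcount) == PAGE_ISOLATED_MAPCOUNT_VALUE;
}

static inline void __SetPageIsolated(struct page *page)
{
        VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
        atomic_set(&page->_mapcount, PAGE_ISOLATED_MAPCOUNT_VALUE);
}

static inline void __ClearPageIsolated(struct page *page)
{
        VM_BUG_ON_PAGE(!PageIsolated(page), page);
        atomic_set(&page->_mapcount, -1);
}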

> it isn't handled as freepage although it is in buddy allocator. During free,
> page with MIGRATETYPE_ISOLATE will be marked as PageIsolated() and
> won't be merged and counted for freepage.

OK. Preventing wrong merging is the key point and this should work.

> When we move pages from normal buddy list to isolate buddy
> list, we check PageBuddy() and subtract number of PageBuddy() pages

Do we really need to check PageBuddy()? Could a page get marked as
PageIsolated() but still go to a normal list instead of the isolate list?

> from number of freepage. And, change page from PageBuddy() to PageIsolated()
> since it is handled as isolated page at this point. In this way, freepage
> count will be correct.
> 
> Unisolation can be done by similar approach.
> 
> I made prototype of this approach and it isn't intrusive to core
> allocator compared to my previous patchset.
> 
> Make sense?

I think so :)

> Thanks.
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-16 11:14                   ` Vlastimil Babka
@ 2014-07-17  6:12                     ` Joonsoo Kim
  -1 siblings, 0 replies; 79+ messages in thread
From: Joonsoo Kim @ 2014-07-17  6:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, Lisa Du,
	linux-kernel

On Wed, Jul 16, 2014 at 01:14:26PM +0200, Vlastimil Babka wrote:
> On 07/16/2014 10:43 AM, Joonsoo Kim wrote:
> >> I think your plan of multiple parallel CMA allocations (and thus
> >> multiple parallel isolations) is also possible. The isolate pcplists
> >> can be shared by pages coming from multiple parallel isolations. But
> >> the flush operation needs a pfn start/end parameters to only flush
> >> pages belonging to the given isolation. That might mean a bit of
> >> inefficient list traversing, but I don't think it's a problem.
> > 
> > I think that special pcplist would cause a problem if we should check
> > pfn range. If there are too many pages on this pcplist, move pages from
> > this pcplist to isolate freelist takes too long time in irq context and
> > system could be broken. This operation cannot be easily stopped because
> > it is initiated by IPI on other cpu and starter of this IPI expect that
> > all pages on other cpus' pcplist are moved properly when returning
> > from on_each_cpu().
> > 
> > And, if there are so many pages, serious lock contention would happen
> > in this case.
> 
> Hm I see. So what if it wasn't a special pcplist, but a special "free list"
> where the pages would be just linked together as on pcplist, regardless of
> order, and would not merge until the CPU that drives the memory isolation
> process decides it is safe to flush them away. That would remove the need for
> IPI's and provide the same guarantees I think.

Looks good. It would work. I think your solution is better than mine.
I will implement and test it.

> 
> > Anyway, my idea's key point is using PageIsolated() to distinguish
> > isolated page, instead of using PageBuddy(). If page is PageIsolated(),
> 
> Is PageIsolated a completely new page flag? Those are a limited resource so I
> would expect some resistance to such approach. Or a new special page->_mapcount
> value? That could maybe work.

Yes, it is a new special page->_mapcount value.

> > it isn't handled as freepage although it is in buddy allocator. During free,
> > page with MIGRATETYPE_ISOLATE will be marked as PageIsolated() and
> > won't be merged and counted for freepage.
> 
> OK. Preventing wrong merging is the key point and this should work.
> 
> > When we move pages from normal buddy list to isolate buddy
> > list, we check PageBuddy() and subtract number of PageBuddy() pages
> 
> Do we really need to check PageBuddy()? Could a page get marked as PageIsolate()
> but still go to normal list instead of isolate list?

Checking PageBuddy() is used to identify the pages linked on a normal
list.

Thanks.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/10] fix freepage count problems due to memory isolation
  2014-07-17  6:12                     ` Joonsoo Kim
@ 2014-07-17  9:14                       ` Vlastimil Babka
  -1 siblings, 0 replies; 79+ messages in thread
From: Vlastimil Babka @ 2014-07-17  9:14 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Kirill A. Shutemov, Rik van Riel, Peter Zijlstra,
	Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu,
	Zhang Yanfei, Srivatsa S. Bhat, Tang Chen, Naoya Horiguchi,
	Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski,
	Michal Nazarewicz, Laura Abbott, Heesub Shin, Aneesh Kumar K.V,
	Ritesh Harjani, t.stanislaws, Gioh Kim, linux-mm, Lisa Du,
	linux-kernel

On 07/17/2014 08:12 AM, Joonsoo Kim wrote:
>>
>> Hm I see. So what if it wasn't a special pcplist, but a special "free list"
>> where the pages would be just linked together as on pcplist, regardless of
>> order, and would not merge until the CPU that drives the memory isolation
>> process decides it is safe to flush them away. That would remove the need for
>> IPI's and provide the same guarantees I think.
>
> Looks good. It would work. I think that your solution is better than mine.
> I will implement it and test.

Thanks. But maybe there's still a good use for marking pages specially
as isolated, rather than as PageBuddy with freepage_migratetype set to
MIGRATE_ISOLATE. Why do we buddy-merge them anyway? I don't think CMA or
memory offlining benefits from that, and it only makes the code more
complex.
So maybe we could split isolated buddy pages into order-0 pages marked
as PageIsolated and keep them on a single per-zone list instead of the
order-based freelists?
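
Something like this, perhaps (a sketch; the helper and
zone->isolated_pages are invented, and __SetPageIsolated() is the
hypothetical _mapcount marker discussed earlier):

/* break an isolated buddy page into order-0 pages on one per-zone list */
static void split_page_to_isolated(struct zone *zone, struct page *page,
                                   unsigned int order)
{
        unsigned long i, nr_pages = 1UL << order;

        for (i = 0; i < nr_pages; i++) {
                struct page *p = page + i;

                __SetPageIsolated(p);
                list_add(&p->lru, &zone->isolated_pages);
        }
        /* the order-based freelists never see these pages, so the
         * merging code needs no MIGRATE_ISOLATE special cases at all */
}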

>> Do we really need to check PageBuddy()? Could a page get marked as
>> PageIsolated() but still go to a normal list instead of the isolate list?
>
> Checking PageBuddy() is used for identifying page linked in normal
> list.

Ah right, forgot it walks by pfn scanning, not by traversing the free list.

> Thanks.

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2014-07-17  9:28 UTC | newest]

Thread overview: 79+ messages
2014-07-04  7:57 [PATCH 00/10] fix freepage count problems due to memory isolation Joonsoo Kim
2014-07-04  7:57 ` [PATCH 01/10] mm/page_alloc: remove unlikely macro on free_one_page() Joonsoo Kim
2014-07-04 12:03   ` Vlastimil Babka
2014-07-07  4:58     ` Joonsoo Kim
2014-07-04 12:52   ` Michal Nazarewicz
2014-07-04 12:53   ` Michal Nazarewicz
2014-07-04  7:57 ` [PATCH 02/10] mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC Joonsoo Kim
2014-07-07 14:50   ` Vlastimil Babka
2014-07-04  7:57 ` [PATCH 03/10] mm/page_alloc: handle page on pcp correctly if it's pageblock is isolated Joonsoo Kim
2014-07-07  8:25   ` Gioh Kim
2014-07-08  7:18     ` Vlastimil Babka
2014-07-07 15:19   ` Vlastimil Babka
2014-07-14  6:24     ` Joonsoo Kim
2014-07-04  7:57 ` [PATCH 04/10] mm/page_alloc: carefully free the page on isolate pageblock Joonsoo Kim
2014-07-07 15:43   ` Vlastimil Babka
2014-07-04  7:57 ` [PATCH 05/10] mm/page_alloc: optimize and unify pageblock migratetype check in free path Joonsoo Kim
2014-07-07 15:50   ` Vlastimil Babka
2014-07-14  6:28     ` Joonsoo Kim
2014-07-04  7:57 ` [PATCH 06/10] mm/page_alloc: separate freepage migratetype interface Joonsoo Kim
2014-07-04  7:57 ` [PATCH 07/10] mm/page_alloc: store migratetype of the buddy list into freepage correctly Joonsoo Kim
2014-07-04  7:57 ` [PATCH 08/10] mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type Joonsoo Kim
2014-07-07 15:57   ` Vlastimil Babka
2014-07-08  1:01     ` Gioh Kim
2014-07-08  7:23       ` Vlastimil Babka
2014-07-14  6:34     ` Joonsoo Kim
2014-07-04  7:57 ` [PATCH 09/10] mm/page_alloc: fix possible wrongly calculated freepage counter Joonsoo Kim
2014-07-04  7:57 ` [PATCH 10/10] mm/page_alloc: Stop merging pages on non-isolate and isolate buddy list Joonsoo Kim
2014-07-04 15:33 ` [PATCH 00/10] fix freepage count problems due to memory isolation Vlastimil Babka
2014-07-07  4:49   ` Joonsoo Kim
2014-07-07 14:33     ` Vlastimil Babka
2014-07-14  6:22       ` Joonsoo Kim
2014-07-14  9:49         ` Vlastimil Babka
2014-07-15  8:28           ` Joonsoo Kim
2014-07-15  8:36             ` Vlastimil Babka
2014-07-15  9:39               ` Joonsoo Kim
2014-07-15 10:00                 ` Peter Zijlstra
2014-07-16  8:44                   ` Joonsoo Kim
2014-07-16  8:43               ` Joonsoo Kim
2014-07-16 11:14                 ` Vlastimil Babka
2014-07-17  6:12                   ` Joonsoo Kim
2014-07-17  9:14                     ` Vlastimil Babka
