* [RFC PATCH 0/5] pro-active compaction
@ 2017-01-13  7:14 js1304
  2017-01-13  7:14 ` [RFC PATCH 1/5] mm/vmstat: retrieve suitable free pageblock information just once js1304
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: js1304 @ 2017-01-13  7:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-mm, Vlastimil Babka, David Rientjes,
	Mel Gorman, Johannes Weiner, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Hello,

This is a patchset for pro-active compaction to reduce fragmentation.
It is just an RFC patchset, so the implementation details are not polished.
I am submitting it for people who want to check the effect of pro-active
compaction.

Patches 1 ~ 4 introduce a new metric for checking fragmentation. I think
this new metric is useful for checking the fragmentation state
regardless of the usefulness of pro-active compaction. Please let me know
if you find this new metric useful; I'd like to submit it separately.

Any feedback is more than welcome.

Thanks.

Joonsoo Kim (5):
  mm/vmstat: retrieve suitable free pageblock information just once
  mm/vmstat: rename variables/functions about buddyinfo
  mm: introduce exponential moving average to unusable free index
  mm/vmstat: introduce /proc/fraginfo to get fragmentation stat stably
  mm/compaction: run the compaction whenever fragmentation ratio exceeds
    the threshold

 include/linux/mmzone.h |   3 +
 mm/compaction.c        | 280 +++++++++++++++++++++++++++++++++++++++++++++++--
 mm/internal.h          |  21 ++++
 mm/page_alloc.c        |  33 ++++++
 mm/vmstat.c            | 101 ++++++++++++------
 5 files changed, 397 insertions(+), 41 deletions(-)

-- 
1.9.1


* [RFC PATCH 1/5] mm/vmstat: retrieve suitable free pageblock information just once
  2017-01-13  7:14 [RFC PATCH 0/5] pro-active compaction js1304
@ 2017-01-13  7:14 ` js1304
  2017-01-19 10:47   ` Vlastimil Babka
  2017-01-19 11:51   ` Michal Hocko
  2017-01-13  7:14 ` [RFC PATCH 2/5] mm/vmstat: rename variables/functions about buddyinfo js1304
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 15+ messages in thread
From: js1304 @ 2017-01-13  7:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-mm, Vlastimil Babka, David Rientjes,
	Mel Gorman, Johannes Weiner, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

It's inefficient to retrieve buddy information for the fragmentation index
calculation separately for every order. By using some stack memory, we can
retrieve it once and reuse it to compute all the required values. MAX_ORDER
is usually small enough that there is no real risk of stack overflow.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
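A minimal userspace-style sketch (not kernel code; the nr_free values
are made up) of the one-pass cumulative counting described above, i.e.
what the reworked fill_contig_page_info() computes per order:

#include <stdio.h>

#define MAX_ORDER 11

int main(void)
{
	/* hypothetical per-order free block counts, like free_area[].nr_free */
	unsigned long nr_free[MAX_ORDER] = { 512, 256, 64, 32, 8, 4, 2, 1, 0, 0, 1 };
	unsigned long free_blocks_order[MAX_ORDER];
	unsigned long free_pages = 0;
	int order;

	/* walk the orders once, from the highest down */
	for (order = MAX_ORDER - 1; order >= 0; order--) {
		free_pages += nr_free[order] << order;
		free_blocks_order[order] = nr_free[order];
		/* one order-(N+1) block can serve two order-N requests */
		if (order < MAX_ORDER - 1)
			free_blocks_order[order] += free_blocks_order[order + 1] << 1;
	}

	for (order = 0; order < MAX_ORDER; order++)
		printf("order %2d: suitable free blocks %lu\n",
		       order, free_blocks_order[order]);
	printf("total free pages: %lu\n", free_pages);
	return 0;
}
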
 mm/vmstat.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7c28df3..e1ca5eb 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -821,7 +821,7 @@ unsigned long node_page_state(struct pglist_data *pgdat,
 struct contig_page_info {
 	unsigned long free_pages;
 	unsigned long free_blocks_total;
-	unsigned long free_blocks_suitable;
+	unsigned long free_blocks_order[MAX_ORDER];
 };
 
 /*
@@ -833,16 +833,14 @@ struct contig_page_info {
  * figured out from userspace
  */
 static void fill_contig_page_info(struct zone *zone,
-				unsigned int suitable_order,
 				struct contig_page_info *info)
 {
 	unsigned int order;
 
 	info->free_pages = 0;
 	info->free_blocks_total = 0;
-	info->free_blocks_suitable = 0;
 
-	for (order = 0; order < MAX_ORDER; order++) {
+	for (order = MAX_ORDER - 1; order >= 0 && order < MAX_ORDER; order--) {
 		unsigned long blocks;
 
 		/* Count number of free blocks */
@@ -851,11 +849,12 @@ static void fill_contig_page_info(struct zone *zone,
 
 		/* Count free base pages */
 		info->free_pages += blocks << order;
+		info->free_blocks_order[order] = blocks;
+		if (order == MAX_ORDER - 1)
+			continue;
 
-		/* Count the suitable free blocks */
-		if (order >= suitable_order)
-			info->free_blocks_suitable += blocks <<
-						(order - suitable_order);
+		info->free_blocks_order[order] +=
+			(info->free_blocks_order[order + 1] << 1);
 	}
 }
 
@@ -874,7 +873,7 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in
 		return 0;
 
 	/* Fragmentation index only makes sense when a request would fail */
-	if (info->free_blocks_suitable)
+	if (info->free_blocks_order[order])
 		return -1000;
 
 	/*
@@ -891,7 +890,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 {
 	struct contig_page_info info;
 
-	fill_contig_page_info(zone, order, &info);
+	fill_contig_page_info(zone, &info);
 	return __fragmentation_index(order, &info);
 }
 #endif
@@ -1811,7 +1810,7 @@ static int unusable_free_index(unsigned int order,
 	 * 0 => no fragmentation
 	 * 1 => high fragmentation
 	 */
-	return div_u64((info->free_pages - (info->free_blocks_suitable << order)) * 1000ULL, info->free_pages);
+	return div_u64((info->free_pages - (info->free_blocks_order[order] << order)) * 1000ULL, info->free_pages);
 
 }
 
@@ -1825,8 +1824,8 @@ static void unusable_show_print(struct seq_file *m,
 	seq_printf(m, "Node %d, zone %8s ",
 				pgdat->node_id,
 				zone->name);
+	fill_contig_page_info(zone, &info);
 	for (order = 0; order < MAX_ORDER; ++order) {
-		fill_contig_page_info(zone, order, &info);
 		index = unusable_free_index(order, &info);
 		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
 	}
@@ -1887,8 +1886,8 @@ static void extfrag_show_print(struct seq_file *m,
 	seq_printf(m, "Node %d, zone %8s ",
 				pgdat->node_id,
 				zone->name);
+	fill_contig_page_info(zone, &info);
 	for (order = 0; order < MAX_ORDER; ++order) {
-		fill_contig_page_info(zone, order, &info);
 		index = __fragmentation_index(order, &info);
 		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
 	}
-- 
1.9.1


* [RFC PATCH 2/5] mm/vmstat: rename variables/functions about buddyinfo
  2017-01-13  7:14 [RFC PATCH 0/5] pro-active compaction js1304
  2017-01-13  7:14 ` [RFC PATCH 1/5] mm/vmstat: retrieve suitable free pageblock information just once js1304
@ 2017-01-13  7:14 ` js1304
  2017-01-13  7:14 ` [RFC PATCH 3/5] mm: introduce exponential moving average to unusable free index js1304
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: js1304 @ 2017-01-13  7:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-mm, Vlastimil Babka, David Rientjes,
	Mel Gorman, Johannes Weiner, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

A following patch will introduce an interface for fragmentation information,
and the "frag" prefix would be more suitable for that.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/vmstat.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index e1ca5eb..cd0c331 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1138,7 +1138,7 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
 #endif
 
 #ifdef CONFIG_PROC_FS
-static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
+static void buddyinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 						struct zone *zone)
 {
 	int order;
@@ -1152,10 +1152,10 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
 /*
  * This walks the free areas for each zone.
  */
-static int frag_show(struct seq_file *m, void *arg)
+static int buddyinfo_show(struct seq_file *m, void *arg)
 {
 	pg_data_t *pgdat = (pg_data_t *)arg;
-	walk_zones_in_node(m, pgdat, frag_show_print);
+	walk_zones_in_node(m, pgdat, buddyinfo_show_print);
 	return 0;
 }
 
@@ -1300,20 +1300,20 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
 	return 0;
 }
 
-static const struct seq_operations fragmentation_op = {
+static const struct seq_operations buddyinfo_op = {
 	.start	= frag_start,
 	.next	= frag_next,
 	.stop	= frag_stop,
-	.show	= frag_show,
+	.show	= buddyinfo_show,
 };
 
-static int fragmentation_open(struct inode *inode, struct file *file)
+static int buddyinfo_open(struct inode *inode, struct file *file)
 {
-	return seq_open(file, &fragmentation_op);
+	return seq_open(file, &buddyinfo_op);
 }
 
-static const struct file_operations fragmentation_file_operations = {
-	.open		= fragmentation_open,
+static const struct file_operations buddyinfo_file_operations = {
+	.open		= buddyinfo_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= seq_release,
@@ -1781,7 +1781,7 @@ static int __init setup_vmstat(void)
 	start_shepherd_timer();
 #endif
 #ifdef CONFIG_PROC_FS
-	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
+	proc_create("buddyinfo", S_IRUGO, NULL, &buddyinfo_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
-- 
1.9.1


* [RFC PATCH 3/5] mm: introduce exponential moving average to unusable free index
  2017-01-13  7:14 [RFC PATCH 0/5] pro-active compaction js1304
  2017-01-13  7:14 ` [RFC PATCH 1/5] mm/vmstat: retrieve suitable free pageblock information just once js1304
  2017-01-13  7:14 ` [RFC PATCH 2/5] mm/vmstat: rename variables/functions about buddyinfo js1304
@ 2017-01-13  7:14 ` js1304
  2017-01-19 12:52   ` Vlastimil Babka
  2017-01-13  7:14 ` [RFC PATCH 4/5] mm/vmstat: introduce /proc/fraginfo to get fragmentation stat stably js1304
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: js1304 @ 2017-01-13  7:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-mm, Vlastimil Babka, David Rientjes,
	Mel Gorman, Johannes Weiner, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

We have a statistic for memory fragmentation, but it fluctuates a lot
over very short periods, so it's hard to accurately measure the
system's fragmentation state while a workload is actively running.
Without a stable statistic, it's not possible to determine whether the
system is fragmented or not.

Meanwhile, there have recently been a lot of reports about fragmentation
problems and we tried some changes. However, since there is no way to
measure the fragmentation ratio stably, we cannot tell how much these
changes help with fragmentation.

There are some existing methods to measure fragmentation, but I think
they all have problems.

1. buddyinfo: it fluctuates a lot over very short periods.
2. tracepoint: it shows how stealing happens between buddy lists of
different migratetypes. It indicates fragmentation indirectly but is
not accurate.
3. pageowner: it shows the number of mixed pageblocks, but it is not
suitable for production systems since it requires additional memory.

Therefore, this patch calculates an exponential moving average of the
unusable free index. Since it is a moving average, it is quite stable
even if the fragmentation state of memory fluctuates a lot.

I made this patch 3 months ago and the implementation details don't
look good to me now. Maybe it's better to move the update code out of
the allocation path and make it timer based. Anyway, this patch is
just an RFC.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
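A minimal userspace sketch (not kernel code; the sample sequence is made
up, while WEIGHT and FACTOR mirror the 128 / UNUSABLE_INDEX_FACTOR values
used by the ewma_add() macro below) showing how the fixed-point moving
average reacts only gradually to a change in the sampled index:

#include <stdio.h>

#define WEIGHT 128	/* ~99% of a sample decays away in ~1 min at 10 samples/s */
#define FACTOR 10	/* fixed-point shift, like UNUSABLE_INDEX_FACTOR */

static void ewma_add(unsigned long *ewma, unsigned long val)
{
	*ewma *= WEIGHT - 1;
	*ewma += val << FACTOR;
	*ewma /= WEIGHT;
}

int main(void)
{
	unsigned long avg = 0;	/* kept shifted left by FACTOR for precision */
	int i;

	/* a burst of "highly fragmented" samples (index 900 out of 1000) */
	for (i = 0; i < 600; i++)
		ewma_add(&avg, 900);
	printf("after fragmented burst: %lu\n", avg >> FACTOR);

	/* then "unfragmented" samples; the average only decays gradually */
	for (i = 0; i < 600; i++)
		ewma_add(&avg, 0);
	printf("after recovery:         %lu\n", avg >> FACTOR);
	return 0;
}
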
 include/linux/mmzone.h |  2 ++
 mm/internal.h          | 21 +++++++++++++++++++++
 mm/page_alloc.c        | 32 ++++++++++++++++++++++++++++++++
 mm/vmstat.c            | 16 ++++------------
 4 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 36d9896..94bb4fd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -90,6 +90,7 @@ enum {
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
+	unsigned long		unusable_free_avg;
 };
 
 struct pglist_data;
@@ -447,6 +448,7 @@ struct zone {
 
 	/* free areas of different sizes */
 	struct free_area	free_area[MAX_ORDER];
+	unsigned long		unusable_free_index_updated;
 
 	/* zone flags, see below */
 	unsigned long		flags;
diff --git a/mm/internal.h b/mm/internal.h
index bfad3b5..912df14 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -113,6 +113,12 @@ struct alloc_context {
 	bool spread_dirty_pages;
 };
 
+struct contig_page_info {
+	unsigned long free_pages;
+	unsigned long free_blocks_total;
+	unsigned long free_blocks_order[MAX_ORDER];
+};
+
 #define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
 
 /*
@@ -158,6 +164,21 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
 					gfp_t gfp_flags);
 extern int user_min_free_kbytes;
 
+#define ewma_add(ewma, val, weight, factor)				\
+({									\
+	(ewma) *= (weight) - 1;						\
+	(ewma) += (val) << factor;					\
+	(ewma) /= (weight);						\
+	(ewma) >> factor;						\
+})
+
+#define UNUSABLE_INDEX_FACTOR (10)
+
+extern void fill_contig_page_info(struct zone *zone,
+				struct contig_page_info *info);
+extern int unusable_free_index(unsigned int order,
+				struct contig_page_info *info);
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46ad035..5a22708 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -754,6 +754,32 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
+static void update_unusable_free_index(struct zone *zone)
+{
+	struct contig_page_info info;
+	unsigned long val;
+	unsigned int order;
+	struct free_area *free_area;
+
+	do {
+		if (unlikely(time_before(jiffies,
+			zone->unusable_free_index_updated + HZ / 10)))
+			return;
+
+		fill_contig_page_info(zone, &info);
+		for (order = 0; order < MAX_ORDER; order++) {
+			free_area = &zone->free_area[order];
+
+			val = unusable_free_index(order, &info);
+			/* decay value contribution by 99% in 1 min */
+			ewma_add(free_area->unusable_free_avg, val,
+					128, UNUSABLE_INDEX_FACTOR);
+		}
+
+		zone->unusable_free_index_updated = jiffies + HZ / 10;
+	} while (1);
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -878,6 +904,8 @@ static inline void __free_one_page(struct page *page,
 	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
 out:
 	zone->free_area[order].nr_free++;
+
+	update_unusable_free_index(zone);
 }
 
 /*
@@ -1802,6 +1830,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
 		set_pcppage_migratetype(page, migratetype);
+		update_unusable_free_index(zone);
 		return page;
 	}
 
@@ -2174,6 +2203,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 		 * fallback only via special __rmqueue_cma_fallback() function
 		 */
 		set_pcppage_migratetype(page, start_migratetype);
+		update_unusable_free_index(zone);
 
 		trace_mm_page_alloc_extfrag(page, order, current_order,
 			start_migratetype, fallback_mt);
@@ -5127,7 +5157,9 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
+		zone->free_area[order].unusable_free_avg = 0;
 	}
+	zone->unusable_free_index_updated = jiffies;
 }
 
 #ifndef __HAVE_ARCH_MEMMAP_INIT
diff --git a/mm/vmstat.c b/mm/vmstat.c
index cd0c331..0b218d9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -816,14 +816,6 @@ unsigned long node_page_state(struct pglist_data *pgdat,
 }
 #endif
 
-#ifdef CONFIG_COMPACTION
-
-struct contig_page_info {
-	unsigned long free_pages;
-	unsigned long free_blocks_total;
-	unsigned long free_blocks_order[MAX_ORDER];
-};
-
 /*
  * Calculate the number of free pages in a zone, how many contiguous
  * pages are free and how many are large enough to satisfy an allocation of
@@ -832,7 +824,7 @@ struct contig_page_info {
  * migrated. Calculating that is possible, but expensive and can be
  * figured out from userspace
  */
-static void fill_contig_page_info(struct zone *zone,
+void fill_contig_page_info(struct zone *zone,
 				struct contig_page_info *info)
 {
 	unsigned int order;
@@ -858,6 +850,7 @@ static void fill_contig_page_info(struct zone *zone,
 	}
 }
 
+#ifdef CONFIG_COMPACTION
 /*
  * A fragmentation index only makes sense if an allocation of a requested
  * size would fail. If that is true, the fragmentation index indicates
@@ -1790,13 +1783,11 @@ static int __init setup_vmstat(void)
 }
 module_init(setup_vmstat)
 
-#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_COMPACTION)
-
 /*
  * Return an index indicating how much of the available free memory is
  * unusable for an allocation of the requested size.
  */
-static int unusable_free_index(unsigned int order,
+int unusable_free_index(unsigned int order,
 				struct contig_page_info *info)
 {
 	/* No free memory is interpreted as all free memory is unusable */
@@ -1814,6 +1805,7 @@ static int unusable_free_index(unsigned int order,
 
 }
 
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_COMPACTION)
 static void unusable_show_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
-- 
1.9.1


* [RFC PATCH 4/5] mm/vmstat: introduce /proc/fraginfo to get fragmentation stat stably
  2017-01-13  7:14 [RFC PATCH 0/5] pro-active compaction js1304
                   ` (2 preceding siblings ...)
  2017-01-13  7:14 ` [RFC PATCH 3/5] mm: introduce exponential moving average to unusable free index js1304
@ 2017-01-13  7:14 ` js1304
  2017-01-13  7:14 ` [RFC PATCH 5/5] mm/compaction: run the compaction whenever fragmentation ratio exceeds the threshold js1304
  2017-01-13  9:24 ` [RFC PATCH 0/5] pro-active compaction Michal Hocko
  5 siblings, 0 replies; 15+ messages in thread
From: js1304 @ 2017-01-13  7:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-mm, Vlastimil Babka, David Rientjes,
	Mel Gorman, Johannes Weiner, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
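As an illustration of the interface added below (the numbers are made
up), reading /proc/fraginfo prints one line per zone with one averaged
unusable free index field per order:

Node 0, zone   Normal 0.000 0.004 0.011 0.038 0.092 0.163 0.257 0.388 0.512 0.677 0.815
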
 mm/vmstat.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0b218d9..9e5a862 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1448,6 +1448,47 @@ static int zoneinfo_open(struct inode *inode, struct file *file)
 	.release	= seq_release,
 };
 
+static void fraginfo_show_print(struct seq_file *m, pg_data_t *pgdat,
+						struct zone *zone)
+{
+	int order;
+	int index;
+
+	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		index = zone->free_area[order].unusable_free_avg /
+			(1 << UNUSABLE_INDEX_FACTOR);
+		seq_printf(m, "0.%03d ", index);
+	}
+	seq_putc(m, '\n');
+}
+
+static int fraginfo_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+	walk_zones_in_node(m, pgdat, fraginfo_show_print);
+	return 0;
+}
+
+static const struct seq_operations fraginfo_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= fraginfo_show,
+};
+
+static int fraginfo_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &fraginfo_op);
+}
+
+static const struct file_operations fraginfo_file_operations = {
+	.open		= fraginfo_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 enum writeback_stat_item {
 	NR_DIRTY_THRESHOLD,
 	NR_DIRTY_BG_THRESHOLD,
@@ -1778,6 +1819,7 @@ static int __init setup_vmstat(void)
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
+	proc_create("fraginfo", S_IRUGO, NULL, &fraginfo_file_operations);
 #endif
 	return 0;
 }
-- 
1.9.1


* [RFC PATCH 5/5] mm/compaction: run the compaction whenever fragmentation ratio exceeds the threshold
  2017-01-13  7:14 [RFC PATCH 0/5] pro-active compaction js1304
                   ` (3 preceding siblings ...)
  2017-01-13  7:14 ` [RFC PATCH 4/5] mm/vmstat: introduce /proc/fraginfo to get fragmentation stat stably js1304
@ 2017-01-13  7:14 ` js1304
  2017-01-19 13:39   ` Vlastimil Babka
  2017-01-13  9:24 ` [RFC PATCH 0/5] pro-active compaction Michal Hocko
  5 siblings, 1 reply; 15+ messages in thread
From: js1304 @ 2017-01-13  7:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-mm, Vlastimil Babka, David Rientjes,
	Mel Gorman, Johannes Weiner, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Until now, we invoke compaction only when an allocation request stalls
due to the lack of a high-order freepage. This is effective since we
usually don't need high-order freepages and the cost of maintaining
them is quite high. However, it increases the latency of high-order
allocation requests and decreases the success rate when the allocation
request cannot use reclaim/compaction. Since there are some workloads
that require high-order freepages to boost performance, preparing
high-order freepages in advance is a matter of trade-off. Right now
there is no way to prepare high-order freepages, so we cannot consider
this trade-off. Therefore, this patch introduces a way to invoke
compaction when necessary to manage the trade-off.

The implementation is very simple. There is a threshold for invoking
full compaction. If the fragmentation ratio reaches this threshold for
a given order, we ask kcompactd for full compaction with the hope that
it restores the fragmentation ratio.

If the fragmentation ratio is unchanged or worse after full compaction,
further compaction attempts would not be useful. So, this patch stops
full compaction in this case until the situation changes, to avoid
useless compaction effort.

For now, there is no sophisticated code to detect that the situation
has changed. kcompactd's full compaction is re-enabled when a lower
order triggers a kcompactd wake-up or a time limit (a second) has
passed.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
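A minimal userspace sketch (not kernel code; the kernel side uses
jiffies and the wrap-safe time_after()) of the exponential back-off
described above, where the delay doubles with each consecutive failed
full compaction:

#include <stdbool.h>
#include <stdio.h>

#define HZ 100

struct backoff {
	int failed;			/* consecutive failed full compactions */
	unsigned long failed_time;	/* tick at which the last failure happened */
};

/* skip a new full compaction attempt while still backing off */
static bool should_skip(const struct backoff *b, unsigned long now)
{
	unsigned long recharge_time;

	if (!b->failed)
		return false;
	/* HZ, 2*HZ, 4*HZ, ... ticks of back-off */
	recharge_time = b->failed_time + HZ * (1UL << b->failed);
	return now < recharge_time;
}

int main(void)
{
	struct backoff b = { .failed = 2, .failed_time = 1000 };
	unsigned long t;

	for (t = 1000; t <= 1600; t += 200)
		printf("t=%lu -> %s\n", t, should_skip(&b, t) ? "skip" : "try");
	return 0;
}
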
 include/linux/mmzone.h |   1 +
 mm/compaction.c        | 280 +++++++++++++++++++++++++++++++++++++++++++++++--
 mm/page_alloc.c        |   1 +
 3 files changed, 275 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 94bb4fd..6029335 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -639,6 +639,7 @@ struct zonelist {
 	enum zone_type kcompactd_classzone_idx;
 	wait_queue_head_t kcompactd_wait;
 	struct task_struct *kcompactd;
+	void *kcompactd_state;
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	/* Lock serializing the migrate rate limiting window */
diff --git a/mm/compaction.c b/mm/compaction.c
index 949198d..58536c1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1848,6 +1848,87 @@ void compaction_unregister_node(struct node *node)
 }
 #endif /* CONFIG_SYSFS && CONFIG_NUMA */
 
+#define KCOMPACTD_INDEX_GAP (200)
+
+struct kcompactd_zone_state {
+	int target_order;
+	int target_ratio;
+	int failed;
+	unsigned long failed_time;
+	struct contig_page_info info;
+};
+
+struct kcompactd_state {
+	struct kcompactd_zone_state zone_state[MAX_NR_ZONES];
+};
+
+static int kcompactd_order;
+static unsigned int kcompactd_ratio;
+
+static ssize_t order_show(struct kobject *kobj,
+			struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", kcompactd_order);
+}
+
+static ssize_t order_store(struct kobject *kobj,
+			struct kobj_attribute *attr,
+			const char *buf, size_t count)
+{
+	int order;
+	int ret;
+
+	ret = kstrtoint(buf, 10, &order);
+	if (ret)
+		return -EINVAL;
+
+	/* kcompactd's compaction will be disabled when order is -1 */
+	if (order >= MAX_ORDER || order < -1)
+		return -EINVAL;
+
+	kcompactd_order = order;
+	return count;
+}
+
+static ssize_t ratio_show(struct kobject *kobj,
+			struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", kcompactd_ratio);
+}
+
+static ssize_t ratio_store(struct kobject *kobj,
+			struct kobj_attribute *attr,
+			const char *buf, size_t count)
+{
+	unsigned int ratio;
+	int ret;
+
+	ret = kstrtouint(buf, 10, &ratio);
+	if (ret)
+		return -EINVAL;
+
+	if (ratio > 1000)
+		return -EINVAL;
+
+	kcompactd_ratio = ratio;
+	return count;
+}
+
+static struct kobj_attribute order_attr = __ATTR_RW(order);
+static struct kobj_attribute ratio_attr = __ATTR_RW(ratio);
+
+static struct attribute *kcompactd_attrs[] = {
+	&order_attr.attr,
+	&ratio_attr.attr,
+	NULL,
+};
+
+static struct attribute_group kcompactd_attr_group = {
+	.attrs = kcompactd_attrs,
+	.name = "kcompactd",
+};
+
+
 static inline bool kcompactd_work_requested(pg_data_t *pgdat)
 {
 	return pgdat->kcompactd_max_order > 0 || kthread_should_stop();
@@ -1858,6 +1939,11 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 	int zoneid;
 	struct zone *zone;
 	enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx;
+	int order;
+
+	order = pgdat->kcompactd_max_order;
+	if (order == INT_MAX)
+		order = -1;
 
 	for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
 		zone = &pgdat->node_zones[zoneid];
@@ -1865,14 +1951,116 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 		if (!populated_zone(zone))
 			continue;
 
-		if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
-					classzone_idx) == COMPACT_CONTINUE)
+		if (compaction_suitable(zone, order, 0, classzone_idx)
+			== COMPACT_CONTINUE)
 			return true;
 	}
 
 	return false;
 }
 
+static int kcompactd_check_ratio(pg_data_t *pgdat, int zoneid)
+{
+	int i;
+	int unusable_free_avg;
+	struct zone *zone;
+	struct kcompactd_state *state;
+	struct kcompactd_zone_state *zone_state;
+	struct contig_page_info info;
+	int index;
+
+	state = pgdat->kcompactd_state;
+	zone_state = &state->zone_state[zoneid];
+	zone = &pgdat->node_zones[zoneid];
+
+	fill_contig_page_info(zone, &info);
+	for (i = PAGE_ALLOC_COSTLY_ORDER + 1; i <= kcompactd_order; i++) {
+		unusable_free_avg = zone->free_area[i].unusable_free_avg >>
+					UNUSABLE_INDEX_FACTOR;
+
+		if (unusable_free_avg >= kcompactd_ratio)
+			return i;
+
+		index = unusable_free_index(i, &info);
+		if (index >= kcompactd_ratio &&
+			(kcompactd_ratio > unusable_free_avg + KCOMPACTD_INDEX_GAP))
+			return i;
+	}
+
+	return -1;
+}
+
+static void kcompactd_check_result(pg_data_t *pgdat, int classzone_idx)
+{
+	int zoneid;
+	struct zone *zone;
+	struct kcompactd_state *state;
+	struct kcompactd_zone_state *zone_state;
+	int unusable_free_avg;
+	unsigned long flags;
+	int prev_index, curr_index;
+
+	for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
+		zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		state = pgdat->kcompactd_state;
+		zone_state = &state->zone_state[zoneid];
+		unusable_free_avg =
+			zone->free_area[zone_state->target_order].unusable_free_avg >>
+				UNUSABLE_INDEX_FACTOR;
+		if (unusable_free_avg < zone_state->target_ratio) {
+			zone_state->failed = 0;
+			continue;
+		}
+
+		prev_index = unusable_free_index(zone_state->target_order,
+						&zone_state->info);
+		spin_lock_irqsave(&zone->lock, flags);
+		fill_contig_page_info(zone, &zone_state->info);
+		spin_unlock_irqrestore(&zone->lock, flags);
+
+		curr_index = unusable_free_index(zone_state->target_order,
+						&zone_state->info);
+		if (curr_index < zone_state->target_ratio ||
+			curr_index < prev_index) {
+			zone_state->failed = 0;
+			continue;
+		}
+
+		zone_state->failed++;
+		zone_state->failed_time = jiffies;
+	}
+}
+
+static bool kcompactd_should_skip(pg_data_t *pgdat, int classzone_idx)
+{
+	struct kcompactd_state *state;
+	struct kcompactd_zone_state *zone_state;
+	int target_order;
+	unsigned long recharge_time;
+
+	target_order = kcompactd_check_ratio(pgdat, classzone_idx);
+	if (target_order < 0)
+		return true;
+
+	state = pgdat->kcompactd_state;
+	zone_state = &state->zone_state[classzone_idx];
+	if (!zone_state->failed)
+		return false;
+
+	if (target_order < zone_state->target_order)
+		return false;
+
+	recharge_time = zone_state->failed_time;
+	recharge_time += HZ * (1 << zone_state->failed);
+	if (time_after(jiffies, recharge_time))
+		return false;
+
+	return true;
+}
+
 static void kcompactd_do_work(pg_data_t *pgdat)
 {
 	/*
@@ -1880,6 +2068,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 	 * order is allocatable.
 	 */
 	int zoneid;
+	int cpu;
 	struct zone *zone;
 	struct compact_control cc = {
 		.order = pgdat->kcompactd_max_order,
@@ -1889,10 +2078,19 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		.gfp_mask = GFP_KERNEL,
 
 	};
+	struct kcompactd_state *state;
+	struct kcompactd_zone_state *zone_state;
+	unsigned long flags;
 	trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
 							cc.classzone_idx);
 	count_vm_event(KCOMPACTD_WAKE);
 
+	/* Force to run full compaction */
+	if (cc.order == INT_MAX) {
+		cc.order = -1;
+		cc.whole_zone = true;
+	}
+
 	for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) {
 		int status;
 
@@ -1915,8 +2113,29 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 
 		if (kthread_should_stop())
 			return;
+
+		if (is_via_compact_memory(cc.order)) {
+			state = pgdat->kcompactd_state;
+			zone_state = &state->zone_state[zoneid];
+			zone_state->target_order =
+				kcompactd_check_ratio(pgdat, zoneid);
+			zone_state->target_ratio = kcompactd_ratio;
+			if (zone_state->target_order < 0)
+				continue;
+
+			spin_lock_irqsave(&zone->lock, flags);
+			fill_contig_page_info(zone, &zone_state->info);
+			spin_unlock_irqrestore(&zone->lock, flags);
+		}
+
 		status = compact_zone(zone, &cc);
 
+		VM_BUG_ON(!list_empty(&cc.freepages));
+		VM_BUG_ON(!list_empty(&cc.migratepages));
+
+		if (is_via_compact_memory(cc.order))
+			continue;
+
 		if (status == COMPACT_SUCCESS) {
 			compaction_defer_reset(zone, cc.order, false);
 		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
@@ -1926,9 +2145,6 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 			 */
 			defer_compaction(zone, cc.order);
 		}
-
-		VM_BUG_ON(!list_empty(&cc.freepages));
-		VM_BUG_ON(!list_empty(&cc.migratepages));
 	}
 
 	/*
@@ -1940,6 +2156,16 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		pgdat->kcompactd_max_order = 0;
 	if (pgdat->kcompactd_classzone_idx >= cc.classzone_idx)
 		pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
+
+	/* Do not invoke compaction immediately if we did full compaction */
+	if (is_via_compact_memory(cc.order)) {
+		pgdat->kcompactd_max_order = 0;
+		cpu = get_cpu();
+		lru_add_drain_cpu(cpu);
+		drain_local_pages(NULL);
+		put_cpu();
+		kcompactd_check_result(pgdat, cc.classzone_idx);
+	}
 }
 
 void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
@@ -1947,6 +2173,11 @@ void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
 	if (!order)
 		return;
 
+	if (order == INT_MAX) {
+		if (kcompactd_should_skip(pgdat, classzone_idx))
+			return;
+	}
+
 	if (pgdat->kcompactd_max_order < order)
 		pgdat->kcompactd_max_order = order;
 
@@ -2000,18 +2231,42 @@ static int kcompactd(void *p)
  */
 int kcompactd_run(int nid)
 {
+	int i;
+	struct kcompactd_state *state;
+	struct kcompactd_zone_state *zone_state;
 	pg_data_t *pgdat = NODE_DATA(nid);
-	int ret = 0;
+	struct zone *zone;
+	int ret = -ENOMEM;
 
 	if (pgdat->kcompactd)
 		return 0;
 
+	state = kzalloc(sizeof(struct kcompactd_state), GFP_KERNEL);
+	if (!state)
+		goto err;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zone = &pgdat->node_zones[i];
+		zone_state = &state->zone_state[i];
+		if (!populated_zone(zone))
+			continue;
+
+		zone_state->failed = 0;
+	}
+
 	pgdat->kcompactd = kthread_run(kcompactd, pgdat, "kcompactd%d", nid);
 	if (IS_ERR(pgdat->kcompactd)) {
-		pr_err("Failed to start kcompactd on node %d\n", nid);
 		ret = PTR_ERR(pgdat->kcompactd);
 		pgdat->kcompactd = NULL;
+		kfree(state);
+		goto err;
 	}
+	pgdat->kcompactd_state = (void *)state;
+
+	return 0;
+
+err:
+	pr_err("Failed to start kcompactd on node %d\n", nid);
 	return ret;
 }
 
@@ -2065,6 +2320,17 @@ static int __init kcompactd_init(void)
 		return ret;
 	}
 
+	kcompactd_order = -1;
+	kcompactd_ratio = 800;
+
+#ifdef CONFIG_SYSFS
+	ret = sysfs_create_group(mm_kobj, &kcompactd_attr_group);
+	if (ret) {
+		pr_err("kcompactd: failed to register sysfs callbacks.\n");
+		return ret;
+	}
+#endif
+
 	for_each_node_state(nid, N_MEMORY)
 		kcompactd_run(nid);
 	return 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5a22708..f3c2099 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -776,6 +776,7 @@ static void update_unusable_free_index(struct zone *zone)
 					128, UNUSABLE_INDEX_FACTOR);
 		}
 
+		wakeup_kcompactd(zone->zone_pgdat, INT_MAX, zone_idx(zone));
 		zone->unusable_free_index_updated = jiffies + HZ / 10;
 	} while (1);
 }
-- 
1.9.1


* Re: [RFC PATCH 0/5] pro-active compaction
  2017-01-13  7:14 [RFC PATCH 0/5] pro-active compaction js1304
                   ` (4 preceding siblings ...)
  2017-01-13  7:14 ` [RFC PATCH 5/5] mm/compaction: run the compaction whenever fragmentation ratio exceeds the threshold js1304
@ 2017-01-13  9:24 ` Michal Hocko
  2017-01-17  0:48   ` Joonsoo Kim
  5 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2017-01-13  9:24 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, linux-mm, Vlastimil Babka, David Rientjes,
	Mel Gorman, Johannes Weiner, Joonsoo Kim

On Fri 13-01-17 16:14:28, Joonsoo Kim wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Hello,
> 
> This is a patchset for pro-active compaction to reduce fragmentation.
> It is a just RFC patchset so implementation detail isn't good.
> I submit this for people who want to check the effect of pro-active
> compaction.
> 
> Patch 1 ~ 4 introduces new metric for checking fragmentation. I think
> that this new metric is useful to check fragmentation state
> regardless of usefulness of pro-active compaction. Please let me know
> if someone see that this new metric is useful. I'd like to submit it,
> separately.

Could you describe this metric from a high level POV please?
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 0/5] pro-active compaction
  2017-01-13  9:24 ` [RFC PATCH 0/5] pro-active compaction Michal Hocko
@ 2017-01-17  0:48   ` Joonsoo Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Joonsoo Kim @ 2017-01-17  0:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, Vlastimil Babka, David Rientjes,
	Mel Gorman, Johannes Weiner

On Fri, Jan 13, 2017 at 10:24:21AM +0100, Michal Hocko wrote:
> On Fri 13-01-17 16:14:28, Joonsoo Kim wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > 
> > Hello,
> > 
> > This is a patchset for pro-active compaction to reduce fragmentation.
> > It is a just RFC patchset so implementation detail isn't good.
> > I submit this for people who want to check the effect of pro-active
> > compaction.
> > 
> > Patch 1 ~ 4 introduces new metric for checking fragmentation. I think
> > that this new metric is useful to check fragmentation state
> > regardless of usefulness of pro-active compaction. Please let me know
> > if someone see that this new metric is useful. I'd like to submit it,
> > separately.
> 
> Could you describe this metric from a high level POV please?

There is some information in the description of patch #3.

Anyway, in summary, it is an exponential moving average of the unusable
free index, which already exists. The unusable free index is the ratio
of free pages that, at a given moment, cannot be used for an allocation
of a specific order. It is easy to understand from the equation below.

unusable_free_index(order N) = 1 -
  (free pages in blocks of order N or higher / total free pages)
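
For example (the numbers are made up): in a zone with 1000 free pages,
400 of which sit in free blocks of order 2 or higher,

unusable_free_index(order 2) = 1 - (400 / 1000) = 0.600

i.e. 60% of the currently free memory cannot serve an order-2 request.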

Thanks.


* Re: [RFC PATCH 1/5] mm/vmstat: retrieve suitable free pageblock information just once
  2017-01-13  7:14 ` [RFC PATCH 1/5] mm/vmstat: retrieve suitable free pageblock information just once js1304
@ 2017-01-19 10:47   ` Vlastimil Babka
  2017-01-23  3:17     ` Joonsoo Kim
  2017-01-19 11:51   ` Michal Hocko
  1 sibling, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2017-01-19 10:47 UTC (permalink / raw)
  To: js1304, Andrew Morton
  Cc: Michal Hocko, linux-mm, David Rientjes, Mel Gorman,
	Johannes Weiner, Joonsoo Kim

On 01/13/2017 08:14 AM, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> It's inefficient to retrieve buddy information for fragmentation index
> calculation on every order. By using some stack memory, we could retrieve
> it once and reuse it to compute all the required values. MAX_ORDER is
> usually small enough so there is no big risk about stack overflow.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Sounds useful regardless of the rest of the series.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

A nit below.

> ---
>  mm/vmstat.c | 25 ++++++++++++-------------
>  1 file changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7c28df3..e1ca5eb 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -821,7 +821,7 @@ unsigned long node_page_state(struct pglist_data *pgdat,
>  struct contig_page_info {
>  	unsigned long free_pages;
>  	unsigned long free_blocks_total;
> -	unsigned long free_blocks_suitable;
> +	unsigned long free_blocks_order[MAX_ORDER];

No need to rename _suitable to _order IMHO. The meaning is still the
same, it's just an array now. To me, the name "free_blocks_order" would
suggest it's just the plain zone->free_area[order].nr_free.


* Re: [RFC PATCH 1/5] mm/vmstat: retrieve suitable free pageblock information just once
  2017-01-13  7:14 ` [RFC PATCH 1/5] mm/vmstat: retrieve suitable free pageblock information just once js1304
  2017-01-19 10:47   ` Vlastimil Babka
@ 2017-01-19 11:51   ` Michal Hocko
  2017-01-19 12:06     ` Vlastimil Babka
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2017-01-19 11:51 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, linux-mm, Vlastimil Babka, David Rientjes,
	Mel Gorman, Johannes Weiner, Joonsoo Kim

On Fri 13-01-17 16:14:29, Joonsoo Kim wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> It's inefficient to retrieve buddy information for fragmentation index
> calculation on every order. By using some stack memory, we could retrieve
> it once and reuse it to compute all the required values. MAX_ORDER is
> usually small enough so there is no big risk about stack overflow.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  mm/vmstat.c | 25 ++++++++++++-------------
>  1 file changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7c28df3..e1ca5eb 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -821,7 +821,7 @@ unsigned long node_page_state(struct pglist_data *pgdat,
>  struct contig_page_info {
>  	unsigned long free_pages;
>  	unsigned long free_blocks_total;
> -	unsigned long free_blocks_suitable;
> +	unsigned long free_blocks_order[MAX_ORDER];
>  };

I haven't looked at the rest of the patch because this has already
raised a red flag.  This will increase the size of the structure quite a
bit, and from a quick look at least compaction_suitable->fragmentation_index
is called with this allocated on the stack, and that can already be pretty
deep in the call chain.
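
(For scale: assuming a 64-bit unsigned long and MAX_ORDER == 11, the
struct grows from 3 * 8 = 24 bytes to (2 + 11) * 8 = 104 bytes per
on-stack instance.)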
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/5] mm/vmstat: retrieve suitable free pageblock information just once
  2017-01-19 11:51   ` Michal Hocko
@ 2017-01-19 12:06     ` Vlastimil Babka
  0 siblings, 0 replies; 15+ messages in thread
From: Vlastimil Babka @ 2017-01-19 12:06 UTC (permalink / raw)
  To: Michal Hocko, js1304
  Cc: Andrew Morton, linux-mm, David Rientjes, Mel Gorman,
	Johannes Weiner, Joonsoo Kim

On 01/19/2017 12:51 PM, Michal Hocko wrote:
> On Fri 13-01-17 16:14:29, Joonsoo Kim wrote:
>> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>
>> It's inefficient to retrieve buddy information for fragmentation index
>> calculation on every order. By using some stack memory, we could retrieve
>> it once and reuse it to compute all the required values. MAX_ORDER is
>> usually small enough so there is no big risk about stack overflow.
>>
>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> ---
>>  mm/vmstat.c | 25 ++++++++++++-------------
>>  1 file changed, 12 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 7c28df3..e1ca5eb 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -821,7 +821,7 @@ unsigned long node_page_state(struct pglist_data *pgdat,
>>  struct contig_page_info {
>>  	unsigned long free_pages;
>>  	unsigned long free_blocks_total;
>> -	unsigned long free_blocks_suitable;
>> +	unsigned long free_blocks_order[MAX_ORDER];
>>  };
> 
> I haven't looked at the rest of the patch becaust this has already
> raised a red flag.  This will increase the size of the structure quite a
> bit and from a quick look at least compaction_suitable->fragmentation_index
> will call with this allocated on the stack and this can be pretty deep
> on the call chain already.

Yeah, but compaction_suitable() is usually called at a point where
you're deciding whether to do more reclaim or compaction in the same
context, and both of those most likely have much larger stacks than this.


* Re: [RFC PATCH 3/5] mm: introduce exponential moving average to unusable free index
  2017-01-13  7:14 ` [RFC PATCH 3/5] mm: introduce exponential moving average to unusable free index js1304
@ 2017-01-19 12:52   ` Vlastimil Babka
  2017-01-23  5:27     ` Joonsoo Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2017-01-19 12:52 UTC (permalink / raw)
  To: js1304, Andrew Morton
  Cc: Michal Hocko, linux-mm, David Rientjes, Mel Gorman,
	Johannes Weiner, Joonsoo Kim

On 01/13/2017 08:14 AM, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> We have a statistic about memory fragmentation but it would be fluctuated
> a lot within very short term so it's hard to accurately measure
> system's fragmentation state while workload is actively running. Without
> stable statistic, it's not possible to determine if the system is
> fragmented or not.
> 
> Meanwhile, recently, there were a lot of reports about fragmentation
> problem and we tried some changes. However, since there is no way
> to measure fragmentation ratio stably, we cannot make sure how these
> changes help the fragmentation.
> 
> There are some methods to measure fragmentation but I think that they
> have some problems.
> 
> 1. buddyinfo: it fluctuated a lot within very short term
> 2. tracepoint: it shows how steal happens between buddylists of different
> migratetype. It means fragmentation indirectly but would not be accurate.
> 3. pageowner: it shows the number of mixed pageblocks but it is not
> suitable for production system since it requires some additional memory.
> 
> Therefore, this patch try to calculate exponential moving average to
> unusable free index. Since it is a moving average, it is quite stable
> even if fragmentation state of memory fluctuate a lot.

I suspect that the fluctuation of the underlying unusable free index
isn't so much because the number of high-order free blocks would
fluctuate, but because of allocation vs reclaim changing the total
number of free blocks, which is used in the equation. Reclaim uses LRU
which I expect to have low correlation with pfn, so the freed pages tend
towards order-0. And the allocation side tries not to split large pages
so it also consumes mostly order-0.
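
For example (made-up numbers): with 100 of 1000 free pages sitting in
order-2-or-higher blocks, the order-2 unusable free index is
1 - 100/1000 = 0.900; if reclaim then frees another 1000 order-0 pages
while the high-order supply stays at 100, the index moves to
1 - 100/2000 = 0.950 even though nothing changed at the higher orders.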

So I would expect just plain free_blocks_order from contig_page_info to
be a good metric without need for averaging, at least for costly orders
and when we have enough free memory - if we are below e.g. the high
(order-0) watermark, then we should let kswapd do its job first anyway
before considering proactive compaction.

> I made this patch 3 month ago and implementation detail looks not
> good to me now. Maybe, it's better to rule out update code in allocation
> path and make it timer based. Anyway, this patch is just for RFC.

Yeah, any hooks in allocation/free hotpaths are going to meet strong
resistance :)


* Re: [RFC PATCH 5/5] mm/compaction: run the compaction whenever fragmentation ratio exceeds the threshold
  2017-01-13  7:14 ` [RFC PATCH 5/5] mm/compaction: run the compaction whenever fragmentation ratio exceeds the threshold js1304
@ 2017-01-19 13:39   ` Vlastimil Babka
  0 siblings, 0 replies; 15+ messages in thread
From: Vlastimil Babka @ 2017-01-19 13:39 UTC (permalink / raw)
  To: js1304, Andrew Morton
  Cc: Michal Hocko, linux-mm, David Rientjes, Mel Gorman,
	Johannes Weiner, Joonsoo Kim

On 01/13/2017 08:14 AM, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Until now, we invoke the compaction whenever allocation request is stall
> due to non-existence of the high order freepage. It is effective since we
> don't need a high order freepage in usual and cost of maintaining
> high order freepages is quite high. However, it increases latency of high
> order allocation request and decreases success rate if allocation request
> cannot use the reclaim/compaction. Since there are some workloads that
> require high order freepage to boost the performance, it is a matter of
> trade-off that we prepares high order freepage in advance. Now, there is
> no way to prepare high order freepages, we cannot consider this trade-off.
> Therefore, this patch introduces a way to invoke the compaction when
> necessary to manage trade-off.
> 
> Implementation is so simple. There is a theshold to invoke the full
> compaction. If fragmentation ratio reaches this threshold in given order,
> we ask the full compaction to kcompactd with a hope that it restores
> fragmentation ratio.
> 
> If fragmentation ratio is unchanged or worse after full compaction,
> further compaction attempt would not be useful. So, this patch
> stops the full compaction in this case until the situation changes
> to avoid useless compaction effort.
> 
> Now, there is no scientific code to detect the situation change.
> kcompactd's full compaction would be re-enabled when lower order
> triggers kcompactd wake-up or time limit (a second) is passed.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

So, as you expected, I'm not thrilled about the tunables :) Nor about the
wakeups from allocator hotpaths. Otherwise I'll wait to discuss the
details until we get some consensus on use cases and metrics.


* Re: [RFC PATCH 1/5] mm/vmstat: retrieve suitable free pageblock information just once
  2017-01-19 10:47   ` Vlastimil Babka
@ 2017-01-23  3:17     ` Joonsoo Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Joonsoo Kim @ 2017-01-23  3:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Michal Hocko, linux-mm, David Rientjes,
	Mel Gorman, Johannes Weiner

On Thu, Jan 19, 2017 at 11:47:09AM +0100, Vlastimil Babka wrote:
> On 01/13/2017 08:14 AM, js1304@gmail.com wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > 
> > It's inefficient to retrieve buddy information for fragmentation index
> > calculation on every order. By using some stack memory, we could retrieve
> > it once and reuse it to compute all the required values. MAX_ORDER is
> > usually small enough so there is no big risk about stack overflow.
> > 
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Sounds useful regardless of the rest of the series.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks! I will submit this patch separately.

> 
> A nit below.
> 
> > ---
> >  mm/vmstat.c | 25 ++++++++++++-------------
> >  1 file changed, 12 insertions(+), 13 deletions(-)
> > 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 7c28df3..e1ca5eb 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -821,7 +821,7 @@ unsigned long node_page_state(struct pglist_data *pgdat,
> >  struct contig_page_info {
> >  	unsigned long free_pages;
> >  	unsigned long free_blocks_total;
> > -	unsigned long free_blocks_suitable;
> > +	unsigned long free_blocks_order[MAX_ORDER];
> 
> No need to rename _suitable to _order IMHO. The meaning is still the
> same, it's just an array now. For me a name "free_blocks_order" would
> suggest it's just simple zone->free_area[order].nr_free.

Okay. Will fix.

Thanks.


* Re: [RFC PATCH 3/5] mm: introduce exponential moving average to unusable free index
  2017-01-19 12:52   ` Vlastimil Babka
@ 2017-01-23  5:27     ` Joonsoo Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Joonsoo Kim @ 2017-01-23  5:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Michal Hocko, linux-mm, David Rientjes,
	Mel Gorman, Johannes Weiner

On Thu, Jan 19, 2017 at 01:52:38PM +0100, Vlastimil Babka wrote:
> On 01/13/2017 08:14 AM, js1304@gmail.com wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > 
> > We have a statistic about memory fragmentation but it would be fluctuated
> > a lot within very short term so it's hard to accurately measure
> > system's fragmentation state while workload is actively running. Without
> > stable statistic, it's not possible to determine if the system is
> > fragmented or not.
> > 
> > Meanwhile, recently, there were a lot of reports about fragmentation
> > problem and we tried some changes. However, since there is no way
> > to measure fragmentation ratio stably, we cannot make sure how these
> > changes help the fragmentation.
> > 
> > There are some methods to measure fragmentation but I think that they
> > have some problems.
> > 
> > 1. buddyinfo: it fluctuated a lot within very short term
> > 2. tracepoint: it shows how steal happens between buddylists of different
> > migratetype. It means fragmentation indirectly but would not be accurate.
> > 3. pageowner: it shows the number of mixed pageblocks but it is not
> > suitable for production system since it requires some additional memory.
> > 
> > Therefore, this patch try to calculate exponential moving average to
> > unusable free index. Since it is a moving average, it is quite stable
> > even if fragmentation state of memory fluctuate a lot.
> 
> I suspect that the fluctuation of the underlying unusable free index
> isn't so much because the number of high-order free blocks would
> fluctuate, but because of allocation vs reclaim changing the total
> number of free blocks, which is used in the equation. Reclaim uses LRU
> which I expect to have low correlation with pfn, so the freed pages tend
> towards order-0. And the allocation side tries not to split large pages
> so it also consumes mostly order-0.

I introduced this metric because I observed fluctuation of the unusable
free index. :)

> 
> So I would expect just plain free_blocks_order from contig_page_info to
> be a good metric without need for averaging, at least for costly orders
> and when we have enough free memory - if we are below e.g. the high
> (order-0) watermark, then we should let kswapd do its job first anyway
> before considering proactive compaction.

Maybe plain free_blocks_order would be stable for order 7 or higher,
but it's better to have a metric that works well for all orders.

Thanks.

