linux-mm.kvack.org archive mirror
* [RESEND PATCH 0/3] mm: Add cache coloring mechanism
@ 2017-08-23 10:02 Łukasz Daniluk
  2017-08-23 10:02 ` [RESEND PATCH 1/3] mm: move free_list selection to dedicated functions Łukasz Daniluk
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Łukasz Daniluk @ 2017-08-23 10:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: dave.hansen, lukasz.anaczkowski, Łukasz Daniluk

Patches resent with the Linux Kernel Mailing List added correctly this time.

This patch series adds a cache coloring mechanism that works alongside the
buddy allocator. The solution is opt-in and disabled by default; when enabled
it interferes only minimally with the default allocation paths, at the cost
of a few added if statements.

Why would such patches be needed? Large caches with low associativity
(direct-mapped or 2-way associative caches) benefit from the solution
the most - it allows near-constant performance over the lifetime of a
system, even as ongoing allocations and deallocations reorder the buddy
lists.
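
For illustration only (this helper is not part of the patches; the real
code is page_cache_color() in patch 2), the cache color of a block of
pages is derived purely from its physical address:

        static unsigned long block_cache_color(unsigned long phys_addr,
                                               unsigned long long cache_size,
                                               unsigned int order)
        {
                /* Size of a 2^order-page block in bytes. */
                unsigned long block_size = PAGE_SIZE << order;

                /* Which group of cache sets this block lands in. */
                return (phys_addr % cache_size) / block_size;
        }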

On a KNL system, the STREAM benchmark with a problem size resulting in
its internal arrays being 16GB in size yields a bandwidth of 336GB/s
after a fresh boot. With the cache coloring patches applied and enabled,
this performance stays near constant (at most a 1.5% drop was observed),
despite running the benchmark multiple times with varying sizes over the
course of days. Without these patches, however, the bandwidth for such
allocations drops to 117GB/s - an irrecoverable performance penalty of
over 65%. Workloads that exceed the configured cache size suffer from
decreased randomization of allocations with cache coloring enabled, but
the effect of cache usage disappears at roughly the same allocation size.

The solution is divided into three patches. The first patch is a
preparatory one that provides an interface for retrieving (information
about) the free lists contained in a particular free_area structure.
The second one (a parallel structure keeping separate list_heads for
each cache color in a given context) presents the general shape of the
solution and works as it is. However, it has serious performance
implications with bigger caches, because the next color to use is found
by a linear search during allocations. The third patch (arranging the
list_heads in an RB tree) aims to improve the solution's performance by
replacing that linear search with an RB tree lookup. While improving
computational performance, it increases the memory cost of the solution.
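
A simplified sketch of the allocation-side selection introduced by the
second patch (the names here are made up; the real code is
cache_color_area_get_free_list() in patch 2, and patch 3 swaps the
linear scan below for an RB tree lookup):

        static struct list_head *pick_colored_list(struct list_head *lists,
                                                   unsigned long *next_color,
                                                   unsigned long nr_colors)
        {
                unsigned long i, color = *next_color;

                /* Scan at most nr_colors lists, wrapping around once. */
                for (i = 0; i < nr_colors; i++) {
                        color = (*next_color + i) % nr_colors;
                        if (!list_empty(&lists[color]))
                                break;
                }

                /* Continue from the following color on the next allocation. */
                *next_color = (color + 1) % nr_colors;

                /* May still be empty; callers check the first entry for NULL. */
                return &lists[color];
        }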


Łukasz Daniluk (3):
  mm: move free_list selection to dedicated functions
  mm: Add page colored allocation path
  mm: Add helper rbtree to search for next cache color

 Documentation/admin-guide/kernel-parameters.txt |   8 +
 include/linux/mmzone.h                          |  12 +-
 mm/compaction.c                                 |   4 +-
 mm/page_alloc.c                                 | 381 ++++++++++++++++++++++--
 mm/vmstat.c                                     |  10 +-
 5 files changed, 383 insertions(+), 32 deletions(-)

-- 
2.13.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RESEND PATCH 1/3] mm: move free_list selection to dedicated functions
  2017-08-23 10:02 [RESEND PATCH 0/3] mm: Add cache coloring mechanism Łukasz Daniluk
@ 2017-08-23 10:02 ` Łukasz Daniluk
  2017-08-23 10:02 ` [RESEND PATCH 2/3] mm: Add page colored allocation path Łukasz Daniluk
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Łukasz Daniluk @ 2017-08-23 10:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: dave.hansen, lukasz.anaczkowski, Łukasz Daniluk

Currently, free_list selection from a particular free_area is done on
the assumption that there is only one free_list per order per
migratetype. This patch is preparation for the page coloring solution,
which utilises multiple free_lists.
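
In other words, callers stop indexing area->free_list[] directly and go
through accessors instead; a before/after sketch of the call pattern
(the hunks below are the authoritative version):

        /* before */
        if (!list_empty(&area->free_list[migratetype]))
                page = list_first_entry(&area->free_list[migratetype],
                                        struct page, lru);

        /* after */
        if (!area_empty(area, order, migratetype))
                page = list_first_entry(area_get_free_list(area, order,
                                                        migratetype),
                                        struct page, lru);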

Signed-off-by: Łukasz Daniluk <lukasz.daniluk@intel.com>
---
 include/linux/mmzone.h |  4 ++++
 mm/compaction.c        |  4 ++--
 mm/page_alloc.c        | 61 +++++++++++++++++++++++++++++++++++++-------------
 mm/vmstat.c            | 10 +++------
 4 files changed, 55 insertions(+), 24 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc14b8b3f6ce..04128890a684 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -910,6 +910,10 @@ extern struct pglist_data contig_page_data;
 
 #endif /* !CONFIG_NEED_MULTIPLE_NODES */
 
+extern bool area_empty(struct free_area *area, int order, int migratetype);
+extern unsigned long area_free_count(struct free_area *area, int order,
+					int migratetype);
+
 extern struct pglist_data *first_online_pgdat(void);
 extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat);
 extern struct zone *next_zone(struct zone *zone);
diff --git a/mm/compaction.c b/mm/compaction.c
index fb548e4c7bd4..8f6274c0a04b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1334,13 +1334,13 @@ static enum compact_result __compact_finished(struct zone *zone,
 		bool can_steal;
 
 		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&area->free_list[migratetype]))
+		if (!area_empty(area, order, migratetype))
 			return COMPACT_SUCCESS;
 
 #ifdef CONFIG_CMA
 		/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
 		if (migratetype == MIGRATE_MOVABLE &&
-			!list_empty(&area->free_list[MIGRATE_CMA]))
+			!area_empty(area, order, MIGRATE_CMA))
 			return COMPACT_SUCCESS;
 #endif
 		/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 471b0526b876..3f7b074fbfdb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -288,6 +288,29 @@ EXPORT_SYMBOL(nr_online_nodes);
 
 int page_group_by_mobility_disabled __read_mostly;
 
+struct list_head *area_get_free_list(struct free_area *area, int order,
+				int migratetype)
+{
+	return &area->free_list[migratetype];
+}
+
+bool area_empty(struct free_area *area, int order, int migratetype)
+{
+	return list_empty(&area->free_list[migratetype]);
+}
+
+unsigned long area_free_count(struct free_area *area, int order,
+				int migratetype)
+{
+	unsigned long count = 0;
+	struct list_head *lh;
+
+	list_for_each(lh, &area->free_list[migratetype])
+		++count;
+
+	return count;
+}
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static inline void reset_deferred_meminit(pg_data_t *pgdat)
 {
@@ -887,12 +910,14 @@ static inline void __free_one_page(struct page *page,
 		if (pfn_valid_within(buddy_pfn) &&
 		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
 			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype]);
+				area_get_free_list(&zone->free_area[order],
+						order, migratetype));
 			goto out;
 		}
 	}
 
-	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
+	list_add(&page->lru, area_get_free_list(&zone->free_area[order], order,
+						migratetype));
 out:
 	zone->free_area[order].nr_free++;
 }
@@ -1660,7 +1685,8 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
-		list_add(&page[size].lru, &area->free_list[migratetype]);
+		list_add(&page[size].lru, area_get_free_list(area, high,
+								migratetype));
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -1802,8 +1828,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
 		area = &(zone->free_area[current_order]);
-		page = list_first_entry_or_null(&area->free_list[migratetype],
-							struct page, lru);
+		page = list_first_entry_or_null(
+			area_get_free_list(area, current_order, migratetype),
+			struct page, lru);
 		if (!page)
 			continue;
 		list_del(&page->lru);
@@ -1897,7 +1924,8 @@ static int move_freepages(struct zone *zone,
 
 		order = page_order(page);
 		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype]);
+			area_get_free_list(&zone->free_area[order], order,
+					migratetype));
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -2046,7 +2074,8 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 single_page:
 	area = &zone->free_area[current_order];
-	list_move(&page->lru, &area->free_list[start_type]);
+	list_move(&page->lru, area_get_free_list(area, current_order,
+						start_type));
 }
 
 /*
@@ -2070,7 +2099,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 		if (fallback_mt == MIGRATE_TYPES)
 			break;
 
-		if (list_empty(&area->free_list[fallback_mt]))
+		if (area_empty(area, order, fallback_mt))
 			continue;
 
 		if (can_steal_fallback(order, migratetype))
@@ -2158,7 +2187,8 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			struct free_area *area = &(zone->free_area[order]);
 
 			page = list_first_entry_or_null(
-					&area->free_list[MIGRATE_HIGHATOMIC],
+					area_get_free_list(area, order,
+							MIGRATE_HIGHATOMIC),
 					struct page, lru);
 			if (!page)
 				continue;
@@ -2272,8 +2302,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 	VM_BUG_ON(current_order == MAX_ORDER);
 
 do_steal:
-	page = list_first_entry(&area->free_list[fallback_mt],
-							struct page, lru);
+	page = list_first_entry(area_get_free_list(area, current_order,
+					fallback_mt), struct page, lru);
 
 	steal_suitable_fallback(zone, page, start_migratetype, can_steal);
 
@@ -2562,7 +2592,8 @@ void mark_free_pages(struct zone *zone)
 
 	for_each_migratetype_order(order, t) {
 		list_for_each_entry(page,
-				&zone->free_area[order].free_list[t], lru) {
+				area_get_free_list(&zone->free_area[order],
+						order, t), lru) {
 			unsigned long i;
 
 			pfn = page_to_pfn(page);
@@ -2983,13 +3014,13 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			return true;
 
 		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
-			if (!list_empty(&area->free_list[mt]))
+			if (!area_empty(area, o, mt))
 				return true;
 		}
 
 #ifdef CONFIG_CMA
 		if ((alloc_flags & ALLOC_CMA) &&
-		    !list_empty(&area->free_list[MIGRATE_CMA])) {
+		    !area_empty(area, o, MIGRATE_CMA)) {
 			return true;
 		}
 #endif
@@ -4788,7 +4819,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 
 			types[order] = 0;
 			for (type = 0; type < MIGRATE_TYPES; type++) {
-				if (!list_empty(&area->free_list[type]))
+				if (!area_empty(area, order, type))
 					types[order] |= 1 << type;
 			}
 		}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9a4441bbeef2..1ff86656cb2e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1183,14 +1183,10 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 					zone->name,
 					migratetype_names[mtype]);
 		for (order = 0; order < MAX_ORDER; ++order) {
-			unsigned long freecount = 0;
-			struct free_area *area;
-			struct list_head *curr;
+			struct free_area *area = &(zone->free_area[order]);
+			unsigned long freecount =
+				area_free_count(area, order, mtype);
 
-			area = &(zone->free_area[order]);
-
-			list_for_each(curr, &area->free_list[mtype])
-				freecount++;
 			seq_printf(m, "%6lu ", freecount);
 		}
 		seq_putc(m, '\n');
-- 
2.13.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RESEND PATCH 2/3] mm: Add page colored allocation path
  2017-08-23 10:02 [RESEND PATCH 0/3] mm: Add cache coloring mechanism Łukasz Daniluk
  2017-08-23 10:02 ` [RESEND PATCH 1/3] mm: move free_list selection to dedicated functions Łukasz Daniluk
@ 2017-08-23 10:02 ` Łukasz Daniluk
  2017-08-23 13:51   ` Dave Hansen
  2017-08-23 10:02 ` [RESEND PATCH 3/3] mm: Add helper rbtree to search for next cache color Łukasz Daniluk
  2017-08-24 12:47 ` [RESEND PATCH 0/3] mm: Add cache coloring mechanism Vlastimil Babka
  3 siblings, 1 reply; 9+ messages in thread
From: Łukasz Daniluk @ 2017-08-23 10:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: dave.hansen, lukasz.anaczkowski, Łukasz Daniluk

Create an alternative path for requesting a free_list from the buddy
system, one that takes recent allocations or the physical page offset
into account.

This solution carries the cost of additional early memory allocations
done by the kernel; the amount depends on the cache size and the minimum
order to be colored.
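
As a usage illustration (the values are hypothetical, not taken from
this patch): on a machine with a 16GB direct-mapped cache, coloring only
MAX_ORDER-1 blocks (order 10 with the default MAX_ORDER of 11) could be
requested on the kernel command line with:

        cache_color_size=16G cache_color_min_order=10

cache_color_size is parsed with memparse(), so the usual K/M/G suffixes
apply; an order outside the 0..MAX_ORDER-1 range disables the mechanism.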

Signed-off-by: Łukasz Daniluk <lukasz.daniluk@intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   8 +
 include/linux/mmzone.h                          |   5 +-
 mm/page_alloc.c                                 | 253 +++++++++++++++++++++---
 3 files changed, 243 insertions(+), 23 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 372cc66bba23..4648f1cc6665 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -455,6 +455,14 @@
 			possible to determine what the correct size should be.
 			This option provides an override for these situations.
 
+	cache_color_size=
+			[KNL] Set cache size for purposes of cache coloring
+			mechanism in buddy allocator.
+
+	cache_color_min_order=
+			[KNL] Set minimal order for which page coloring
+			mechanism will be enabled in buddy allocator.
+
 	ca_keys=	[KEYS] This parameter identifies a specific key(s) on
 			the system trusted keyring to be used for certificate
 			trust validation.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 04128890a684..d10a5421b18b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -25,7 +25,8 @@
 #else
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
-#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
+#define ORDER_NR_PAGES(order) (1 << (order))
+#define MAX_ORDER_NR_PAGES ORDER_NR_PAGES(MAX_ORDER - 1)
 
 /*
  * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
@@ -94,6 +95,8 @@ extern int page_group_by_mobility_disabled;
 
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
+	struct list_head	*colored_free_list[MIGRATE_TYPES];
+	unsigned long		next_color[MIGRATE_TYPES];
 	unsigned long		nr_free;
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f7b074fbfdb..3718b49032c2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -288,14 +288,129 @@ EXPORT_SYMBOL(nr_online_nodes);
 
 int page_group_by_mobility_disabled __read_mostly;
 
-struct list_head *area_get_free_list(struct free_area *area, int order,
-				int migratetype)
+int cache_color_min_order __read_mostly = MAX_ORDER - 1;
+
+
+/*
+ * cache_size - size of cache (in bytes)
+ */
+unsigned long long cache_size __read_mostly;
+
+/*
+ * Returns true if cache color allocations are enabled
+ */
+static inline bool cache_color_used(int order)
+{
+	return cache_size && order >= cache_color_min_order;
+}
+
+/*
+ * Returns number of cache colors in specified order
+ */
+static inline unsigned long cache_colors(int order)
+{
+	unsigned long colors = cache_size / (PAGE_SIZE * ORDER_NR_PAGES(order));
+
+	return colors ? colors : 1;
+}
+
+/*
+ * Returns cache color for page of specified order
+ */
+static inline unsigned long page_cache_color(struct page *page, int order)
+{
+	return (page_to_phys(page) % cache_size)
+		/ (PAGE_SIZE * ORDER_NR_PAGES(order));
+}
+
+/*
+ * Returns color of first non-empty free_list for purpose of cache color
+ * allocations in specified range (start until end-1).
+ */
+static inline unsigned long cache_color_find_nonempty(struct free_area *area,
+			int migratetype, unsigned long start, unsigned long end)
 {
+	unsigned long color;
+
+	for (color = start; color < end; ++color)
+		if (!list_empty(&area->colored_free_list[migratetype][color]))
+			break;
+
+	return color;
+}
+
+/*
+ * Returns free_list for cache color allocation purposes.
+ * In case page is passed, it is assumed that it will be added to free_list, so
+ * return value is based on color of the page.
+ * Otherwise it searches for next color with free pages available.
+ */
+static inline struct list_head *cache_color_area_get_free_list(
+		struct free_area *area, struct page *page, int order,
+		int migratetype)
+{
+	unsigned long color;
+	unsigned long current_color = area->next_color[migratetype];
+
+	if (page) {
+		color = page_cache_color(page, order);
+		return &area->colored_free_list[migratetype][color];
+	}
+
+	color = cache_color_find_nonempty(area, migratetype, current_color,
+					cache_colors(order));
+
+
+	if (color == cache_colors(order))
+		color = cache_color_find_nonempty(area, migratetype, 0,
+						current_color);
+
+	current_color = color + 1;
+	if (current_color >= cache_colors(order))
+		current_color = 0;
+
+	area->next_color[migratetype] = current_color;
+
+	return &area->colored_free_list[migratetype][color];
+}
+
+static inline bool cache_color_area_empty(struct free_area *area, int order,
+					int migratetype)
+{
+	return cache_color_find_nonempty(area, migratetype, 0,
+				cache_colors(order)) == cache_colors(order);
+}
+
+
+static inline unsigned long cache_color_area_free_count(struct free_area *area,
+						int order, int migratetype)
+{
+	unsigned long count = 0;
+	unsigned long color;
+	struct list_head *lh;
+
+	for (color = 0; color < cache_colors(order); ++color)
+		list_for_each(lh, &area->colored_free_list[migratetype][color])
+			++count;
+
+	return count;
+}
+
+struct list_head *area_get_free_list(struct free_area *area, struct page *page,
+					int order, int migratetype)
+{
+	if (cache_color_used(order))
+		return cache_color_area_get_free_list(area, page, order,
+							migratetype);
+
 	return &area->free_list[migratetype];
 }
 
 bool area_empty(struct free_area *area, int order, int migratetype)
 {
+	if (cache_color_used(order))
+		return cache_color_area_empty(area, order, migratetype);
+
 	return list_empty(&area->free_list[migratetype]);
 }
 
@@ -305,12 +420,67 @@ unsigned long area_free_count(struct free_area *area, int order,
 	unsigned long count = 0;
 	struct list_head *lh;
 
+	if (cache_color_used(order))
+		return cache_color_area_free_count(area, order, migratetype);
+
 	list_for_each(lh, &area->free_list[migratetype])
 		++count;
 
 	return count;
 }
 
+/*
+ * Returns pointer to allocated space that will host the array of free_lists
+ * used for cache color allocations.
+ */
+static __ref void *cache_color_alloc_lists(struct zone *zone, int order)
+{
+	const size_t size = sizeof(struct list_head) * cache_colors(order);
+
+	return alloc_bootmem_pages_node(zone->zone_pgdat, size);
+}
+
+static __init int setup_cache_color_size(char *_cache_size)
+{
+	char *retptr;
+
+	cache_size = memparse(_cache_size, &retptr);
+
+	if (retptr == _cache_size) {
+		pr_warn("Invalid cache size requested");
+		cache_size = 0;
+	}
+
+	if (cache_size == 0)
+		pr_info("Cache color pages functionality disabled");
+	else
+		pr_info("Cache size set to %llu bytes", cache_size);
+
+	return 0;
+}
+early_param("cache_color_size", setup_cache_color_size);
+
+static __init int setup_cache_color_min_order(char *_cache_color_min_order)
+{
+	int order;
+
+	if (kstrtoint(_cache_color_min_order, 10, &order) < 0) {
+		pr_warn("Invalid cache color min order requested");
+		order = -1;
+	}
+
+	if (order < 0 || order >= MAX_ORDER) {
+		pr_info("Cache color pages functionality disabled");
+		cache_color_min_order = MAX_ORDER;
+	} else {
+		pr_info("Cache min order set to %d", order);
+		cache_color_min_order = order;
+	}
+
+	return 0;
+}
+early_param("cache_color_min_order", setup_cache_color_min_order);
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static inline void reset_deferred_meminit(pg_data_t *pgdat)
 {
@@ -911,13 +1081,13 @@ static inline void __free_one_page(struct page *page,
 		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
 			list_add_tail(&page->lru,
 				area_get_free_list(&zone->free_area[order],
-						order, migratetype));
+						page, order, migratetype));
 			goto out;
 		}
 	}
 
-	list_add(&page->lru, area_get_free_list(&zone->free_area[order], order,
-						migratetype));
+	list_add(&page->lru, area_get_free_list(&zone->free_area[order], page,
+						order, migratetype));
 out:
 	zone->free_area[order].nr_free++;
 }
@@ -1685,8 +1855,8 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
-		list_add(&page[size].lru, area_get_free_list(area, high,
-								migratetype));
+		list_add(&page[size].lru, area_get_free_list(area, &page[size],
+							high, migratetype));
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -1829,7 +1999,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
 		area = &(zone->free_area[current_order]);
 		page = list_first_entry_or_null(
-			area_get_free_list(area, current_order, migratetype),
+			area_get_free_list(area, NULL, current_order,
+					migratetype),
 			struct page, lru);
 		if (!page)
 			continue;
@@ -1924,8 +2095,8 @@ static int move_freepages(struct zone *zone,
 
 		order = page_order(page);
 		list_move(&page->lru,
-			area_get_free_list(&zone->free_area[order], order,
-					migratetype));
+			  area_get_free_list(&zone->free_area[order], page,
+						order, migratetype));
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -2074,7 +2245,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 single_page:
 	area = &zone->free_area[current_order];
-	list_move(&page->lru, area_get_free_list(area, current_order,
+	list_move(&page->lru, area_get_free_list(area, page, current_order,
 						start_type));
 }
 
@@ -2187,7 +2358,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			struct free_area *area = &(zone->free_area[order]);
 
 			page = list_first_entry_or_null(
-					area_get_free_list(area, order,
+					area_get_free_list(area, NULL, order,
 							MIGRATE_HIGHATOMIC),
 					struct page, lru);
 			if (!page)
@@ -2302,7 +2473,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 	VM_BUG_ON(current_order == MAX_ORDER);
 
 do_steal:
-	page = list_first_entry(area_get_free_list(area, current_order,
+	page = list_first_entry(area_get_free_list(area, NULL, current_order,
 					fallback_mt), struct page, lru);
 
 	steal_suitable_fallback(zone, page, start_migratetype, can_steal);
@@ -2566,6 +2737,16 @@ void drain_all_pages(struct zone *zone)
 
 #ifdef CONFIG_HIBERNATION
 
+static inline void mark_free_page(struct page *page, int order)
+{
+	unsigned long i, pfn;
+
+	pfn = page_to_pfn(page);
+
+	for (i = 0; i < (1UL << order); ++i)
+		swsusp_set_page_free(pfn_to_page(pfn + i));
+}
+
 void mark_free_pages(struct zone *zone)
 {
 	unsigned long pfn, max_zone_pfn;
@@ -2591,14 +2772,20 @@ void mark_free_pages(struct zone *zone)
 		}
 
 	for_each_migratetype_order(order, t) {
-		list_for_each_entry(page,
-				area_get_free_list(&zone->free_area[order],
-						order, t), lru) {
-			unsigned long i;
+		struct free_area *area = &zone->free_area[order];
+
+		if (cache_color_used(order)) {
+			unsigned long color;
+			struct list_head *colored_lists =
+					area->colored_free_list[t];
 
-			pfn = page_to_pfn(page);
-			for (i = 0; i < (1UL << order); i++)
-				swsusp_set_page_free(pfn_to_page(pfn + i));
+			for (color = 0; color < cache_colors(order); ++color)
+				list_for_each_entry(page,
+						&colored_lists[color], lru)
+					mark_free_page(page, order);
+		} else {
+			list_for_each_entry(page, &area->free_list[t], lru)
+				mark_free_page(page, order);
 		}
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -5490,12 +5677,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	}
 }
 
+static void __meminit zone_init_cache_color(struct zone *zone, int order,
+						int migratetype)
+{
+	unsigned long c;
+	struct free_area *area = &zone->free_area[order];
+
+	if (!cache_color_used(order))
+		return;
+
+	area->next_color[migratetype] = 0;
+	area->colored_free_list[migratetype] =
+		cache_color_alloc_lists(zone, order);
+
+	for (c = 0; c < cache_colors(order); ++c)
+		INIT_LIST_HEAD(&area->colored_free_list[migratetype][c]);
+}
+
 static void __meminit zone_init_free_lists(struct zone *zone)
 {
 	unsigned int order, t;
 	for_each_migratetype_order(order, t) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
-		zone->free_area[order].nr_free = 0;
+		struct free_area *area = &zone->free_area[order];
+
+		INIT_LIST_HEAD(&area->free_list[t]);
+		area->nr_free = 0;
+
+		if (cache_color_used(order))
+			zone_init_cache_color(zone, order, t);
 	}
 }
 
-- 
2.13.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RESEND PATCH 3/3] mm: Add helper rbtree to search for next cache color
  2017-08-23 10:02 [RESEND PATCH 0/3] mm: Add cache coloring mechanism Łukasz Daniluk
  2017-08-23 10:02 ` [RESEND PATCH 1/3] mm: move free_list selection to dedicated functions Łukasz Daniluk
  2017-08-23 10:02 ` [RESEND PATCH 2/3] mm: Add page colored allocation path Łukasz Daniluk
@ 2017-08-23 10:02 ` Łukasz Daniluk
  2017-08-24 12:47 ` [RESEND PATCH 0/3] mm: Add cache coloring mechanism Vlastimil Babka
  3 siblings, 0 replies; 9+ messages in thread
From: Łukasz Daniluk @ 2017-08-23 10:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: dave.hansen, lukasz.anaczkowski, Łukasz Daniluk

Before this patch, the search for the next available cache color was
done linearly. This kind of search is problematic if the lists are
sparsely populated or there are no entries at all.

This patch aims to fix the problem by arranging the free_lists that take
part in cache-aware allocations in an RB tree structure.

For the solution to work, space for the free_lists and the RB tree nodes
has to be allocated. The required space is 5 pointers per color (struct
rb_node plus struct list_head).

The cost of the solution with the RB tree helpers can be estimated as
5 pointers * MIGRATE_TYPES * cache_colors(order), per zone per node, for
each order that we wish to be affected. For example, a 16GB cache with
only order 10 enabled requires 160KB per zone type, migratetype and
node, whereas also enabling this feature for order 9 bumps this number
up to 480KB per zone type and migratetype.
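
For clarity, the figures above follow from straightforward arithmetic
(assuming 4KB pages and 8-byte pointers):

        order-10 block:  2^10 pages * 4KB = 4MB  ->  16GB / 4MB = 4096 colors
        order-9  block:  2^9 pages  * 4KB = 2MB  ->  16GB / 2MB = 8192 colors
        per color:       5 pointers * 8 bytes = 40 bytes
        order 10 only:   4096 * 40 bytes = 160KB
        orders 9 and 10: (4096 + 8192) * 40 bytes = 480KB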

Signed-off-by: Łukasz Daniluk <lukasz.daniluk@intel.com>
---
 include/linux/mmzone.h |   5 +-
 mm/page_alloc.c        | 155 ++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 130 insertions(+), 30 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d10a5421b18b..cda726854078 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -6,6 +6,7 @@
 
 #include <linux/spinlock.h>
 #include <linux/list.h>
+#include <linux/rbtree.h>
 #include <linux/wait.h>
 #include <linux/bitops.h>
 #include <linux/cache.h>
@@ -93,9 +94,11 @@ extern int page_group_by_mobility_disabled;
 	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
 			PB_migrate_end, MIGRATETYPE_MASK)
 
+struct cache_color;
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
-	struct list_head	*colored_free_list[MIGRATE_TYPES];
+	struct rb_root		cache_colored_free_lists[MIGRATE_TYPES];
+	struct cache_color	*cache_colors[MIGRATE_TYPES];
 	unsigned long		next_color[MIGRATE_TYPES];
 	unsigned long		nr_free;
 };
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3718b49032c2..da1431c4703c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -290,6 +290,10 @@ int page_group_by_mobility_disabled __read_mostly;
 
 int cache_color_min_order __read_mostly = MAX_ORDER - 1;
 
+struct cache_color {
+	struct list_head	free_list;
+	struct rb_node		rb_tree_node;
+};
 
 /*
  * cache_size - size of cache (in bytes)
@@ -323,18 +327,102 @@ static inline unsigned long page_cache_color(struct page *page, int order)
 		/ (PAGE_SIZE * ORDER_NR_PAGES(order));
 }
 
+static inline unsigned long get_cache_color(struct cache_color *cc,
+				struct free_area *area, int migratetype)
+{
+	return cc - area->cache_colors[migratetype];
+}
+
+/*
+ * Returns pointer to cache_color structure that has first available color
+ * after color passed as argument, or NULL, if no such structure was found.
+ */
+static inline struct cache_color *cache_color_area_find_next(
+					struct free_area *area,
+					int migratetype, unsigned long color)
+{
+	struct rb_root *root = &area->cache_colored_free_lists[migratetype];
+	struct rb_node *node = root->rb_node;
+	struct cache_color *ret = NULL;
+
+	while (node) {
+		struct cache_color *cc =
+			rb_entry(node, struct cache_color, rb_tree_node);
+		unsigned long cc_color =
+			get_cache_color(cc, area, migratetype);
+
+		if (cc_color < color) {
+			node = node->rb_right;
+		} else if (cc_color > color) {
+			ret = cc;
+			node = node->rb_left;
+		} else {
+			return cc;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Inserts cache_color structure into RB tree associated with migratetype in
+ * area, in case the color was absent from the tree.
+ */
+static inline void cache_color_area_insert(struct free_area *area,
+					int migratetype, struct cache_color *cc)
+{
+	struct rb_root *root = &area->cache_colored_free_lists[migratetype];
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+	unsigned long cc_color = get_cache_color(cc, area, migratetype);
+
+	while (*new) {
+		struct cache_color *this =
+			rb_entry(*new, struct cache_color, rb_tree_node);
+		unsigned long this_color =
+			get_cache_color(this, area, migratetype);
+
+		parent = *new;
+		if (this_color < cc_color)
+			new = &((*new)->rb_right);
+		else if (this_color > cc_color)
+			new = &((*new)->rb_left);
+		else
+			return;
+	}
+
+	rb_link_node(&cc->rb_tree_node, parent, new);
+	rb_insert_color(&cc->rb_tree_node, root);
+}
+
 /*
  * Returns color of first non-empty free_list for purpose of cache color
  * allocations in specified range (start until end-1).
  */
-static inline unsigned long cache_color_find_nonempty(struct free_area *area,
-			int migratetype, unsigned long start, unsigned long end)
-{
-	unsigned long color;
-
-	for (color = start; color < end; ++color)
-		if (!list_empty(&area->colored_free_list[migratetype][color]))
-			break;
+static inline unsigned long cache_color_find_nonempty(
+		struct free_area *area, int migratetype,
+		unsigned long start, unsigned long end, bool delete_empty)
+{
+	unsigned long color = start;
+	struct cache_color *cc;
+
+	while (color < end) {
+		struct rb_root *root =
+			&area->cache_colored_free_lists[migratetype];
+		unsigned long cc_color;
+
+		cc = cache_color_area_find_next(area, migratetype, color);
+		if (!cc)
+			return end;
+
+		cc_color = get_cache_color(cc, area, migratetype);
+		if (list_empty(&cc->free_list)) {
+			if (delete_empty)
+				rb_erase(&cc->rb_tree_node, root);
+			color = cc_color + 1;
+		} else {
+			return cc_color;
+		}
+	}
 
 	return color;
 }
@@ -353,17 +441,23 @@ static inline struct list_head *cache_color_area_get_free_list(
 	unsigned long current_color = area->next_color[migratetype];
 
 	if (page) {
+		struct cache_color *cc;
+
 		color = page_cache_color(page, order);
-		return &area->colored_free_list[migratetype][color];
+		cc = &area->cache_colors[migratetype][color];
+
+		if (list_empty(&cc->free_list))
+			cache_color_area_insert(area, migratetype, cc);
+
+		return &cc->free_list;
 	}
 
 	color = cache_color_find_nonempty(area, migratetype, current_color,
-					cache_colors(order));
-
+					cache_colors(order), true);
 
 	if (color == cache_colors(order))
 		color = cache_color_find_nonempty(area, migratetype, 0,
-						current_color);
+						current_color, true);
 
 	current_color = color + 1;
 	if (current_color >= cache_colors(order))
@@ -371,14 +465,14 @@ static inline struct list_head *cache_color_area_get_free_list(
 
 	area->next_color[migratetype] = current_color;
 
-	return &area->colored_free_list[migratetype][color];
+	return &area->cache_colors[migratetype][color].free_list;
 }
 
 static inline bool cache_color_area_empty(struct free_area *area, int order,
 					int migratetype)
 {
 	return cache_color_find_nonempty(area, migratetype, 0,
-				cache_colors(order)) == cache_colors(order);
+			cache_colors(order), false) == cache_colors(order);
 }
 
 
@@ -386,12 +480,17 @@ static inline unsigned long cache_color_area_free_count(struct free_area *area,
 						int order, int migratetype)
 {
 	unsigned long count = 0;
-	unsigned long color;
 	struct list_head *lh;
+	struct rb_node *node;
 
-	for (color = 0; color < cache_colors(order); ++color)
-		list_for_each(lh, &area->colored_free_list[migratetype][color])
+	for (node = rb_first(&area->cache_colored_free_lists[migratetype]);
+			node; node = rb_next(node)) {
+		struct cache_color *cc =
+			rb_entry(node, struct cache_color, rb_tree_node);
+
+		list_for_each(lh, &cc->free_list)
 			++count;
+	}
 
 	return count;
 }
@@ -435,7 +534,7 @@ unsigned long area_free_count(struct free_area *area, int order,
  */
 static __ref void *cache_color_alloc_lists(struct zone *zone, int order)
 {
-	const size_t size = sizeof(struct list_head) * cache_colors(order);
+	const size_t size = sizeof(struct cache_color) * cache_colors(order);
 
 	return alloc_bootmem_pages_node(zone->zone_pgdat, size);
 }
@@ -2776,12 +2875,11 @@ void mark_free_pages(struct zone *zone)
 
 		if (cache_color_used(order)) {
 			unsigned long color;
-			struct list_head *colored_lists =
-					area->colored_free_list[t];
+			struct cache_color *ccs = area->cache_colors[t];
 
 			for (color = 0; color < cache_colors(order); ++color)
 				list_for_each_entry(page,
-						&colored_lists[color], lru)
+						&ccs[color].free_list, lru)
 					mark_free_page(page, order);
 		} else {
 			list_for_each_entry(page, &area->free_list[t], lru)
@@ -5680,18 +5778,17 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 static void __meminit zone_init_cache_color(struct zone *zone, int order,
 						int migratetype)
 {
-	unsigned long c;
+	unsigned long color;
 	struct free_area *area = &zone->free_area[order];
-
-	if (!cache_color_used(order))
-		return;
+	struct cache_color **ccs = &area->cache_colors[migratetype];
 
 	area->next_color[migratetype] = 0;
-	area->colored_free_list[migratetype] =
-		cache_color_alloc_lists(zone, order);
+	area->cache_colored_free_lists[migratetype] = RB_ROOT;
 
-	for (c = 0; c < cache_colors(order); ++c)
-		INIT_LIST_HEAD(&area->colored_free_list[migratetype][c]);
+	*ccs = cache_color_alloc_lists(zone, order);
+
+	for (color = 0; color < cache_colors(order); ++color)
+		INIT_LIST_HEAD(&(*ccs)[color].free_list);
 }
 
 static void __meminit zone_init_free_lists(struct zone *zone)
-- 
2.13.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RESEND PATCH 2/3] mm: Add page colored allocation path
  2017-08-23 10:02 ` [RESEND PATCH 2/3] mm: Add page colored allocation path Łukasz Daniluk
@ 2017-08-23 13:51   ` Dave Hansen
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Hansen @ 2017-08-23 13:51 UTC (permalink / raw)
  To: Łukasz Daniluk, linux-mm, linux-kernel; +Cc: lukasz.anaczkowski

[-- Attachment #1: Type: text/plain, Size: 1015 bytes --]

On 08/23/2017 03:02 AM, Łukasz Daniluk wrote:
> +	cache_color_size=
> +			[KNL] Set cache size for purposes of cache coloring
> +			mechanism in buddy allocator.
> +
> +	cache_color_min_order=
> +			[KNL] Set minimal order for which page coloring
> +			mechanism will be enabled in buddy allocator.

I guess I should send along the code I've been playing with.  I have
this broken out into a bunch of helper patches, but I'll just attach the
combined patch.

This also uses an rbtree, but it puts 'struct page' itself in the
rbtree.  Further, it reuses the zone->free_area list_heads' storage for
the rbtree head.  This means no additional space overhead and you can
also enable it at runtime without boot options.  You can also have it
enabled for any order(s) you want.

The rbtree(s) you've grafted on will not need to be walked or rebalanced
as much as the ones in my version, so that's a plus for your version.

The trick with either of these is trying to make sure the cost of all
the new branches is very low.

[-- Attachment #2: rbtree-buddy-20170706.patch --]
[-- Type: text/x-patch, Size: 24211 bytes --]

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27..b7dd66d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -104,6 +104,16 @@ struct page {
 		};
 	};
 
+	union {
+		/*
+		 * This is fugly, but it works for now.
+		 *
+		 * An rb_node needs three pointers.  Take
+		 * the space from page->lru (2) and
+		 * page->private (1).
+		 */
+		struct rb_node rb;
+	struct {
 	/*
 	 * Third double word block
 	 *
@@ -185,7 +195,9 @@ struct page {
 #endif
 		struct kmem_cache *slab_cache;	/* SL[AU]B: Pointer to slab */
 	};
+	};
 
+	};
 #ifdef CONFIG_MEMCG
 	struct mem_cgroup *mem_cgroup;
 #endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef6a13b..c70cb02 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -17,6 +17,7 @@
 #include <linux/pageblock-flags.h>
 #include <linux/page-flags-layout.h>
 #include <linux/atomic.h>
+#include <linux/rbtree.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -92,8 +93,18 @@ static inline bool is_migrate_movable(int mt)
 	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
 			PB_migrate_end, MIGRATETYPE_MASK)
 
+union free_area_list {
+	/* Used for orders where !tree_free_order(): */
+	struct list_head list;
+	/* Used for orders where tree_free_order()==true: */
+	struct {
+		struct rb_root rb_root;
+		unsigned long  rb_color;
+	};
+};
+
 struct free_area {
-	struct list_head	free_list[MIGRATE_TYPES];
+	union free_area_list	free_pages[MIGRATE_TYPES];
 	unsigned long		nr_free;
 };
 
@@ -828,6 +839,9 @@ static inline bool populated_zone(struct zone *zone)
 
 extern int movable_zone;
 
+extern unsigned long free_page_count(struct zone *zone, int order, int mt);
+extern bool free_area_empty(int order, struct free_area *area, int mt);
+
 #ifdef CONFIG_HIGHMEM
 static inline int zone_movable_is_highmem(void)
 {
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index fcbd568..fd1bd46 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -408,14 +408,14 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_OFFSET(zone, free_area);
 	VMCOREINFO_OFFSET(zone, vm_stat);
 	VMCOREINFO_OFFSET(zone, spanned_pages);
-	VMCOREINFO_OFFSET(free_area, free_list);
+	VMCOREINFO_OFFSET(free_area, free_pages);
 	VMCOREINFO_OFFSET(list_head, next);
 	VMCOREINFO_OFFSET(list_head, prev);
 	VMCOREINFO_OFFSET(vmap_area, va_start);
 	VMCOREINFO_OFFSET(vmap_area, list);
 	VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER);
 	log_buf_vmcoreinfo_setup();
-	VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
+	VMCOREINFO_LENGTH(free_area.free_pages, MIGRATE_TYPES);
 	VMCOREINFO_NUMBER(NR_FREE_PAGES);
 	VMCOREINFO_NUMBER(PG_lru);
 	VMCOREINFO_NUMBER(PG_private);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4dfba1a..c48229a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -113,6 +113,10 @@
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+extern int sysctl_tree_free_order;
+extern int tree_free_order_sysctl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos);
+
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -134,6 +138,7 @@
 #ifdef CONFIG_PERF_EVENTS
 static int six_hundred_forty_kb = 640 * 1024;
 #endif
+static int max_order_minus_one = MAX_ORDER-1;
 
 /* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
 static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
@@ -1387,6 +1392,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.extra1		= &one,
 		.extra2		= &four,
 	},
+	{
+		.procname	= "tree_free_order",
+		.data		= &sysctl_tree_free_order,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= tree_free_order_sysctl_handler,
+		.extra1		= &neg_one,
+		.extra2		= &max_order_minus_one,
+	},
 #ifdef CONFIG_COMPACTION
 	{
 		.procname	= "compact_memory",
diff --git a/mm/compaction.c b/mm/compaction.c
index 613c59e..73c0e71 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1335,13 +1335,13 @@ static enum compact_result __compact_finished(struct zone *zone,
 		bool can_steal;
 
 		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&area->free_list[migratetype]))
+		if (!free_area_empty(order, area, migratetype))
 			return COMPACT_SUCCESS;
 
 #ifdef CONFIG_CMA
 		/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
 		if (migratetype == MIGRATE_MOVABLE &&
-			!list_empty(&area->free_list[MIGRATE_CMA]))
+			!free_area_empty(order, area, MIGRATE_CMA))
 			return COMPACT_SUCCESS;
 #endif
 		/*
diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558..716dd90 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -227,7 +227,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 static inline unsigned int page_order(struct page *page)
 {
 	/* PageBuddy() must be checked by the caller */
-	return page_private(page);
+	return page->index;
 }
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2302f25..184197c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -66,6 +66,7 @@
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
 #include <linux/ftrace.h>
+#include <linux/debugfs.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -718,14 +719,14 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
 
 static inline void set_page_order(struct page *page, unsigned int order)
 {
-	set_page_private(page, order);
+	page->index = order;
 	__SetPageBuddy(page);
 }
 
 static inline void rmv_page_order(struct page *page)
 {
 	__ClearPageBuddy(page);
-	set_page_private(page, 0);
+	page->index = 0;
 }
 
 /*
@@ -771,6 +772,443 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
+#define DEFAULT_TREE_FREE_ORDER 99
+int __tree_free_order = DEFAULT_TREE_FREE_ORDER;
+static bool tree_free_order(int order)
+{
+	if (order >= __tree_free_order)
+		return true;
+	return false;
+}
+
+int rb_count(struct rb_root *root)
+{
+	int count = 0;
+	struct page *page;
+	struct page *next;
+
+	rbtree_postorder_for_each_entry_safe(page, next, root, rb)
+		count++;
+
+	return count;
+}
+
+struct page *struct_page_from_rb(struct rb_node *rb)
+{
+	return rb_entry(rb, struct page, rb);
+}
+
+u64 pfn_cache_colors = 0;
+static unsigned long pfn_cache_color(unsigned long pfn)
+{
+	if (!pfn_cache_colors)
+		return 0;
+	return pfn % pfn_cache_colors;
+}
+
+static int __init pfn_cache_colors_debugfs(void)
+{
+	umode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
+
+	if (!debugfs_create_u64("pfn-cache-colors", mode, NULL,
+				&pfn_cache_colors))
+		return -ENOMEM;
+
+	return 0;
+}
+late_initcall(pfn_cache_colors_debugfs);
+
+static int rb_cmp_page(struct page *p1, struct page *p2)
+{
+	unsigned long pfn1 = page_to_pfn(p1);
+	unsigned long pfn2 = page_to_pfn(p2);
+	unsigned long color1 = pfn_cache_color(pfn1);
+	unsigned long color2 = pfn_cache_color(pfn2);
+
+	/*
+	 * Sort first by color:
+	 */
+	if (color1 != color2)
+		return color1 < color2;
+
+	/*
+	 * Then sort by pfn.
+	 *
+	 * We really only need to sort on the non-color bits, but
+	 * this comparison yields the same results and is cheaper
+	 * than masking the color bits off.
+	 */
+	return pfn1 < pfn2;
+}
+
+static void rb_free_area_insert(struct page *page, struct rb_root *root)
+{
+	struct rb_node **rbp = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	while (*rbp) {
+		parent = *rbp;
+		if (rb_cmp_page(page, struct_page_from_rb(parent)))
+			rbp = &(*rbp)->rb_left;
+		else
+			rbp = &(*rbp)->rb_right;
+	}
+
+	rb_link_node(&page->rb, parent, rbp);
+	rb_insert_color(&page->rb, root);
+}
+
+/* Used for pages not on another list */
+static void add_to_free_area(struct page *page, int order,
+			     struct free_area *area,
+			     int migratetype)
+{
+	if (tree_free_order(order)) {
+		rb_free_area_insert(page, &area->free_pages[migratetype].rb_root);
+	} else {
+		list_add(&page->lru, &area->free_pages[migratetype].list);
+	}
+}
+
+/* Used for pages not on another list */
+static void add_to_free_area_tail(struct page *page, int order,
+				  struct free_area *area,
+				  int migratetype)
+{
+	/*
+	 * We are doing ordering based on paddr, not hot/cold,
+	 * so this is just a normal insert:
+	 */
+	if (tree_free_order(order)) {
+		add_to_free_area(page, order, area, migratetype);
+	} else {
+		list_add_tail(&page->lru, &area->free_pages[migratetype].list);
+	}
+}
+
+struct rb_root *page_rb_root(struct page *page, struct free_area *area)
+{
+	int i;
+	struct rb_node *tmp;
+	struct rb_node *head_node;
+	struct rb_root *root = NULL;
+
+	/* Walk up tree to find head node: */
+	tmp = &page->rb;
+	while (rb_parent(tmp))
+		tmp = rb_parent(tmp);
+	head_node = tmp;
+
+	/* Now go find the root among the migratetype rbtrees: */
+	for (i = 0; i < MIGRATE_TYPES; i++) {
+		if (area->free_pages[i].rb_root.rb_node == head_node) {
+			root = &area->free_pages[i].rb_root;
+			break;
+		}
+	}
+	VM_BUG_ON(!root);
+	return root;
+}
+
+/* Used for pages which are on another list */
+static void move_to_free_area(struct page *page, int order,
+			      struct free_area *area,
+			     int migratetype)
+{
+	if (tree_free_order(order)) {
+		/*
+		 * We found this page through physical scanning,
+		 * so we have no idea what migratetype it is.
+		 * We need to scan all the migratetype rbtree
+		 * roots to find the root:
+		 */
+		struct rb_root *old_root = page_rb_root(page, area);
+
+		/* Erase page from root: */
+		rb_erase(&page->rb, old_root);
+
+		rb_free_area_insert(page, &area->free_pages[migratetype].rb_root);
+	} else {
+		list_move(&page->lru, &area->free_pages[migratetype].list);
+	}
+}
+
+/*
+ * Find a page in the rbtree with the given cache color.
+ *
+ * This is confusing because we have an rbtree that has
+ * colored (red/black) nodes, and we also have the cache
+ * color of the pages, which we are searching for.
+ *
+ * This entire function refers only to the cache color
+ * of the memory, *NOT* its color in the rbtree.
+ */
+struct rb_node *rb_find_page(struct rb_root *root, unsigned long *rb_color, int order)
+{
+	struct rb_node *n;
+
+	n = root->rb_node;
+	if (!n)
+		return NULL;
+
+	/*
+	 * We do not initialize this, so rb_color can be
+	 * basically random values.  Sanitize it before
+	 * using it.
+	 */
+	if (*rb_color >= pfn_cache_colors)
+		*rb_color = 0;
+
+	/* Walk down the tree: */
+	while (n) {
+		struct page *tmp = struct_page_from_rb(n);
+		unsigned long pfn = page_to_pfn(tmp);
+
+		if (pfn_cache_color(pfn) > *rb_color) {
+			//trace_printk("going left  at color: %lx\n", pfn_cache_color(pfn));
+			/* Dead end, color not found.  Return something: */
+			if (!n->rb_left)
+				break;
+			/* Walk down the tree looking for the color: */
+			n = n->rb_left;
+		} else if (pfn_cache_color(pfn) < *rb_color) {
+			//trace_printk("going right at color: %lx\n", pfn_cache_color(pfn));
+			/*
+			 * Dead end, color not found.  Return something: */
+			if (!n->rb_right)
+				break;
+			/* Walk down the tree looking for the color: */
+			n = n->rb_right;
+		} else {
+			/* Found the color we want, return it: */
+			break;
+		}
+	}
+
+	if (pfn_cache_colors) {
+		trace_printk("rb_color search: %ld result color: %ld colors: %ld\n",
+				*rb_color,
+				pfn_cache_color(page_to_pfn(struct_page_from_rb(n))),
+				(unsigned long)pfn_cache_colors);
+		/*
+		 * Order-1 pages contain two subpages, one of each color
+		 * Order-2 pages always have 4 colors
+		 * etc...
+		 *
+		 * We increment this for the colors of all the subpages.
+		 * We need to do this because we only search by the head
+		 * page color.
+		 */
+		(*rb_color) += (1 << order);
+	}
+	return n;
+}
+
+static struct page *get_page_from_free_area(int order, struct free_area *area,
+					    int migratetype)
+{
+	struct page *page;
+	struct rb_root *root = &area->free_pages[migratetype].rb_root;
+	unsigned long *rb_color = &area->free_pages[migratetype].rb_color;
+	if (tree_free_order(order)) {
+		struct rb_node *rb = rb_find_page(root, rb_color, order);
+
+		if (!rb)
+			return NULL;
+
+		page = rb_entry(rb, struct page, rb);
+		return page;
+	} else {
+		page = list_first_entry_or_null(&area->free_pages[migratetype].list,
+					struct page, lru);
+
+		return page;
+	}
+}
+
+static void del_page_from_free_area(struct page *page, int order,
+				    struct free_area *area,
+				    int migratetype)
+{
+	if (tree_free_order(order)) {
+		struct rb_root *root = &area->free_pages[migratetype].rb_root;
+		/*
+		 * rb->__parent_color has low bits set, while list_heads do
+		 * not.  We must clear this to make PageTail() not trigger.
+		 */
+		rb_erase(&page->rb, root);
+		page->compound_head = 0;
+	} else {
+		list_del(&page->lru);
+	}
+}
+
+bool free_area_empty(int order, struct free_area *area, int migratetype)
+{
+	if (tree_free_order(order)) {
+		struct rb_root *root = &area->free_pages[migratetype].rb_root;
+		return RB_EMPTY_ROOT(root);
+	} else {
+		return list_empty(&area->free_pages[migratetype].list);
+	}
+}
+
+static void __tree_to_list(struct zone *zone, int order, struct free_area *area, int migratetype)
+{
+	struct rb_root *root = &area->free_pages[migratetype].rb_root;
+	struct list_head *new_head = &area->free_pages[migratetype].list;
+	struct list_head tmp_head;
+	struct page *page;
+	struct page *next;
+
+	INIT_LIST_HEAD(&tmp_head);
+
+	/*
+	 * rbtree_postorder_for_each_entry_safe() is not safe if
+	 * 'page' has rb_erase() called on it.  So just do this
+	 * an entry at a time until empty.
+	 */
+	while (!free_area_empty(order, area, migratetype)) {
+		rbtree_postorder_for_each_entry_safe(page, next, root, rb) {
+			rb_erase(&page->rb, root);
+			list_add(&page->lru, &tmp_head);
+			break;
+		}
+	}
+	INIT_LIST_HEAD(new_head);
+	list_splice(&tmp_head, new_head);
+}
+
+static void tree_to_list(struct zone *zone, int order)
+{
+	struct free_area *area = &(zone->free_area[order]);
+	int i;
+
+	for (i = 0; i < MIGRATE_TYPES; i++)
+		__tree_to_list(zone, order, area, i);
+}
+
+static void __list_to_tree(struct zone *zone, int order, struct free_area *area, int migratetype)
+{
+	struct list_head *head = &area->free_pages[migratetype].list;
+	struct rb_root *new_root = &area->free_pages[migratetype].rb_root;
+	struct rb_root tmp_root = RB_ROOT;
+	struct page *page;
+	struct page *next;
+
+	if (list_empty(head))
+		goto out;
+
+	list_for_each_entry_safe(page, next, head, lru) {
+		list_del_init(&page->lru);
+		rb_free_area_insert(page, &tmp_root);
+	}
+out:
+	new_root->rb_node = tmp_root.rb_node;
+}
+
+static void list_to_tree(struct zone *zone, int order)
+{
+	struct free_area *area = &(zone->free_area[order]);
+	int i;
+
+	for (i = 0; i < MIGRATE_TYPES; i++) {
+		__list_to_tree(zone, order, area, i);
+	}
+}
+
+static void set_zone_tree_free_order(struct zone *zone, int new_order)
+{
+	int i;
+
+	if (!zone_is_initialized(zone))
+		return;
+	if (!populated_zone(zone))
+		return;
+
+	for (i = 0; i < MAX_ORDER; i++) {
+		if (i < new_order && i < __tree_free_order) {
+			/* Not a tree order now, not going to be */
+		} else if (i >= new_order && i <  __tree_free_order) {
+			/* needs to be a tree and is a list now*/
+			list_to_tree(zone, i);
+		} else if (i < new_order  && i >= __tree_free_order) {
+			/* needs to be a list, but is a tree */
+			tree_to_list(zone, i);
+		} else if (i >= new_order && i >= __tree_free_order) {
+			/* Tree order now, and staying that way */
+		}
+	}
+}
+
+static void set_tree_free_order(int new_order)
+{
+	struct zone *zone;
+	unsigned long flags;
+
+	/*
+	 * Just totally disable irqs so we do not have to store
+	 * a per-zone flags for each spin_lock_irq().
+	 */
+	local_irq_save(flags);
+
+	/*
+	 * There is only one __tree_free_order for all zones,
+	 * so we need to lock them all before we make a change.
+	 */
+
+	for_each_populated_zone(zone) {
+		spin_lock(&zone->lock);
+	}
+
+	for_each_populated_zone(zone)
+		set_zone_tree_free_order(zone, new_order);
+	__tree_free_order = new_order;
+
+	for_each_populated_zone(zone) {
+		spin_unlock(&zone->lock);
+	}
+
+	local_irq_restore(flags);
+}
+
+int sysctl_tree_free_order = DEFAULT_TREE_FREE_ORDER;
+int tree_free_order_sysctl_handler(struct ctl_table *table, int write,
+				   void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int new_tree_free_order;
+	int ret;
+
+	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (ret)
+		return ret;
+
+	new_tree_free_order = sysctl_tree_free_order;
+	if (new_tree_free_order == -1)
+		new_tree_free_order = MAX_ORDER;
+
+	set_tree_free_order(new_tree_free_order);
+
+	return 0;
+}
+
+unsigned long free_page_count(struct zone *zone, int order, int mt)
+{
+	unsigned long ret = 0;
+	struct list_head *curr;
+	struct free_area *area = &(zone->free_area[order]);
+	union free_area_list *pages = &area->free_pages[mt];
+
+	if (tree_free_order(order)) {
+		ret = rb_count(&pages->rb_root);
+	} else {
+		list_for_each(curr, &pages->list)
+			ret++;
+	}
+
+	return ret;
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -834,7 +1272,7 @@ static inline void __free_one_page(struct page *page,
 		if (page_is_guard(buddy)) {
 			clear_page_guard(zone, buddy, order, migratetype);
 		} else {
-			list_del(&buddy->lru);
+			del_page_from_free_area(buddy, order, &zone->free_area[order], migratetype);
 			zone->free_area[order].nr_free--;
 			rmv_page_order(buddy);
 		}
@@ -887,13 +1325,13 @@ static inline void __free_one_page(struct page *page,
 		higher_buddy = higher_page + (buddy_pfn - combined_pfn);
 		if (pfn_valid_within(buddy_pfn) &&
 		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype]);
+			add_to_free_area_tail(page, order, &zone->free_area[order],
+					      migratetype);
 			goto out;
 		}
 	}
 
-	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
+	add_to_free_area(page, order, &zone->free_area[order], migratetype);
 out:
 	zone->free_area[order].nr_free++;
 }
@@ -1653,7 +2091,7 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
-		list_add(&page[size].lru, &area->free_list[migratetype]);
+		add_to_free_area(&page[size], high, area, migratetype);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -1795,11 +2233,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
 		area = &(zone->free_area[current_order]);
-		page = list_first_entry_or_null(&area->free_list[migratetype],
-							struct page, lru);
+		page = get_page_from_free_area(current_order, area, migratetype);
 		if (!page)
 			continue;
-		list_del(&page->lru);
+		del_page_from_free_area(page, current_order, area, migratetype);
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
@@ -1889,8 +2326,7 @@ static int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
-		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype]);
+		move_to_free_area(page, order, &zone->free_area[order], migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -2039,7 +2475,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 single_page:
 	area = &zone->free_area[current_order];
-	list_move(&page->lru, &area->free_list[start_type]);
+	move_to_free_area(page, current_order, area, start_type);
 }
 
 /*
@@ -2063,7 +2499,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 		if (fallback_mt == MIGRATE_TYPES)
 			break;
 
-		if (list_empty(&area->free_list[fallback_mt]))
+		if (free_area_empty(order, area, fallback_mt))
 			continue;
 
 		if (can_steal_fallback(order, migratetype))
@@ -2150,9 +2586,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 		for (order = 0; order < MAX_ORDER; order++) {
 			struct free_area *area = &(zone->free_area[order]);
 
-			page = list_first_entry_or_null(
-					&area->free_list[MIGRATE_HIGHATOMIC],
-					struct page, lru);
+			page = get_page_from_free_area(order, area, MIGRATE_HIGHATOMIC);
 			if (!page)
 				continue;
 
@@ -2224,8 +2658,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 		if (fallback_mt == -1)
 			continue;
 
-		page = list_first_entry(&area->free_list[fallback_mt],
-						struct page, lru);
+		page = get_page_from_free_area(current_order, area, fallback_mt);
 
 		steal_suitable_fallback(zone, page, start_migratetype,
 								can_steal);
@@ -2491,6 +2924,15 @@ void drain_all_pages(struct zone *zone)
 
 #ifdef CONFIG_HIBERNATION
 
+static void mark_free_page(struct page *page, unsigned int order)
+{
+	unsigned long i;
+	unsigned long pfn = page_to_pfn(page);
+
+	for (i = 0; i < (1UL << order); i++)
+		swsusp_set_page_free(pfn_to_page(pfn + i));
+}
+
 void mark_free_pages(struct zone *zone)
 {
 	unsigned long pfn, max_zone_pfn;
@@ -2516,13 +2958,17 @@ void mark_free_pages(struct zone *zone)
 		}
 
 	for_each_migratetype_order(order, t) {
-		list_for_each_entry(page,
-				&zone->free_area[order].free_list[t], lru) {
-			unsigned long i;
+		if (!tree_free_order(order)) {
+			list_for_each_entry(page,
+				&zone->free_area[order].free_pages[t].list, lru)
+				mark_free_page(page, order);
+		} else {
+			struct page *page, *next;
+			struct rb_root *root = &zone->free_area[order].free_pages[t].rb_root;
 
-			pfn = page_to_pfn(page);
-			for (i = 0; i < (1UL << order); i++)
-				swsusp_set_page_free(pfn_to_page(pfn + i));
+			rbtree_postorder_for_each_entry_safe(page, next, root, rb) {
+				mark_free_page(page, order);
+			}
 		}
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -2649,7 +3095,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
-	list_del(&page->lru);
+	del_page_from_free_area(page, order, &zone->free_area[order], mt);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
 
@@ -2938,13 +3384,13 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			return true;
 
 		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
-			if (!list_empty(&area->free_list[mt]))
+			if (!free_area_empty(o, area, mt))
 				return true;
 		}
 
 #ifdef CONFIG_CMA
 		if ((alloc_flags & ALLOC_CMA) &&
-		    !list_empty(&area->free_list[MIGRATE_CMA])) {
+		    !free_area_empty(o, area, MIGRATE_CMA)) {
 			return true;
 		}
 #endif
@@ -4671,7 +5117,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 
 			types[order] = 0;
 			for (type = 0; type < MIGRATE_TYPES; type++) {
-				if (!list_empty(&area->free_list[type]))
+				if (!free_area_empty(order, area, type))
 					types[order] |= 1 << type;
 			}
 		}
@@ -5343,7 +5789,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 {
 	unsigned int order, t;
 	for_each_migratetype_order(order, t) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+		if (tree_free_order(order))
+			zone->free_area[order].free_pages[t].rb_root = RB_ROOT;
+		else
+			INIT_LIST_HEAD(&zone->free_area[order].free_pages[t].list);
 		zone->free_area[order].nr_free = 0;
 	}
 }
@@ -7686,6 +8135,7 @@ void zone_pcp_reset(struct zone *zone)
 		pr_info("remove from free list %lx %d %lx\n",
 			pfn, 1 << order, end_pfn);
 #endif
+		BUG_ON(1); /* FIXME */
 		list_del(&page->lru);
 		rmv_page_order(page);
 		zone->free_area[order].nr_free--;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 76f7367..18f23c1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1180,15 +1180,8 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 					zone->name,
 					migratetype_names[mtype]);
 		for (order = 0; order < MAX_ORDER; ++order) {
-			unsigned long freecount = 0;
-			struct free_area *area;
-			struct list_head *curr;
-
-			area = &(zone->free_area[order]);
-
-			list_for_each(curr, &area->free_list[mtype])
-				freecount++;
-			seq_printf(m, "%6lu ", freecount);
+			seq_printf(m, "%6lu ",
+					free_page_count(zone, order, mtype));
 		}
 		seq_putc(m, '\n');
 	}

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RESEND PATCH 0/3] mm: Add cache coloring mechanism
  2017-08-23 10:02 [RESEND PATCH 0/3] mm: Add cache coloring mechanism Łukasz Daniluk
                   ` (2 preceding siblings ...)
  2017-08-23 10:02 ` [RESEND PATCH 3/3] mm: Add helper rbtree to search for next cache color Łukasz Daniluk
@ 2017-08-24 12:47 ` Vlastimil Babka
  2017-08-24 16:08   ` Dave Hansen
  3 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2017-08-24 12:47 UTC (permalink / raw)
  To: Łukasz Daniluk, linux-mm, linux-kernel
  Cc: dave.hansen, lukasz.anaczkowski

On 08/23/2017 12:02 PM, Łukasz Daniluk wrote:
> Patches resend with Linux Kernel Mailing List added correctly this time.
> 
> This patch series adds cache coloring mechanism that works along buddy
> allocator. The solution is opt-in, disabled by default minimally
> interferes with default allocation paths due to added if statements.
> 
> Why would such patches be needed? Big caches with low associativity
> (direct mapped caches, 2-way associative) will benefit from the solution
> the most - it allows for near constant performance through the lifetime
> of a system, despite the allocations and deallocations happening and
> reordering buddy lists.

So the obvious question, what about THPs? Their size should be enough to
contain all the colors with current caches, no? Even on KNL I didn't
find more than "32x 1 MB 16-way L2 caches". This is in addition to the
improved TLB performance, which you want to get as well for such workloads?
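
For reference, the arithmetic behind that point: with colors =
cache_size / (associativity * page_size), a 1 MB 16-way L2 has only 16
page colors, while a 2 MB THP spans 512 base pages, i.e. 32 full color
cycles. A minimal sketch of that calculation (cache parameters are
assumed example values, not measurements):

#include <stdio.h>

int main(void)
{
	/* Assumed example parameters: 1 MiB, 16-way L2, 4 KiB base pages. */
	unsigned long cache_size = 1UL << 20;
	unsigned long ways = 16;
	unsigned long page_size = 4UL << 10;

	unsigned long colors = cache_size / (ways * page_size);	/* 16 */
	unsigned long thp_pages = (2UL << 20) / page_size;		/* 512 */

	printf("colors=%lu thp_pages=%lu cycles_per_thp=%lu\n",
	       colors, thp_pages, thp_pages / colors);
	return 0;
}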

> On KNL system, the STREAM benchmark with problem size resulting in its
> internal arrays being of 16GB size will yield bandwidth performance of
> 336GB/s after fresh boot. With cache coloring patches applied and
> enabled, this performance stays near constant (most 1.5% drop observed),
> despite running benchmark multiple times with varying sizes over course
> of days.  Without these patches however, the bandwidth when using such
> allocations drops to 117GB/s - over 65% of irrecoverable performance
> penalty. Workloads that exceed set cache size suffer from decreased
> randomization of allocations with cache coloring enabled, but effect of
> cache usage disappears roughly at the same allocation size.

So was the test with THP's enabled or disabled? And what was the cache
configuration and the values of cache_color_size and
cache_color_min_order parameters?

I'm also confused about the "cache_color_min_order=" parameter. If this
wants to benefit non-THP userspace, then you would need to set it to 0,
right? Or does this mean that indeed you expect THP to not contain all
the colors, so you'd set it to the THP order (9)?
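
To make that concrete, the gating being asked about would presumably
look something like the sketch below (hypothetical illustration only,
not code from this series; the helper name is made up):

/*
 * Hypothetical: if coloring only applies from cache_color_min_order
 * upwards, then min_order = 0 is needed to color ordinary 4 KiB
 * userspace allocations, while min_order = 9 (2 MB on x86-64) would
 * confine it to THP-sized blocks.
 */
static inline int order_is_colored(unsigned int order,
				   unsigned int cache_color_min_order)
{
	return order >= cache_color_min_order;
}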

> Solution is divided into three patches. First patch is a preparatory one
> that provides interface for retrieving (information about) free lists
> contained by particular free_area structure.  Second one (parallel
> structure keeping separate list_heads for each cache color in a given
> context) shows general solution overview and is working as it is.
> However, it has serious performance implications with bigger caches due
> to linear search for next color to be used during allocations. Third
> patch (sorting list_heads using RB trees) aims to improve solution's
> performance by replacing linear search for next color with searching in
> RB tree. While improving computational performance, it imposes increased
> memory cost of the solution.
> 
> 
> Łukasz Daniluk (3):
>   mm: move free_list selection to dedicated functions
>   mm: Add page colored allocation path
>   mm: Add helper rbtree to search for next cache color
> 
>  Documentation/admin-guide/kernel-parameters.txt |   8 +
>  include/linux/mmzone.h                          |  12 +-
>  mm/compaction.c                                 |   4 +-
>  mm/page_alloc.c                                 | 381 ++++++++++++++++++++++--
>  mm/vmstat.c                                     |  10 +-
>  5 files changed, 383 insertions(+), 32 deletions(-)
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RESEND PATCH 0/3] mm: Add cache coloring mechanism
  2017-08-24 12:47 ` [RESEND PATCH 0/3] mm: Add cache coloring mechanism Vlastimil Babka
@ 2017-08-24 16:08   ` Dave Hansen
  2017-08-25  9:04     ` Vlastimil Babka
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Hansen @ 2017-08-24 16:08 UTC (permalink / raw)
  To: Vlastimil Babka, Łukasz Daniluk, linux-mm, linux-kernel
  Cc: lukasz.anaczkowski

On 08/24/2017 05:47 AM, Vlastimil Babka wrote:
> So the obvious question, what about THPs? Their size should be enough to
> contain all the colors with current caches, no? Even on KNL I didn't
> find more than "32x 1 MB 16-way L2 caches". This is in addition to the
> improved TLB performance, which you want to get as well for such workloads?

The cache in this case is "MCDRAM" which is 16GB in size.  It can be
used as normal RAM or a cache.  This patch deals with when "MCDRAM" is
in its cache mode.

It's described in the "Memory Modes" slide here:

> http://www.nersc.gov/users/computational-systems/cori/configuration/knl-processor-modes/


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RESEND PATCH 0/3] mm: Add cache coloring mechanism
  2017-08-24 16:08   ` Dave Hansen
@ 2017-08-25  9:04     ` Vlastimil Babka
  2017-08-25 13:10       ` Dave Hansen
  0 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2017-08-25  9:04 UTC (permalink / raw)
  To: Dave Hansen, Łukasz Daniluk, linux-mm, linux-kernel
  Cc: lukasz.anaczkowski

On 08/24/2017 06:08 PM, Dave Hansen wrote:
> On 08/24/2017 05:47 AM, Vlastimil Babka wrote:
>> So the obvious question, what about THPs? Their size should be enough to
>> contain all the colors with current caches, no? Even on KNL I didn't
>> find more than "32x 1 MB 16-way L2 caches". This is in addition to the
>> improved TLB performance, which you want to get as well for such workloads?
> 
> The cache in this case is "MCDRAM" which is 16GB in size.  It can be
> used as normal RAM or a cache.  This patch deals with when "MCDRAM" is
> in its cache mode.

Hm, 16GB direct mapped, that means 8k colors for 2MB THPs. Is that
really practical? Wouldn't such a workload use 1GB hugetlbfs pages? Then
it's still 16 colors to manage, but that could be done purely in
userspace, since the pages should not move in physical memory and
userspace can control where to map each phase in the virtual layout.
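
Spelling those counts out (a quick sketch that treats the color count as
cache_size / (associativity * unit_size), with the 16 GB direct-mapped
MCDRAM figures assumed from above):

#include <stdio.h>

int main(void)
{
	unsigned long long cache = 16ULL << 30;	/* 16 GiB, direct mapped (1-way) */

	printf("4KB pages: %llu colors\n", cache / (4ULL << 10));	/* 4194304 */
	printf("2MB THPs:  %llu colors\n", cache / (2ULL << 20));	/* 8192 */
	printf("1GB pages: %llu colors\n", cache / (1ULL << 30));	/* 16 */
	return 0;
}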

> It's described in the "Memory Modes" slide here:
> 
>> http://www.nersc.gov/users/computational-systems/cori/configuration/knl-processor-modes/
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RESEND PATCH 0/3] mm: Add cache coloring mechanism
  2017-08-25  9:04     ` Vlastimil Babka
@ 2017-08-25 13:10       ` Dave Hansen
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Hansen @ 2017-08-25 13:10 UTC (permalink / raw)
  To: Vlastimil Babka, Łukasz Daniluk, linux-mm, linux-kernel
  Cc: lukasz.anaczkowski

On 08/25/2017 02:04 AM, Vlastimil Babka wrote:
> On 08/24/2017 06:08 PM, Dave Hansen wrote:
>> On 08/24/2017 05:47 AM, Vlastimil Babka wrote:
>>> So the obvious question, what about THPs? Their size should be enough to
>>> contain all the colors with current caches, no? Even on KNL I didn't
>>> find more than "32x 1 MB 16-way L2 caches". This is in addition to the
>>> improved TLB performance, which you want to get as well for such workloads?
>> The cache in this case is "MCDRAM" which is 16GB in size.  It can be
>> used as normal RAM or a cache.  This patch deals with when "MCDRAM" is
>> in its cache mode.
> Hm, 16GB direct mapped, that means 8k colors for 2MB THPs. Is that
> really practical? Wouldn't such a workload use 1GB hugetlbfs pages? Then
> it's still 16 colors to manage, but that could be done purely in
> userspace, since the pages should not move in physical memory and
> userspace can control where to map each phase in the virtual layout.

There are lots of options for applications that are written with
specific knowledge of MCDRAM.  The easiest option from the kernel's
perspective is to just turn the caching mode off and treat MCDRAM as
normal RAM (it shows up in a separate NUMA node in that case).

But, one of the reasons for the cache mode in the first place was to
support applications that don't have specific knowledge of MCDRAM.  Or,
even old binaries that were compiled long ago.

In other words, I don't think this is something we can easily punt to
userspace.


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-08-25 13:10 UTC | newest]

Thread overview: 9+ messages
2017-08-23 10:02 [RESEND PATCH 0/3] mm: Add cache coloring mechanism Łukasz Daniluk
2017-08-23 10:02 ` [RESEND PATCH 1/3] mm: move free_list selection to dedicated functions Łukasz Daniluk
2017-08-23 10:02 ` [RESEND PATCH 2/3] mm: Add page colored allocation path Łukasz Daniluk
2017-08-23 13:51   ` Dave Hansen
2017-08-23 10:02 ` [RESEND PATCH 3/3] mm: Add helper rbtree to search for next cache color Łukasz Daniluk
2017-08-24 12:47 ` [RESEND PATCH 0/3] mm: Add cache coloring mechanism Vlastimil Babka
2017-08-24 16:08   ` Dave Hansen
2017-08-25  9:04     ` Vlastimil Babka
2017-08-25 13:10       ` Dave Hansen
