* [PATCH v2 0/3] Randomize free memory
@ 2018-10-04  2:15 Dan Williams
  2018-10-04  2:15 ` [PATCH v2 1/3] mm: Shuffle initial " Dan Williams
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Dan Williams @ 2018-10-04  2:15 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Dave Hansen, Kees Cook, linux-mm, linux-kernel, keescook

Changes since v1 [1]:
* Add support for shuffling hot-added memory (Andrew)
* Update cover letter and commit message to clarify the performance impact
  and relevance to future platforms

[1]: https://lkml.org/lkml/2018/9/15/366

---

Some data exfiltration and return-oriented-programming attacks rely on
the ability to infer the location of sensitive data objects. The kernel
page allocator, especially early in system boot, has predictable
first-in, first-out behavior for physical pages. Pages are freed in
physical address order when first onlined.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

Another motivation for this change is performance in the presence of a
memory-side cache. Memory-side-cache technology is expected to be
generally available on server platforms in the near future. The proposed
randomization approach has been measured to reduce the cache conflict
rate by a factor of 2.5X on a well-known Java benchmark. It avoids
performance peaks and valleys and provides more predictable performance.

More details are in the patch 1 commit message.

---

Dan Williams (3):
      mm: Shuffle initial free memory
      mm: Move buddy list manipulations into helpers
      mm: Maintain randomization of page free lists


 include/linux/list.h     |   17 +++
 include/linux/mm.h       |    8 +
 include/linux/mm_types.h |    3 +
 include/linux/mmzone.h   |   57 ++++++++++
 mm/bootmem.c             |    9 +-
 mm/compaction.c          |    4 -
 mm/memory_hotplug.c      |    2 
 mm/nobootmem.c           |    7 +
 mm/page_alloc.c          |  267 +++++++++++++++++++++++++++++++++++++++-------
 9 files changed, 321 insertions(+), 53 deletions(-)


* [PATCH v2 1/3] mm: Shuffle initial free memory
  2018-10-04  2:15 [PATCH v2 0/3] Randomize free memory Dan Williams
@ 2018-10-04  2:15 ` Dan Williams
  2018-10-04  7:48   ` Michal Hocko
  2018-10-04  2:15 ` [PATCH v2 2/3] mm: Move buddy list manipulations into helpers Dan Williams
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-10-04  2:15 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Kees Cook, Dave Hansen, linux-mm, linux-kernel, keescook

Some data exfiltration and return-oriented-programming attacks rely on
the ability to infer the location of sensitive data objects. The kernel
page allocator, especially early in system boot, has predictable
first-in, first-out behavior for physical pages. Pages are freed in
physical address order when first onlined.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
when they are initially populated with free memory at boot and at
hotplug time.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

Another motivation for this change is performance in the presence of a
memory-side cache. Memory-side-cache technology is expected to be
generally available on server platforms in the near future. The proposed
randomization approach has been measured to reduce the cache conflict
rate by a factor of 2.5X on a well-known Java benchmark. It avoids
performance peaks and valleys and provides more predictable performance.

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches, it leaves the vast bulk of memory to be allocated in a
predictable order. That ordering can be detected by a memory-side cache.

The shuffling is done in terms of 'shuffle_page_order' sized free pages,
where the default shuffle_page_order is MAX_ORDER-1, i.e. order 10 (4MB
chunks with 4K base pages). This trades off randomization granularity
against time spent shuffling. MAX_ORDER-1 was chosen to be minimally
invasive to the page allocator while still showing memory-side-cache
behavior improvements.

The performance impact of the shuffling appears to be in the noise
compared to other memory initialization work. Also the bulk of the work
is done in the background as a part of deferred_init_memmap().
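
For illustration, the core operation reduces to the following standalone
sketch (plain C, not the kernel implementation in the diff below; rand()
stands in for get_random_long(), and the real code walks 'free_area'
list entries and must validate each randomly chosen pfn):

    #include <stdlib.h>

    /*
     * Swap each order-aligned pfn with another randomly chosen entry in
     * the span, mirroring shuffle_zone_order() below. Like the kernel
     * code, it does not attempt to correct for distribution bias.
     */
    static void shuffle_pfns(unsigned long *pfns, unsigned long count)
    {
            unsigned long i, j, tmp;

            for (i = 0; i < count; i++) {
                    j = (unsigned long)rand() % count;
                    tmp = pfns[i];
                    pfns[i] = pfns[j];
                    pfns[j] = tmp;
            }
    }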

Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/list.h   |   17 +++++
 include/linux/mm.h     |    5 +
 include/linux/mmzone.h |    4 +
 mm/bootmem.c           |    9 ++-
 mm/memory_hotplug.c    |    2 +
 mm/nobootmem.c         |    7 ++
 mm/page_alloc.c        |  172 ++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 211 insertions(+), 5 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index de04cc5ed536..43f963328d7c 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
 	INIT_LIST_HEAD(old);
 }
 
+/**
+ * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
+ * @entry1: the location to place entry2
+ * @entry2: the location to place entry1
+ */
+static inline void list_swap(struct list_head *entry1,
+			     struct list_head *entry2)
+{
+	struct list_head *pos = entry2->prev;
+
+	list_del(entry2);
+	list_replace(entry1, entry2);
+	if (pos == entry1)
+		pos = entry2;
+	list_add(entry1, pos);
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a61ebe8ad4ca..ca1581814fe2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2040,7 +2040,10 @@ extern void adjust_managed_page_count(struct page *page, long count);
 extern void mem_init_print_info(const char *str);
 
 extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
-
+extern void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+		unsigned long end_pfn);
+extern void shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn);
 /* Free the reserved page into the buddy system, so it gets managed. */
 static inline void __free_reserved_page(struct page *page)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1e22d96734e0..8f8fc7dab5cb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1277,6 +1277,10 @@ void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
 #define sparse_index_init(_sec, _nid)  do {} while (0)
+static inline int pfn_present(unsigned long pfn)
+{
+	return 1;
+}
 #endif /* CONFIG_SPARSEMEM */
 
 /*
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 97db0e8e362b..7f5ff899c622 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -210,6 +210,7 @@ void __init free_bootmem_late(unsigned long physaddr, unsigned long size)
 static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
 {
 	struct page *page;
+	int nid = bdata - bootmem_node_data;
 	unsigned long *map, start, end, pages, cur, count = 0;
 
 	if (!bdata->node_bootmem_map)
@@ -219,8 +220,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
 	start = bdata->node_min_pfn;
 	end = bdata->node_low_pfn;
 
-	bdebug("nid=%td start=%lx end=%lx\n",
-		bdata - bootmem_node_data, start, end);
+	bdebug("nid=%d start=%lx end=%lx\n", nid, start, end);
 
 	while (start < end) {
 		unsigned long idx, vec;
@@ -276,7 +276,10 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
 		__free_pages_bootmem(page++, cur++, 0);
 	bdata->node_bootmem_map = NULL;
 
-	bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count);
+	shuffle_free_memory(NODE_DATA(nid), bdata->node_min_pfn,
+			bdata->node_low_pfn);
+
+	bdebug("nid=%d released=%lx\n", nid, count);
 
 	return count;
 }
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 38d94b703e9d..c75e597eecd2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -936,6 +936,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	zone->zone_pgdat->node_present_pages += onlined_pages;
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
+	shuffle_zone(zone, pfn, zone_end_pfn(zone));
+
 	if (onlined_pages) {
 		node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 439af3b765a7..40b42434e805 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -131,6 +131,7 @@ static unsigned long __init free_low_memory_core_early(void)
 {
 	unsigned long count = 0;
 	phys_addr_t start, end;
+	pg_data_t *pgdat;
 	u64 i;
 
 	memblock_clear_hotplug(0, -1);
@@ -144,8 +145,12 @@ static unsigned long __init free_low_memory_core_early(void)
 	 *  low ram will be on Node1
 	 */
 	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
-				NULL)
+				NULL) {
 		count += __free_memory_core(start, end);
+		for_each_online_pgdat(pgdat)
+			shuffle_free_memory(pgdat, PHYS_PFN(start),
+					PHYS_PFN(end));
+	}
 
 	return count;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89d2a2ab3fe6..9a1d97655c19 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -55,6 +55,7 @@
 #include <trace/events/kmem.h>
 #include <trace/events/oom.h>
 #include <linux/prefetch.h>
+#include <linux/random.h>
 #include <linux/mm_inline.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
@@ -72,6 +73,13 @@
 #include <asm/div64.h>
 #include "internal.h"
 
+/*
+ * page_alloc.shuffle_page_order gates which page orders are shuffled by
+ * shuffle_zone() during memory initialization.
+ */
+static int __read_mostly shuffle_page_order = MAX_ORDER-1;
+module_param(shuffle_page_order, int, 0444);
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_FRACTION	(8)
@@ -1035,6 +1043,168 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	return true;
 }
 
+/*
+ * For two pages to be swapped in the shuffle, they must be free (on a
+ * 'free_area' lru), have the same order, and have the same migratetype.
+ */
+static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
+{
+	struct page *page;
+
+	/*
+	 * Given we're dealing with randomly selected pfns in a zone we
+	 * need to ask questions like...
+	 */
+
+	/* ...is the pfn even in the memmap? */
+	if (!pfn_valid_within(pfn))
+		return NULL;
+
+	/* ...is the pfn in a present section or a hole? */
+	if (!pfn_present(pfn))
+		return NULL;
+
+	/* ...is the page free and currently on a free_area list? */
+	page = pfn_to_page(pfn);
+	if (!PageBuddy(page))
+		return NULL;
+
+	/*
+	 * ...is the page on the same list as the page we will
+	 * shuffle it with?
+	 */
+	if (page_order(page) != order)
+		return NULL;
+
+	return page;
+}
+
+/*
+ * Fisher-Yates shuffle the freelist which prescribes iterating through
+ * an array, pfns in this case, and randomly swapping each entry with
+ * another in the span, end_pfn - start_pfn.
+ *
+ * To keep the implementation simple it does not attempt to correct for
+ * sources of bias in the distribution, like modulo bias or
+ * pseudo-random number generator bias. I.e. the expectation is that
+ * this shuffling raises the bar for attacks that exploit the
+ * predictability of page allocations, but need not be a perfect
+ * shuffle.
+ *
+ * Note that we don't use @z->zone_start_pfn and zone_end_pfn(@z)
+ * directly since the caller may be aware of holes in the zone and can
+ * improve the accuracy of the random pfn selection.
+ */
+#define SHUFFLE_RETRY 10
+static void __meminit shuffle_zone_order(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn, const int order)
+{
+	unsigned long i, flags;
+	const int order_pages = 1 << order;
+
+	if (start_pfn < z->zone_start_pfn)
+		start_pfn = z->zone_start_pfn;
+	if (end_pfn > zone_end_pfn(z))
+		end_pfn = zone_end_pfn(z);
+
+	/* probably means that start/end were outside the zone */
+	if (end_pfn <= start_pfn)
+		return;
+	spin_lock_irqsave(&z->lock, flags);
+	start_pfn = ALIGN(start_pfn, order_pages);
+	for (i = start_pfn; i < end_pfn; i += order_pages) {
+		unsigned long j;
+		int migratetype, retry;
+		struct page *page_i, *page_j;
+
+		/*
+		 * We expect page_i, in the sub-range of a zone being
+		 * added (@start_pfn to @end_pfn), to more likely be
+		 * valid compared to page_j randomly selected in the
+		 * span @zone_start_pfn to @spanned_pages.
+		 */
+		page_i = shuffle_valid_page(i, order);
+		if (!page_i)
+			continue;
+
+		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
+			/*
+			 * Pick a random order aligned page from the
+			 * start of the zone. Use the *whole* zone here
+			 * so that if it is freed in tiny pieces that we
+			 * randomize in the whole zone, not just within
+			 * those fragments.
+			 *
+			 * Since page_j comes from a potentially sparse
+			 * address range we want to try a bit harder to
+			 * find a shuffle point for page_i.
+			 */
+			j = z->zone_start_pfn +
+				ALIGN_DOWN(get_random_long() % z->spanned_pages,
+						order_pages);
+			page_j = shuffle_valid_page(j, order);
+			if (page_j && page_j != page_i)
+				break;
+		}
+		if (retry >= SHUFFLE_RETRY) {
+			pr_debug("%s: failed to swap %#lx\n", __func__, i);
+			continue;
+		}
+
+		/*
+		 * Each migratetype corresponds to its own list, make
+		 * sure the types match otherwise we're moving pages to
+		 * lists where they do not belong.
+		 */
+		migratetype = get_pageblock_migratetype(page_i);
+		if (get_pageblock_migratetype(page_j) != migratetype) {
+			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
+			continue;
+		}
+
+		list_swap(&page_i->lru, &page_j->lru);
+
+		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
+
+		/* take it easy on the zone lock */
+		if ((i % (100 * order_pages)) == 0) {
+			spin_unlock_irqrestore(&z->lock, flags);
+			cond_resched();
+			spin_lock_irqsave(&z->lock, flags);
+		}
+	}
+	spin_unlock_irqrestore(&z->lock, flags);
+}
+
+void __meminit shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	int i;
+
+	/* shuffle all the orders at the specified order and higher */
+	for (i = shuffle_page_order; i < MAX_ORDER; i++)
+		shuffle_zone_order(z, start_pfn, end_pfn, i);
+}
+
+/**
+ * shuffle_free_memory - reduce the predictability of the page allocator
+ * @pgdat: node page data
+ * @start_pfn: Limit the shuffle to the greater of this value or zone start
+ * @end_pfn: Limit the shuffle to the less of this value or zone end
+ *
+ * While shuffle_zone() attempts to avoid holes with pfn_valid() and
+ * pfn_present(), those checks cannot detect sub-section sized holes.
+ * @start_pfn and @end_pfn limit the shuffle to the exact pages being freed.
+ */
+void __meminit shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	struct zone *z;
+
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		shuffle_zone(z, start_pfn, end_pfn);
+}
+
 #ifdef CONFIG_DEBUG_VM
 static inline bool free_pcp_prepare(struct page *page)
 {
@@ -1583,6 +1753,8 @@ static int __init deferred_init_memmap(void *data)
 	}
 	pgdat_resize_unlock(pgdat, &flags);
 
+	shuffle_zone(zone, first_init_pfn, zone_end_pfn(zone));
+
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 



* [PATCH v2 2/3] mm: Move buddy list manipulations into helpers
  2018-10-04  2:15 [PATCH v2 0/3] Randomize free memory Dan Williams
  2018-10-04  2:15 ` [PATCH v2 1/3] mm: Shuffle initial " Dan Williams
@ 2018-10-04  2:15 ` Dan Williams
  2018-10-04  2:15 ` [PATCH v2 3/3] mm: Maintain randomization of page free lists Dan Williams
  2018-10-04  7:44 ` [PATCH v2 0/3] Randomize free memory Michal Hocko
  3 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2018-10-04  2:15 UTC (permalink / raw)
  To: akpm; +Cc: Michal Hocko, Dave Hansen, linux-mm, linux-kernel, keescook

In preparation for runtime randomization of the per-zone free lists,
move all (well, most of) the list_*() manipulations in the buddy
allocator into helper functions. This provides a common control point
for injecting additional behavior when freeing pages.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h       |    3 --
 include/linux/mm_types.h |    3 ++
 include/linux/mmzone.h   |   51 ++++++++++++++++++++++++++++++++++
 mm/compaction.c          |    4 +--
 mm/page_alloc.c          |   70 ++++++++++++++++++----------------------------
 5 files changed, 84 insertions(+), 47 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca1581814fe2..1d19ec6a2b81 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -473,9 +473,6 @@ static inline void vma_set_anonymous(struct vm_area_struct *vma)
 struct mmu_gather;
 struct inode;
 
-#define page_private(page)		((page)->private)
-#define set_page_private(page, v)	((page)->private = (v))
-
 #if !defined(__HAVE_ARCH_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline int pmd_devmap(pmd_t pmd)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cd2bc939efd0..191610be62bd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -209,6 +209,9 @@ struct page {
 #define PAGE_FRAG_CACHE_MAX_SIZE	__ALIGN_MASK(32768, ~PAGE_MASK)
 #define PAGE_FRAG_CACHE_MAX_ORDER	get_order(PAGE_FRAG_CACHE_MAX_SIZE)
 
+#define page_private(page)		((page)->private)
+#define set_page_private(page, v)	((page)->private = (v))
+
 struct page_frag_cache {
 	void * va;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f8fc7dab5cb..adf9b3a7440d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -18,6 +18,8 @@
 #include <linux/pageblock-flags.h>
 #include <linux/page-flags-layout.h>
 #include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/page-flags.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -98,6 +100,55 @@ struct free_area {
 	unsigned long		nr_free;
 };
 
+/* Used for pages not on another list */
+static inline void add_to_free_area(struct page *page, struct free_area *area,
+			     int migratetype)
+{
+	list_add(&page->lru, &area->free_list[migratetype]);
+	area->nr_free++;
+}
+
+/* Used for pages not on another list */
+static inline void add_to_free_area_tail(struct page *page, struct free_area *area,
+				  int migratetype)
+{
+	list_add_tail(&page->lru, &area->free_list[migratetype]);
+	area->nr_free++;
+}
+
+/* Used for pages which are on another list */
+static inline void move_to_free_area(struct page *page, struct free_area *area,
+			     int migratetype)
+{
+	list_move(&page->lru, &area->free_list[migratetype]);
+}
+
+static inline struct page *get_page_from_free_area(struct free_area *area,
+					    int migratetype)
+{
+	return list_first_entry_or_null(&area->free_list[migratetype],
+					struct page, lru);
+}
+
+static inline void rmv_page_order(struct page *page)
+{
+	__ClearPageBuddy(page);
+	set_page_private(page, 0);
+}
+
+static inline void del_page_from_free_area(struct page *page,
+		struct free_area *area, int migratetype)
+{
+	list_del(&page->lru);
+	rmv_page_order(page);
+	area->nr_free--;
+}
+
+static inline bool free_area_empty(struct free_area *area, int migratetype)
+{
+	return list_empty(&area->free_list[migratetype]);
+}
+
 struct pglist_data;
 
 /*
diff --git a/mm/compaction.c b/mm/compaction.c
index faca45ebe62d..48736044f682 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1358,13 +1358,13 @@ static enum compact_result __compact_finished(struct zone *zone,
 		bool can_steal;
 
 		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&area->free_list[migratetype]))
+		if (!free_area_empty(area, migratetype))
 			return COMPACT_SUCCESS;
 
 #ifdef CONFIG_CMA
 		/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
 		if (migratetype == MIGRATE_MOVABLE &&
-			!list_empty(&area->free_list[MIGRATE_CMA]))
+			!free_area_empty(area, MIGRATE_CMA))
 			return COMPACT_SUCCESS;
 #endif
 		/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9a1d97655c19..b4a1598fcab5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -705,12 +705,6 @@ static inline void set_page_order(struct page *page, unsigned int order)
 	__SetPageBuddy(page);
 }
 
-static inline void rmv_page_order(struct page *page)
-{
-	__ClearPageBuddy(page);
-	set_page_private(page, 0);
-}
-
 /*
  * This function checks whether a page is free && is the buddy
  * we can coalesce a page and its buddy if
@@ -811,13 +805,11 @@ static inline void __free_one_page(struct page *page,
 		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
 		 * merge with it and move up one order.
 		 */
-		if (page_is_guard(buddy)) {
+		if (page_is_guard(buddy))
 			clear_page_guard(zone, buddy, order, migratetype);
-		} else {
-			list_del(&buddy->lru);
-			zone->free_area[order].nr_free--;
-			rmv_page_order(buddy);
-		}
+		else
+			del_page_from_free_area(buddy, &zone->free_area[order],
+					migratetype);
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -867,15 +859,13 @@ static inline void __free_one_page(struct page *page,
 		higher_buddy = higher_page + (buddy_pfn - combined_pfn);
 		if (pfn_valid_within(buddy_pfn) &&
 		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype]);
-			goto out;
+			add_to_free_area_tail(page, &zone->free_area[order],
+					      migratetype);
+			return;
 		}
 	}
 
-	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
-out:
-	zone->free_area[order].nr_free++;
+	add_to_free_area(page, &zone->free_area[order], migratetype);
 }
 
 /*
@@ -1977,7 +1967,6 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
-		list_add(&page[size].lru, &area->free_list[migratetype]);
-		area->nr_free++;
+		add_to_free_area(&page[size], area, migratetype);
 		set_page_order(&page[size], high);
 	}
@@ -2119,13 +2109,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
 		area = &(zone->free_area[current_order]);
-		page = list_first_entry_or_null(&area->free_list[migratetype],
-							struct page, lru);
+		page = get_page_from_free_area(area, migratetype);
 		if (!page)
 			continue;
-		list_del(&page->lru);
-		rmv_page_order(page);
-		area->nr_free--;
+		del_page_from_free_area(page, area, migratetype);
 		expand(zone, page, order, current_order, area, migratetype);
 		set_pcppage_migratetype(page, migratetype);
 		return page;
@@ -2215,8 +2202,7 @@ static int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
-		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype]);
+		move_to_free_area(page, &zone->free_area[order], migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -2365,7 +2351,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 single_page:
 	area = &zone->free_area[current_order];
-	list_move(&page->lru, &area->free_list[start_type]);
+	move_to_free_area(page, area, start_type);
 }
 
 /*
@@ -2389,7 +2375,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 		if (fallback_mt == MIGRATE_TYPES)
 			break;
 
-		if (list_empty(&area->free_list[fallback_mt]))
+		if (free_area_empty(area, fallback_mt))
 			continue;
 
 		if (can_steal_fallback(order, migratetype))
@@ -2476,9 +2462,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 		for (order = 0; order < MAX_ORDER; order++) {
 			struct free_area *area = &(zone->free_area[order]);
 
-			page = list_first_entry_or_null(
-					&area->free_list[MIGRATE_HIGHATOMIC],
-					struct page, lru);
+			page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
 			if (!page)
 				continue;
 
@@ -2591,8 +2575,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 	VM_BUG_ON(current_order == MAX_ORDER);
 
 do_steal:
-	page = list_first_entry(&area->free_list[fallback_mt],
-							struct page, lru);
+	page = get_page_from_free_area(area, fallback_mt);
 
 	steal_suitable_fallback(zone, page, start_migratetype, can_steal);
 
@@ -3019,6 +3002,7 @@ EXPORT_SYMBOL_GPL(split_page);
 
 int __isolate_free_page(struct page *page, unsigned int order)
 {
+	struct free_area *area = &page_zone(page)->free_area[order];
 	unsigned long watermark;
 	struct zone *zone;
 	int mt;
@@ -3043,9 +3027,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
-	list_del(&page->lru);
-	zone->free_area[order].nr_free--;
-	rmv_page_order(page);
+
+	del_page_from_free_area(page, area, mt);
 
 	/*
 	 * Set the pageblock if the isolated page is at least half of a
@@ -3339,13 +3322,13 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			continue;
 
 		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
-			if (!list_empty(&area->free_list[mt]))
+			if (!free_area_empty(area, mt))
 				return true;
 		}
 
 #ifdef CONFIG_CMA
 		if ((alloc_flags & ALLOC_CMA) &&
-		    !list_empty(&area->free_list[MIGRATE_CMA])) {
+		    !free_area_empty(area, MIGRATE_CMA)) {
 			return true;
 		}
 #endif
@@ -5191,7 +5174,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 
 			types[order] = 0;
 			for (type = 0; type < MIGRATE_TYPES; type++) {
-				if (!list_empty(&area->free_list[type]))
+				if (!free_area_empty(area, type))
 					types[order] |= 1 << type;
 			}
 		}
@@ -8220,6 +8203,9 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	spin_lock_irqsave(&zone->lock, flags);
 	pfn = start_pfn;
 	while (pfn < end_pfn) {
+		struct free_area *area;
+		int mt;
+
 		if (!pfn_valid(pfn)) {
 			pfn++;
 			continue;
@@ -8238,13 +8224,13 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		BUG_ON(page_count(page));
 		BUG_ON(!PageBuddy(page));
 		order = page_order(page);
+		area = &zone->free_area[order];
 #ifdef CONFIG_DEBUG_VM
 		pr_info("remove from free list %lx %d %lx\n",
 			pfn, 1 << order, end_pfn);
 #endif
-		list_del(&page->lru);
-		rmv_page_order(page);
-		zone->free_area[order].nr_free--;
+		mt = get_pageblock_migratetype(page);
+		del_page_from_free_area(page, area, mt);
 		for (i = 0; i < (1 << order); i++)
 			SetPageReserved((page+i));
 		pfn += (1 << order);



* [PATCH v2 3/3] mm: Maintain randomization of page free lists
  2018-10-04  2:15 [PATCH v2 0/3] Randomize free memory Dan Williams
  2018-10-04  2:15 ` [PATCH v2 1/3] mm: Shuffle initial " Dan Williams
  2018-10-04  2:15 ` [PATCH v2 2/3] mm: Move buddy list manipulations into helpers Dan Williams
@ 2018-10-04  2:15 ` Dan Williams
  2018-10-04  7:44 ` [PATCH v2 0/3] Randomize free memory Michal Hocko
  3 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2018-10-04  2:15 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Kees Cook, Dave Hansen, linux-mm, linux-kernel, keescook

When freeing a page with an order >= shuffle_page_order, randomly
select the front or back of the free list for insertion.

While the mm tries to defragment physical pages into huge pages, that
process can tend to make the page allocator more predictable over time.
Inject front/back randomness to preserve the initial randomness
established by shuffle_free_memory() when the kernel was booted.

The overhead of this manipulation is constrained by only being applied
to MAX_ORDER-1 sized pages by default.
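
As a rough standalone sketch (not part of the patch) of the amortization
used by add_to_free_area_random() in the diff below: one 64-bit random
word supplies 64 front/back decisions, so the random number generator
only runs once per 64 randomized frees. rand() here is a crude stand-in
for the kernel's get_random_u64():

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Return a random front/back decision, refilling the bit cache as needed. */
    static bool insert_at_front(void)
    {
            static uint64_t rand_word;
            static unsigned int rand_bits;
            bool front;

            if (rand_bits == 0) {
                    rand_word = ((uint64_t)rand() << 32) | (uint64_t)rand();
                    rand_bits = 64;
            }
            front = rand_word & 1;
            rand_word >>= 1;
            rand_bits--;
            return front;
    }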

Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mmzone.h |    2 ++
 mm/page_alloc.c        |   27 +++++++++++++++++++++++++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index adf9b3a7440d..4a095432843d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -98,6 +98,8 @@ extern int page_group_by_mobility_disabled;
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
+	u64			rand;
+	u8			rand_bits;
 };
 
 /* Used for pages not on another list */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b4a1598fcab5..e659119351ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -43,6 +43,7 @@
 #include <linux/mempolicy.h>
 #include <linux/memremap.h>
 #include <linux/stop_machine.h>
+#include <linux/random.h>
 #include <linux/sort.h>
 #include <linux/pfn.h>
 #include <linux/backing-dev.h>
@@ -746,6 +747,22 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
+static void add_to_free_area_random(struct page *page, struct free_area *area,
+		int migratetype)
+{
+	if (area->rand_bits == 0) {
+		area->rand_bits = 64;
+		area->rand = get_random_u64();
+	}
+
+	if (area->rand & 1)
+		add_to_free_area(page, area, migratetype);
+	else
+		add_to_free_area_tail(page, area, migratetype);
+	area->rand_bits--;
+	area->rand >>= 1;
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -851,7 +868,8 @@ static inline void __free_one_page(struct page *page,
 	 * so it's less likely to be used soon and more likely to be merged
 	 * as a higher order page
 	 */
-	if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) {
+	if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)
+			&& order < shuffle_page_order) {
 		struct page *higher_page, *higher_buddy;
 		combined_pfn = buddy_pfn & pfn;
 		higher_page = page + (combined_pfn - pfn);
@@ -865,7 +883,12 @@ static inline void __free_one_page(struct page *page,
 		}
 	}
 
-	add_to_free_area(page, &zone->free_area[order], migratetype);
+	if (order < shuffle_page_order)
+		add_to_free_area(page, &zone->free_area[order], migratetype);
+	else
+		add_to_free_area_random(page, &zone->free_area[order],
+				migratetype);
+
 }
 
 /*



* Re: [PATCH v2 0/3] Randomize free memory
  2018-10-04  2:15 [PATCH v2 0/3] Randomize free memory Dan Williams
                   ` (2 preceding siblings ...)
  2018-10-04  2:15 ` [PATCH v2 3/3] mm: Maintain randomization of page free lists Dan Williams
@ 2018-10-04  7:44 ` Michal Hocko
  2018-10-04 16:44   ` Dan Williams
  3 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-10-04  7:44 UTC (permalink / raw)
  To: Dan Williams; +Cc: akpm, Dave Hansen, Kees Cook, linux-mm, linux-kernel

On Wed 03-10-18 19:15:18, Dan Williams wrote:
> Changes since v1:
> * Add support for shuffling hot-added memory (Andrew)
> * Update cover letter and commit message to clarify the performance impact
>   and relevance to future platforms

I believe this hasn't addressed my questions in
http://lkml.kernel.org/r/20181002143015.GX18290@dhcp22.suse.cz. Namely
"
It is the more general idea that I am not really sure about. First of
all. Does it make _any_ sense to randomize 4MB blocks by default? Why
cannot we simply have it disabled? Then and more concerning question is,
does it even make sense to have this randomization applied to higher
orders than 0? Attacker might fragment the memory and keep recycling the
lowest order and get the predictable behavior that we have right now.
"

> [1]: https://lkml.org/lkml/2018/9/15/366
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 1/3] mm: Shuffle initial free memory
  2018-10-04  2:15 ` [PATCH v2 1/3] mm: Shuffle initial " Dan Williams
@ 2018-10-04  7:48   ` Michal Hocko
  2018-10-04 16:51     ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-10-04  7:48 UTC (permalink / raw)
  To: Dan Williams; +Cc: akpm, Kees Cook, Dave Hansen, linux-mm, linux-kernel

On Wed 03-10-18 19:15:24, Dan Williams wrote:
> Some data exfiltration and return-oriented-programming attacks rely on
> the ability to infer the location of sensitive data objects. The kernel
> page allocator, especially early in system boot, has predictable
> first-in-first out behavior for physical pages. Pages are freed in
> physical address order when first onlined.
> 
> Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> when they are initially populated with free memory at boot and at
> hotplug time.
> 
> Quoting Kees:
>     "While we already have a base-address randomization
>      (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
>      memory layouts would certainly be using the predictability of
>      allocation ordering (i.e. for attacks where the base address isn't
>      important: only the relative positions between allocated memory).
>      This is common in lots of heap-style attacks. They try to gain
>      control over ordering by spraying allocations, etc.
> 
>      I'd really like to see this because it gives us something similar
>      to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
> 
> Another motivation for this change is performance in the presence of a
> memory-side cache. In the future, memory-side-cache technology will be
> available on generally available server platforms. The proposed
> randomization approach has been measured to improve the cache conflict
> rate by a factor of 2.5X on a well-known Java benchmark. It avoids
> performance peaks and valleys to provide more predictable performance.
> 
> While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
> caches it leaves vast bulk of memory to be predictably in order
> allocated. That ordering can be detected by a memory side-cache.
> 
> The shuffling is done in terms of 'shuffle_page_order' sized free pages
> where the default shuffle_page_order is MAX_ORDER-1 i.e. 10, 4MB this
> trades off randomization granularity for time spent shuffling.
> MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
> while still showing memory-side cache behavior improvements.
> 
> The performance impact of the shuffling appears to be in the noise
> compared to other memory initialization work. Also the bulk of the work
> is done in the background as a part of deferred_init_memmap().

This is the biggest portion of the series and I am wondering why we
need it at all. Why isn't it sufficient to rely on patch 3 here?
Pages freed from the bootmem allocator go via the same path so they
might be shuffled at that time. Or is there any problem with that?
Is there not enough entropy at the time when this is called, or is the
final result not randomized enough (some numbers would be helpful)?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 0/3] Randomize free memory
  2018-10-04  7:44 ` [PATCH v2 0/3] Randomize free memory Michal Hocko
@ 2018-10-04 16:44   ` Dan Williams
  2018-10-06 17:01     ` Dan Williams
  2018-10-09 11:22     ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Dan Williams @ 2018-10-04 16:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Hansen, Kees Cook, Linux MM,
	Linux Kernel Mailing List

Hi Michal,

On Thu, Oct 4, 2018 at 12:53 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 03-10-18 19:15:18, Dan Williams wrote:
> > Changes since v1:
> > * Add support for shuffling hot-added memory (Andrew)
> > * Update cover letter and commit message to clarify the performance impact
> >   and relevance to future platforms
>
> I believe this hasn't addressed my questions in
> http://lkml.kernel.org/r/20181002143015.GX18290@dhcp22.suse.cz. Namely
> "
> It is the more general idea that I am not really sure about. First of
> all. Does it make _any_ sense to randomize 4MB blocks by default? Why
> cannot we simply have it disabled?

I'm not aware of any CVE that this would directly preclude, but that
said the entropy injected at 4MB boundaries raises the bar on heap
attacks. Environments that want more can adjust that with the boot
parameter. Given the potential benefits I think it would only make
sense to default-disable it if there were a significant runtime impact,
and from what I have seen there isn't.
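
For reference, since shuffle_page_order is declared with module_param()
in the built-in mm/page_alloc.c, the knob would presumably be set on the
kernel command line (assuming the standard naming for built-in module
parameters), e.g.:

    page_alloc.shuffle_page_order=9

and, given the 0444 permissions, it should be readable under
/sys/module/page_alloc/parameters/shuffle_page_order.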

> Then and more concerning question is,
> does it even make sense to have this randomization applied to higher
> orders than 0? Attacker might fragment the memory and keep recycling the
> lowest order and get the predictable behavior that we have right now.

Certainly I expect there are attacks that can operate within a 4MB
window, as I expect there are attacks that could operate within a 4K
window that would need sub-page randomization to deter. In fact I
believe that is the motivation for CONFIG_SLAB_FREELIST_RANDOM.
Combining that with page allocator randomization makes the kernel less
predictable.

Is that enough justification for this patch on its own? It's
debatable. Combine that though with the wider availability of
platforms with memory-side-cache and I think it's a reasonable default
behavior for the kernel to deploy.


* Re: [PATCH v2 1/3] mm: Shuffle initial free memory
  2018-10-04  7:48   ` Michal Hocko
@ 2018-10-04 16:51     ` Dan Williams
  2018-10-09 11:12       ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-10-04 16:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kees Cook, Dave Hansen, Linux MM,
	Linux Kernel Mailing List

On Thu, Oct 4, 2018 at 12:48 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 03-10-18 19:15:24, Dan Williams wrote:
> > Some data exfiltration and return-oriented-programming attacks rely on
> > the ability to infer the location of sensitive data objects. The kernel
> > page allocator, especially early in system boot, has predictable
> > first-in-first out behavior for physical pages. Pages are freed in
> > physical address order when first onlined.
> >
> > Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> > perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> > when they are initially populated with free memory at boot and at
> > hotplug time.
> >
> > Quoting Kees:
> >     "While we already have a base-address randomization
> >      (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
> >      memory layouts would certainly be using the predictability of
> >      allocation ordering (i.e. for attacks where the base address isn't
> >      important: only the relative positions between allocated memory).
> >      This is common in lots of heap-style attacks. They try to gain
> >      control over ordering by spraying allocations, etc.
> >
> >      I'd really like to see this because it gives us something similar
> >      to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
> >
> > Another motivation for this change is performance in the presence of a
> > memory-side cache. In the future, memory-side-cache technology will be
> > available on generally available server platforms. The proposed
> > randomization approach has been measured to improve the cache conflict
> > rate by a factor of 2.5X on a well-known Java benchmark. It avoids
> > performance peaks and valleys to provide more predictable performance.
> >
> > While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
> > caches it leaves vast bulk of memory to be predictably in order
> > allocated. That ordering can be detected by a memory side-cache.
> >
> > The shuffling is done in terms of 'shuffle_page_order' sized free pages
> > where the default shuffle_page_order is MAX_ORDER-1 i.e. 10, 4MB this
> > trades off randomization granularity for time spent shuffling.
> > MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
> > while still showing memory-side cache behavior improvements.
> >
> > The performance impact of the shuffling appears to be in the noise
> > compared to other memory initialization work. Also the bulk of the work
> > is done in the background as a part of deferred_init_memmap().
>
> This is the biggest portion of the series and I am wondering why do we
> need it at all. Why it isn't sufficient to rely on the patch 3 here?

In fact we started with only patch3 and it had no measurable impact on
the cache conflict rate.

> Pages freed from the bootmem allocator go via the same path so they
> might be shuffled at that time. Or is there any problem with that?
> Not enough entropy at the time when this is called or the final result
> is not randomized enough (some numbers would be helpful).

So the reason front-back randomization is not enough is due to the
in-order initial freeing of pages. At the start of that process
putting page1 in front or behind page0 still keeps them close
together, page2 is still near page1 and has a high chance of being
adjacent. As more pages are added ordering diversity improves, but
there is still high page locality for the low address pages and this
leads to no significant impact to the cache conflict rate. Patch3 is
enough to keep the entropy sustained over time, but it's not enough
initially.
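
To make that concrete, here is a rough standalone sketch (not kernel
code) of the effect: freeing "pages" 0..N-1 in order and inserting each
at a randomly chosen end of the list still leaves the low-numbered pages
clustered together in the final sequence, so front/back randomness alone
does not break the initial physical locality (rand() stands in for the
kernel RNG):

    #include <stdio.h>
    #include <stdlib.h>

    #define NPAGES 16

    int main(void)
    {
            int list[2 * NPAGES];
            int head = NPAGES, tail = NPAGES;

            for (int i = 0; i < NPAGES; i++) {
                    if (rand() & 1)
                            list[--head] = i;   /* insert at the front */
                    else
                            list[tail++] = i;   /* insert at the back */
            }

            /* low-numbered pages cluster where the two runs meet */
            for (int i = head; i < tail; i++)
                    printf("%d ", list[i]);
            printf("\n");
            return 0;
    }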


* Re: [PATCH v2 0/3] Randomize free memory
  2018-10-04 16:44   ` Dan Williams
@ 2018-10-06 17:01     ` Dan Williams
  2018-10-09 11:22     ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Dan Williams @ 2018-10-06 17:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Hansen, Kees Cook, Linux MM,
	Linux Kernel Mailing List

On Thu, Oct 4, 2018 at 9:44 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Hi Michal,
>
> On Thu, Oct 4, 2018 at 12:53 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 03-10-18 19:15:18, Dan Williams wrote:
> > > Changes since v1:
> > > * Add support for shuffling hot-added memory (Andrew)
> > > * Update cover letter and commit message to clarify the performance impact
> > >   and relevance to future platforms
> >
> > I believe this hasn't addressed my questions in
> > http://lkml.kernel.org/r/20181002143015.GX18290@dhcp22.suse.cz. Namely
> > "
> > It is the more general idea that I am not really sure about. First of
> > all. Does it make _any_ sense to randomize 4MB blocks by default? Why
> > cannot we simply have it disabled?
>
> I'm not aware of any CVE that this would directly preclude, but that
> said the entropy injected at 4MB boundaries raises the bar on heap
> attacks. Environments that want more can adjust that with the boot
> parameter. Given the potential benefits I think it would only make
> sense to default disable it if there was a significant runtime impact,
> from what I have seen there isn't.
>
> > Then and more concerning question is,
> > does it even make sense to have this randomization applied to higher
> > orders than 0? Attacker might fragment the memory and keep recycling the
> > lowest order and get the predictable behavior that we have right now.
>
> Certainly I expect there are attacks that can operate within a 4MB
> window, as I expect there are attacks that could operate within a 4K
> window that would need sub-page randomization to deter. In fact I
> believe that is the motivation for CONFIG_SLAB_FREELIST_RANDOM.
> Combining that with page allocator randomization makes the kernel less
> predictable.
>
> Is that enough justification for this patch on its own? It's
> debatable. Combine that though with the wider availability of
> platforms with memory-side-cache and I think it's a reasonable default
> behavior for the kernel to deploy.

Hi Michal,

Does the above address your concerns? v4.20 is perhaps the last
upstream kernel release in advance of wider hardware availability.


* Re: [PATCH v2 1/3] mm: Shuffle initial free memory
  2018-10-04 16:51     ` Dan Williams
@ 2018-10-09 11:12       ` Michal Hocko
  2018-10-09 17:36         ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-10-09 11:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Kees Cook, Dave Hansen, Linux MM,
	Linux Kernel Mailing List

On Thu 04-10-18 09:51:37, Dan Williams wrote:
> On Thu, Oct 4, 2018 at 12:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 03-10-18 19:15:24, Dan Williams wrote:
> > > Some data exfiltration and return-oriented-programming attacks rely on
> > > the ability to infer the location of sensitive data objects. The kernel
> > > page allocator, especially early in system boot, has predictable
> > > first-in-first out behavior for physical pages. Pages are freed in
> > > physical address order when first onlined.
> > >
> > > Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> > > perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> > > when they are initially populated with free memory at boot and at
> > > hotplug time.
> > >
> > > Quoting Kees:
> > >     "While we already have a base-address randomization
> > >      (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
> > >      memory layouts would certainly be using the predictability of
> > >      allocation ordering (i.e. for attacks where the base address isn't
> > >      important: only the relative positions between allocated memory).
> > >      This is common in lots of heap-style attacks. They try to gain
> > >      control over ordering by spraying allocations, etc.
> > >
> > >      I'd really like to see this because it gives us something similar
> > >      to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
> > >
> > > Another motivation for this change is performance in the presence of a
> > > memory-side cache. In the future, memory-side-cache technology will be
> > > available on generally available server platforms. The proposed
> > > randomization approach has been measured to improve the cache conflict
> > > rate by a factor of 2.5X on a well-known Java benchmark. It avoids
> > > performance peaks and valleys to provide more predictable performance.
> > >
> > > While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
> > > caches it leaves vast bulk of memory to be predictably in order
> > > allocated. That ordering can be detected by a memory side-cache.
> > >
> > > The shuffling is done in terms of 'shuffle_page_order' sized free pages
> > > where the default shuffle_page_order is MAX_ORDER-1 i.e. 10, 4MB this
> > > trades off randomization granularity for time spent shuffling.
> > > MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
> > > while still showing memory-side cache behavior improvements.
> > >
> > > The performance impact of the shuffling appears to be in the noise
> > > compared to other memory initialization work. Also the bulk of the work
> > > is done in the background as a part of deferred_init_memmap().
> >
> > This is the biggest portion of the series and I am wondering why do we
> > need it at all. Why it isn't sufficient to rely on the patch 3 here?
> 
> In fact we started with only patch3 and it had no measurable impact on
> the cache conflict rate.
> 
> > Pages freed from the bootmem allocator go via the same path so they
> > might be shuffled at that time. Or is there any problem with that?
> > Not enough entropy at the time when this is called or the final result
> > is not randomized enough (some numbers would be helpful).
> 
> So the reason front-back randomization is not enough is due to the
> in-order initial freeing of pages. At the start of that process
> putting page1 in front or behind page0 still keeps them close
> together, page2 is still near page1 and has a high chance of being
> adjacent. As more pages are added ordering diversity improves, but
> there is still high page locality for the low address pages and this
> leads to no significant impact to the cache conflict rate. Patch3 is
> enough to keep the entropy sustained over time, but it's not enough
> initially.

That should be in the changelog IMHO.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 0/3] Randomize free memory
  2018-10-04 16:44   ` Dan Williams
  2018-10-06 17:01     ` Dan Williams
@ 2018-10-09 11:22     ` Michal Hocko
  2018-10-09 17:34       ` Dan Williams
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-10-09 11:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Dave Hansen, Kees Cook, Linux MM,
	Linux Kernel Mailing List

On Thu 04-10-18 09:44:35, Dan Williams wrote:
> Hi Michal,
> 
> On Thu, Oct 4, 2018 at 12:53 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 03-10-18 19:15:18, Dan Williams wrote:
> > > Changes since v1:
> > > * Add support for shuffling hot-added memory (Andrew)
> > > * Update cover letter and commit message to clarify the performance impact
> > >   and relevance to future platforms
> >
> > I believe this hasn't addressed my questions in
> > http://lkml.kernel.org/r/20181002143015.GX18290@dhcp22.suse.cz. Namely
> > "
> > It is the more general idea that I am not really sure about. First of
> > all. Does it make _any_ sense to randomize 4MB blocks by default? Why
> > cannot we simply have it disabled?
> 
> I'm not aware of any CVE that this would directly preclude, but that
> said the entropy injected at 4MB boundaries raises the bar on heap
> attacks. Environments that want more can adjust that with the boot
> parameter. Given the potential benefits I think it would only make
> sense to default disable it if there was a significant runtime impact,
> from what I have seen there isn't.
> 
> > Then and more concerning question is,
> > does it even make sense to have this randomization applied to higher
> > orders than 0? Attacker might fragment the memory and keep recycling the
> > lowest order and get the predictable behavior that we have right now.
> 
> Certainly I expect there are attacks that can operate within a 4MB
> window, as I expect there are attacks that could operate within a 4K
> window that would need sub-page randomization to deter. In fact I
> believe that is the motivation for CONFIG_SLAB_FREELIST_RANDOM.
> Combining that with page allocator randomization makes the kernel less
> predictable.

I am sorry but this hasn't explained anything (at least to me). I can
still see a way to bypass this randomization by fragmenting the memory.
With that possibility in place this doesn't really provide the promised
additional security. So either I am missing something or the per-order
threshold is simply a wrong interface to a broken security misfeature.

> Is that enough justification for this patch on its own?

I do not think so from what I have heard so far.

> It's
> debatable. Combine that though with the wider availability of
> platforms with memory-side-cache and I think it's a reasonable default
> behavior for the kernel to deploy.

OK, this sounds a bit more interesting. I am going to speculate because
memory-side-cache is way too generic a term for me to imagine
anything specific. Many years back while at a university I was playing
with page coloring as a method to reach more stable performance
results due to reduced cache conflicts. It was not always a performance
gain but it definitely allowed for more stable, run-to-run comparable
results. I can imagine that randomization might lead to a similar
effect, although I am not sure how much, and it would be interesting to
hear more about that effect. If this is really the case then I would
expect an on/off knob to control the randomization, without something
as specific as order.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 0/3] Randomize free memory
  2018-10-09 11:22     ` Michal Hocko
@ 2018-10-09 17:34       ` Dan Williams
  2018-10-10  8:47         ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-10-09 17:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Hansen, Kees Cook, Linux MM,
	Linux Kernel Mailing List

On Tue, Oct 9, 2018 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 04-10-18 09:44:35, Dan Williams wrote:
> > Hi Michal,
> >
> > On Thu, Oct 4, 2018 at 12:53 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Wed 03-10-18 19:15:18, Dan Williams wrote:
> > > > Changes since v1:
> > > > * Add support for shuffling hot-added memory (Andrew)
> > > > * Update cover letter and commit message to clarify the performance impact
> > > >   and relevance to future platforms
> > >
> > > I believe this hasn't addressed my questions in
> > > http://lkml.kernel.org/r/20181002143015.GX18290@dhcp22.suse.cz. Namely
> > > "
> > > It is the more general idea that I am not really sure about. First of
> > > all. Does it make _any_ sense to randomize 4MB blocks by default? Why
> > > cannot we simply have it disabled?
> >
> > I'm not aware of any CVE that this would directly preclude, but that
> > said the entropy injected at 4MB boundaries raises the bar on heap
> > attacks. Environments that want more can adjust that with the boot
> > parameter. Given the potential benefits I think it would only make
> > sense to default disable it if there was a significant runtime impact,
> > from what I have seen there isn't.
> >
> > > Then and more concerning question is,
> > > does it even make sense to have this randomization applied to higher
> > > orders than 0? Attacker might fragment the memory and keep recycling the
> > > lowest order and get the predictable behavior that we have right now.
> >
> > Certainly I expect there are attacks that can operate within a 4MB
> > window, as I expect there are attacks that could operate within a 4K
> > window that would need sub-page randomization to deter. In fact I
> > believe that is the motivation for CONFIG_SLAB_FREELIST_RANDOM.
> > Combining that with page allocator randomization makes the kernel less
> > predictable.
>
> I am sorry but this hasn't explained anything (at least to me). I can
> still see a way to bypass this randomization by fragmenting the memory.
> With that possibility in place this doesn't really provide the promissed
> additional security. So either I am missing something or the per-order
> threshold is simply a wrong interface to a broken security misfeature.

I think a similar argument can be made against
CONFIG_SLAB_FREELIST_RANDOM: the randomization benefits can be defeated
with more effort, and requiring more effort is the entire point.

> > Is that enough justification for this patch on its own?
>
> I do not think so from what I have heard so far.

I'm missing what bar you are judging these patches against. My bar is
increased protection against allocation-ordering attacks, as seconded
by Kees, plus the memory-side-caching effects. That said I don't have a
known CVE in mind that would be mitigated by 4MB page shuffling.

> > It's
> > debatable. Combine that though with the wider availability of
> > platforms with memory-side-cache and I think it's a reasonable default
> > behavior for the kernel to deploy.
>
> OK, this sounds a bit more interesting. I am going to speculate because
> memory-side-cache is way too generic of a term for me to imagine
> anything specific.

No need to imagine; a memory-side cache shipped on a previous product,
as Robert linked in his comments.

> Many years back while at a university I was playing
> with page coloring as a method to reach a more stable performance
> results due to reduced cache conflicts. It was not always a performance
> gain but it definitely allowed for more stable run-to-run comparable
> results. I can imagine that a randomization might lead to a similar effect
> although I am not sure how much and it would be more interesting to hear
> about that effect.

Cache coloring is effective up until your workload no longer fits in
that color. Randomization helps to attenuate the cache conflict rate
when that happens. For workloads that may fit in the cache, and/or
environments that need more explicit cache control we have the recent
changes to numa_emulation [1] to arrange for cache sized numa nodes.

> If this is really the case then I would assume on/off
> knob to control the randomization without something as specific as
> order.

Are we only debating the enabling knob at this point? I'm not opposed
to changing that, but I do think we want to keep the rest of the
infrastructure to allow for shuffling on a variable page-size boundary
in case there are enhanced security benefits at smaller buddy-page
sizes.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cc9aec03e58f

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/3] mm: Shuffle initial free memory
  2018-10-09 11:12       ` Michal Hocko
@ 2018-10-09 17:36         ` Dan Williams
  0 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2018-10-09 17:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kees Cook, Dave Hansen, Linux MM,
	Linux Kernel Mailing List

On Tue, Oct 9, 2018 at 4:16 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 04-10-18 09:51:37, Dan Williams wrote:
> > On Thu, Oct 4, 2018 at 12:48 AM Michal Hocko <mhocko@kernel.org> wrote:
[..]
> > So the reason front-back randomization is not enough is due to the
> > in-order initial freeing of pages. At the start of that process
> > putting page1 in front or behind page0 still keeps them close
> > together, page2 is still near page1 and has a high chance of being
> > adjacent. As more pages are added ordering diversity improves, but
> > there is still high page locality for the low address pages and this
> > leads to no significant impact to the cache conflict rate. Patch3 is
> > enough to keep the entropy sustained over time, but it's not enough
> > initially.
>
> That should be in the changelog IMHO.

Fair enough, I'll fold that in when I rebase on top of -next.
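
For anyone following along, the initial-shuffle idea quoted above boils
down to a Fisher-Yates pass over fixed-size blocks after they have all
been freed in physical address order. A stand-alone sketch of just that
ordering effect (purely illustrative; the names and sizes are made up,
and the series itself works on the buddy free lists with the kernel's
random number generator, not an array):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NR_BLOCKS	16	/* pretend the zone spans 16 x 4MB blocks */

int main(void)
{
	unsigned long block[NR_BLOCKS];
	int i;

	srand(time(NULL));

	/* initial state: blocks freed in physical address order */
	for (i = 0; i < NR_BLOCKS; i++)
		block[i] = i * (4UL << 20);

	/* Fisher-Yates shuffle of the block order */
	for (i = NR_BLOCKS - 1; i > 0; i--) {
		int j = rand() % (i + 1);
		unsigned long tmp = block[i];

		block[i] = block[j];
		block[j] = tmp;
	}

	for (i = 0; i < NR_BLOCKS; i++)
		printf("free-list slot %2d -> physical offset 0x%08lx\n",
		       i, block[i]);
	return 0;
}

After the shuffle, physically adjacent blocks are no longer adjacent in
allocation order, which is what breaks the page0/page1/page2 locality
described above.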

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/3] Randomize free memory
  2018-10-09 17:34       ` Dan Williams
@ 2018-10-10  8:47         ` Michal Hocko
  2018-10-11  0:13           ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-10-10  8:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Dave Hansen, Kees Cook, Linux MM,
	Linux Kernel Mailing List

On Tue 09-10-18 10:34:55, Dan Williams wrote:
> On Tue, Oct 9, 2018 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 04-10-18 09:44:35, Dan Williams wrote:
> > > Hi Michal,
> > >
> > > On Thu, Oct 4, 2018 at 12:53 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Wed 03-10-18 19:15:18, Dan Williams wrote:
> > > > > Changes since v1:
> > > > > * Add support for shuffling hot-added memory (Andrew)
> > > > > * Update cover letter and commit message to clarify the performance impact
> > > > >   and relevance to future platforms
> > > >
> > > > I believe this hasn't addressed my questions in
> > > > http://lkml.kernel.org/r/20181002143015.GX18290@dhcp22.suse.cz. Namely
> > > > "
> > > > It is the more general idea that I am not really sure about. First of
> > > > all. Does it make _any_ sense to randomize 4MB blocks by default? Why
> > > > cannot we simply have it disabled?
> > >
> > > I'm not aware of any CVE that this would directly preclude, but that
> > > said the entropy injected at 4MB boundaries raises the bar on heap
> > > attacks. Environments that want more can adjust that with the boot
> > > parameter. Given the potential benefits I think it would only make
> > > sense to default disable it if there was a significant runtime impact,
> > > from what I have seen there isn't.
> > >
> > > > Then and more concerning question is,
> > > > does it even make sense to have this randomization applied to higher
> > > > orders than 0? Attacker might fragment the memory and keep recycling the
> > > > lowest order and get the predictable behavior that we have right now.
> > >
> > > Certainly I expect there are attacks that can operate within a 4MB
> > > window, as I expect there are attacks that could operate within a 4K
> > > window that would need sub-page randomization to deter. In fact I
> > > believe that is the motivation for CONFIG_SLAB_FREELIST_RANDOM.
> > > Combining that with page allocator randomization makes the kernel less
> > > predictable.
> >
> > I am sorry but this hasn't explained anything (at least to me). I can
> > still see a way to bypass this randomization by fragmenting the memory.
> > With that possibility in place this doesn't really provide the promissed
> > additional security. So either I am missing something or the per-order
> > threshold is simply a wrong interface to a broken security misfeature.
> 
> I think a similar argument can be made against
> CONFIG_SLAB_FREELIST_RANDOM the randomization benefits can be defeated
> with more effort, and more effort is the entire point.

If there is a relatively simple way to achieve that (which I can't say about
the slab freelist randomization because I am not familiar with the
implementation) then the feature is indeed questionable. I would
understand an argument about feasibility if bypassing it were extremely hard,
but fragmenting the memory is a relatively simple task.

> > > Is that enough justification for this patch on its own?
> >
> > I do not think so from what I have heard so far.
> 
> I'm missing what bar you are judging the criteria for these patches,
> my bar is increased protection against allocation ordering attacks as
> seconded by Kees, and the memory side caching effects.

As said above, if it is quite easy to bypass the randomization then
calling and advertising this as a security feature is dubious. Not
enough to outright nak it of course, but also not something I would put my
stamp on. And the arguments would be much more solid if they were backed by
some numbers (not only for the security aspect but also the memory-side
caching effects).

> That said I
> don't have a known CVE in my mind that would be mitigated by 4MB page
> shuffling.
> 
> > > It's
> > > debatable. Combine that though with the wider availability of
> > > platforms with memory-side-cache and I think it's a reasonable default
> > > behavior for the kernel to deploy.
> >
> > OK, this sounds a bit more interesting. I am going to speculate because
> > memory-side-cache is way too generic of a term for me to imagine
> > anything specific.
> 
> No need to imagine, a memory side cache shipped on a previous product
> as Robert linked in his comments.

Could you make this a part of the changelog? I would really appreciate
seeing justification based on actual numbers rather than a quite hand-wavy
"it helps".

> > Many years back while at a university I was playing
> > with page coloring as a method to reach a more stable performance
> > results due to reduced cache conflicts. It was not always a performance
> > gain but it definitely allowed for more stable run-to-run comparable
> > results. I can imagine that a randomization might lead to a similar effect
> > although I am not sure how much and it would be more interesting to hear
> > about that effect.
> 
> Cache coloring is effective up until your workload no longer fits in
> that color.

Yes, that was my observation back then, more or less. But even when you
do not fit into the cache, a color-aware strategy (I was playing with bin
hopping as well) produced more deterministic/stable results. But that
is just a side note as it doesn't directly relate to your change.

> Randomization helps to attenuate the cache conflict rate
> when that happens.

I can imagine that. Do we have any numbers to actually back that claim
though?

> For workloads that may fit in the cache, and/or
> environments that need more explicit cache control we have the recent
> changes to numa_emulation [1] to arrange for cache sized numa nodes.

Could you point me to some more documentation? My google-fu is failing
me and "5.2.27.5 Memory Side Cache Information Structure" doesn't point
to anything official (except for your patch referencing it).

> > If this is really the case then I would assume on/off
> > knob to control the randomization without something as specific as
> > order.
> 
> Are we only debating the enabling knob at this point? I'm not opposed
> to changing that, but I do think we want to keep the rest of the
> infrastructure to allow for shuffling on a variable page size boundary
> in case there is enhanced security benefits at smaller buddy-page
> sizes.

I am still trying to understand the benefit of this change. If the
caching effects are actually the most important part and there is a
reasonable cut-off in allocation order that keeps the randomization
effective during runtime, then I would like to understand the thinking
behind that. In other words, is the randomization at orders smaller than
the biggest order still visible in actual benchmarks? If not then an
on/off knob should be sufficient, with potential auto-tuning based on the
actual HW, rather than expecting a poor admin to google for the
$RANDOM_ORDER to use on specific HW and all the potential cargo cult that
will grow around it.

As I've said before, I am not convinced by the security argument, but
even if I am wrong here I am still quite sure that you do not want
to expose the security aspect as "choose an order to randomize from",
because admins will have no real way to know what $RANDOM_ORDER
to set. So even then it should be an on/off thing. You are going to pay
some performance because you would lose some page allocator
optimizations (e.g. pcp lists), but that is unavoidable AFAICS.

With all that being said, I think the overall idea makes sense, but you
should try much harder to explain _why_ we need it and back your
justification with actual _data_ before I would consider my ack.

> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cc9aec03e58f

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/3] Randomize free memory
  2018-10-10  8:47         ` Michal Hocko
@ 2018-10-11  0:13           ` Dan Williams
  2018-10-11 11:52             ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-10-11  0:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Hansen, Kees Cook, Linux MM,
	Linux Kernel Mailing List

On Wed, Oct 10, 2018 at 1:48 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 09-10-18 10:34:55, Dan Williams wrote:
> > On Tue, Oct 9, 2018 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Thu 04-10-18 09:44:35, Dan Williams wrote:
> > > > Hi Michal,
> > > >
> > > > On Thu, Oct 4, 2018 at 12:53 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Wed 03-10-18 19:15:18, Dan Williams wrote:
> > > > > > Changes since v1:
> > > > > > * Add support for shuffling hot-added memory (Andrew)
> > > > > > * Update cover letter and commit message to clarify the performance impact
> > > > > >   and relevance to future platforms
> > > > >
> > > > > I believe this hasn't addressed my questions in
> > > > > http://lkml.kernel.org/r/20181002143015.GX18290@dhcp22.suse.cz. Namely
> > > > > "
> > > > > It is the more general idea that I am not really sure about. First of
> > > > > all. Does it make _any_ sense to randomize 4MB blocks by default? Why
> > > > > cannot we simply have it disabled?
> > > >
> > > > I'm not aware of any CVE that this would directly preclude, but that
> > > > said the entropy injected at 4MB boundaries raises the bar on heap
> > > > attacks. Environments that want more can adjust that with the boot
> > > > parameter. Given the potential benefits I think it would only make
> > > > sense to default disable it if there was a significant runtime impact,
> > > > from what I have seen there isn't.
> > > >
> > > > > Then and more concerning question is,
> > > > > does it even make sense to have this randomization applied to higher
> > > > > orders than 0? Attacker might fragment the memory and keep recycling the
> > > > > lowest order and get the predictable behavior that we have right now.
> > > >
> > > > Certainly I expect there are attacks that can operate within a 4MB
> > > > window, as I expect there are attacks that could operate within a 4K
> > > > window that would need sub-page randomization to deter. In fact I
> > > > believe that is the motivation for CONFIG_SLAB_FREELIST_RANDOM.
> > > > Combining that with page allocator randomization makes the kernel less
> > > > predictable.
> > >
> > > I am sorry but this hasn't explained anything (at least to me). I can
> > > still see a way to bypass this randomization by fragmenting the memory.
> > > With that possibility in place this doesn't really provide the promissed
> > > additional security. So either I am missing something or the per-order
> > > threshold is simply a wrong interface to a broken security misfeature.
> >
> > I think a similar argument can be made against
> > CONFIG_SLAB_FREELIST_RANDOM the randomization benefits can be defeated
> > with more effort, and more effort is the entire point.
>
> If there is relatively simple way to achieve that (which I dunno about
> the slab free list randomization because I am not familiar with the
> implementation) then the feature is indeed questionable. I would
> understand an argument about feasibility if bypassing was extremely hard
> but fragmenting the memory is relatively a simple task.
>
> > > > Is that enough justification for this patch on its own?
> > >
> > > I do not think so from what I have heard so far.
> >
> > I'm missing what bar you are judging the criteria for these patches,
> > my bar is increased protection against allocation ordering attacks as
> > seconded by Kees, and the memory side caching effects.
>
> As said above, if it is quite easy to bypass the randomization then
> calling and advertizing this as a security feature is a dubious. Not
> enough to ouright nak it of course but also not something I would put my
> stamp on. And arguments would be much more solid if they were backed by
> some numbers (not only for the security aspect but also the side caching
> effects).

In fact you don't even need to fragment, since you'll have 4MB
contiguous targets by default, but that's not the point. We'll now
have more entropy in the allocation order to complement the entropy
introduced at the per-SLAB level with CONFIG_SLAB_FREELIST_RANDOM.

...and now that I've made that argument I think I've come around to
your point about the shuffle_page_order parameter. The only entity
that might have a better clue about "safer" shuffle orders than
MAX_ORDER is the distribution provider. I'll cut a v4 to move all of
this under a configuration symbol and make the shuffle order a compile
time setting.
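
To sketch the shape of that (the config symbols and helper below are
hypothetical placeholders to illustrate a compile-time shuffle order,
not the v4 code):

#include <linux/types.h>
#include <linux/mmzone.h>	/* MAX_ORDER */

/* Hypothetical sketch only -- symbol names are made up. */
#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
# ifdef CONFIG_SHUFFLE_PAGE_ORDER
#  define SHUFFLE_ORDER	CONFIG_SHUFFLE_PAGE_ORDER
# else
#  define SHUFFLE_ORDER	(MAX_ORDER - 1)
# endif

static inline bool is_shuffle_order(int order)
{
	/* only free lists at or above the configured order get shuffled */
	return order >= SHUFFLE_ORDER;
}
#else
static inline bool is_shuffle_order(int order)
{
	return false;
}
#endif

That would keep the order out of the admin's hands entirely and leave the
decision with whoever builds the kernel.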

> > That said I
> > don't have a known CVE in my mind that would be mitigated by 4MB page
> > shuffling.
> >
> > > > It's
> > > > debatable. Combine that though with the wider availability of
> > > > platforms with memory-side-cache and I think it's a reasonable default
> > > > behavior for the kernel to deploy.
> > >
> > > OK, this sounds a bit more interesting. I am going to speculate because
> > > memory-side-cache is way too generic of a term for me to imagine
> > > anything specific.
> >
> > No need to imagine, a memory side cache shipped on a previous product
> > as Robert linked in his comments.
>
> Could you make this a part of the changelog? I would really appreciate
> to see justification based on actual numbers rather than quite hand wavy
> "it helps".

I put in the changelog that these patches reduced the cache conflict
rate by 2.5X on a Java benchmark. I specifically did not put KNL data
directly into the changelog because that is not a general purpose
server platform.

Note, you can also think about this in purely architectural terms.
I.e. that for a direct mapped cache anywhere in a system you can have
a near zero cache conflict rate on a first run of a workload and high
conflict rate on a second run based on how lucky you are with memory
allocation placement relative to the first run. Randomization keeps
you out of such performance troughs and provides more reliable average
performance.  With the numa emulation patch I referenced an
administrator could constrain a workload to run in a cache-sized
subset of the available memory if they really know what they are doing
and need firmer guarantees.
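
To put made-up numbers on the direct-mapped point: two physical addresses
that differ by a multiple of the cache size compete for the same slot, so
whether a second run of a workload conflicts with resident data is purely
a function of where the allocator happened to place it. A toy model
(sizes are invented, not from any product):

#include <stdio.h>

#define CACHE_SIZE	(16ULL << 30)	/* 16GB direct-mapped near memory */
#define LINE_SIZE	64ULL

static unsigned long long slot(unsigned long long paddr)
{
	/* direct mapped: physical address modulo the cache size */
	return (paddr % CACHE_SIZE) / LINE_SIZE;
}

int main(void)
{
	unsigned long long first_run = 4ULL << 30;		/* placed at 4GB */
	unsigned long long lucky     = 8ULL << 30;		/* different slot */
	unsigned long long unlucky   = first_run + CACHE_SIZE;	/* same slot */

	printf("first run:          slot %llu\n", slot(first_run));
	printf("second run, lucky:  slot %llu\n", slot(lucky));
	printf("second run, unlucky: slot %llu\n", slot(unlucky));
	return 0;
}

Shuffling does not eliminate such collisions, it just keeps any particular
placement from repeating systematically from run to run.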

The risk if Linux does not have this capability is unstable hacks like
zonesort and rebooting, as referenced in that KNL article, which are
not suitable for a general purpose kernel / platform.

> > > Many years back while at a university I was playing
> > > with page coloring as a method to reach a more stable performance
> > > results due to reduced cache conflicts. It was not always a performance
> > > gain but it definitely allowed for more stable run-to-run comparable
> > > results. I can imagine that a randomization might lead to a similar effect
> > > although I am not sure how much and it would be more interesting to hear
> > > about that effect.
> >
> > Cache coloring is effective up until your workload no longer fits in
> > that color.
>
> Yes, that was my observation back then more or less. But even when you
> do not fit into the cache a color aware strategy (I was playing with bin
> hoping as well) produced a more deterministic/stable results. But that
> is just a side note as it doesn't directly relate to your change.
>
> > Randomization helps to attenuate the cache conflict rate
> > when that happens.
>
> I can imagine that. Do we have any numbers to actually back that claim
> though?
>

Yes, 2.5X cache conflict rate reduction, in the change log.

> > For workloads that may fit in the cache, and/or
> > environments that need more explicit cache control we have the recent
> > changes to numa_emulation [1] to arrange for cache sized numa nodes.
>
> Could you point me to some more documentation. My google-fu is failing
> me and "5.2.27.5 Memory Side Cache Information Structure" doesn't point
> to anything official (except for your patch referencing it).

http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf

>
> > > If this is really the case then I would assume on/off
> > > knob to control the randomization without something as specific as
> > > order.
> >
> > Are we only debating the enabling knob at this point? I'm not opposed
> > to changing that, but I do think we want to keep the rest of the
> > infrastructure to allow for shuffling on a variable page size boundary
> > in case there is enhanced security benefits at smaller buddy-page
> > sizes.
>
> I am still trying to understand the benefit of this change. If the
> caching effects are actually the most important part and there is a
> reasonable cut in allocation order to keep the randomization effective
> during the runtime then I would like to understand the thinking behind
> that. In other words does the randomization at smaller orders than
> biggest order still visible in actual benchmarks? If not then on/off
> knob should be sufficient with potential auto tuning based on actual HW
> rather than to expect poor admin to google for $RANDOM_ORDER to use on a
> specific HW and all the potential cargo cult that will grow around it.

So, I've come around to your viewpoint on this. Especially when we have
CONFIG_SLAB_FREELIST_RANDOM, the security benefit of shuffling at smaller
than MAX_ORDER granularity is hard to justify and likely does not need
kernel-parameter-based control.

> As I've said before, I am not convinced about the security argument but
> even if I am wrong here then I am still quite sure that you do not want
> to expose the security aspect as "chose an order to randomize from"
> because admins will have no real way to know what is the $RANDOM_ORDER
> to set. So even then it should be on/off thing. You are going to pay
> some of the performance because you would lose some page allocator
> optimizations (e.g. pcp lists) but that is unavoidable AFAICS.
>
> With all that being said, I think the overal idea makes sense but you
> should try much harder to explain _why_ we need it and back your
> justification by actual _data_ before I would consider my ack.

I don't have a known CVE; I only have the ack of people more
knowledgeable about security than myself, like Kees, saying in effect,
"yes, this complicates attacks". If you won't take Kees' word for it,
I'm not sure what other justification I can present on the security
aspect.

2.5X cache conflict reduction on a Java benchmark workload that
exceeds the cache size by multiple factors is the data I can provide
today. Post launch it becomes easier to share more precise data, but
that's post 4.20. The hope of course is to have this capability
available in an upstream released kernel in advance of wider hardware
availability.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/3] Randomize free memory
  2018-10-11  0:13           ` Dan Williams
@ 2018-10-11 11:52             ` Michal Hocko
  2018-10-11 18:03               ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-10-11 11:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Dave Hansen, Kees Cook, Linux MM,
	Linux Kernel Mailing List

On Wed 10-10-18 17:13:14, Dan Williams wrote:
[...]
> On Wed, Oct 10, 2018 at 1:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> ...and now that I've made that argument I think I've come around to
> your point about the shuffle_page_order parameter. The only entity
> that might have a better clue about "safer" shuffle orders than
> MAX_ORDER is the distribution provider.

And how is somebody providing a kernel for a large variety of workloads
supposed to know?

[...]

> Note, you can also think about this just on pure architecture terms.
> I.e. that for a direct mapped cache anywhere in a system you can have
> a near zero cache conflict rate on a first run of a workload and high
> conflict rate on a second run based on how lucky you are with memory
> allocation placement relative to the first run. Randomization keeps
> you out of such performance troughs and provides more reliable average
> performance.

I am not disagreeing here. That reliable average might be worse than
what you get with the non-randomized case. And that might be a fair
deal for some workloads. You are, however, providing functionality
which is enabled by default without any actual numbers (well, except for
_a_java_ workload that seems to benefit), so you should really do your
homework, stop handwaving, and give us some numbers and/or convincing
arguments please.

> With the numa emulation patch I referenced an
> administrator could constrain a workload to run in a cache-sized
> subset of the available memory if they really know what they are doing
> and need firmer guarantees.

Then mention how and what you can achieve by that in the changelog.

> The risk if Linux does not have this capability is unstable hacks like
> zonesort and rebooting, as referenced in that KNL article, which are
> not suitable for a general purpose kernel / platform.

We could have lived without those for quite some time so this doesn't
seem to be anything super urgent to push through without a proper
justification.

> > > > Many years back while at a university I was playing
> > > > with page coloring as a method to reach a more stable performance
> > > > results due to reduced cache conflicts. It was not always a performance
> > > > gain but it definitely allowed for more stable run-to-run comparable
> > > > results. I can imagine that a randomization might lead to a similar effect
> > > > although I am not sure how much and it would be more interesting to hear
> > > > about that effect.
> > >
> > > Cache coloring is effective up until your workload no longer fits in
> > > that color.
> >
> > Yes, that was my observation back then more or less. But even when you
> > do not fit into the cache a color aware strategy (I was playing with bin
> > hoping as well) produced a more deterministic/stable results. But that
> > is just a side note as it doesn't directly relate to your change.
> >
> > > Randomization helps to attenuate the cache conflict rate
> > > when that happens.
> >
> > I can imagine that. Do we have any numbers to actually back that claim
> > though?
> >
> 
> Yes, 2.5X cache conflict rate reduction, in the change log.

Which is a single benchmark result that is not even described in enough
detail to reproduce the measurement. I am sorry for nagging
here but I would expect something less obscure. How does this behave for
the usual cache-sensitive workloads that we test? I myself am not
a benchmark person but I am pretty sure there are people who can help
you find proper ones to run and evaluate.

> > > For workloads that may fit in the cache, and/or
> > > environments that need more explicit cache control we have the recent
> > > changes to numa_emulation [1] to arrange for cache sized numa nodes.
> >
> > Could you point me to some more documentation. My google-fu is failing
> > me and "5.2.27.5 Memory Side Cache Information Structure" doesn't point
> > to anything official (except for your patch referencing it).
> 
> http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf

Thanks!

[...]

> > With all that being said, I think the overal idea makes sense but you
> > should try much harder to explain _why_ we need it and back your
> > justification by actual _data_ before I would consider my ack.
> 
> I don't have a known CVE, I only have the ack of people more
> knowledgeable about security than myself like Kees to say in effect,
> "yes, this complicates attacks". If you won't take Kees' word for it,
> I'm not sure what other justification I can present on the security
> aspect.

In general (nothing against Kees here of course), I prefer a stronger
justification than "somebody said it will make attacks harder". At least
my concern about fragmented memory, which is not really hard to achieve
at all, should be reasonably addressed. I am fully aware there is no
absolute measure here, but making something harder under ideal conditions
doesn't really help against common attack strategies which can prepare the
system into a state that exploits allocation predictability. I am
no expert here, but if an attacker can deduce the allocation pattern then
fragmenting the memory is one easy step to overcome what people would
consider a security measure.

So color me unconvinced for now.

> 2.5X cache conflict reduction on a Java benchmark workload that the
> exceeds the cache size by multiple factors is the data I can provide
> today. Post launch it becomes easier to share more precise data, but
> that's post 4.20. The hope of course is to have this capability
> available in an upstream released kernel in advance of wider hardware
> availability.

I will not comment on timing but in general, any performance related
changes should come with numbers for a wider variety of workloads.

In any case, I believe the change itself is not controversial as long as it
is opt-in (potentially auto-tuned based on the specific HW) with a reasonable
API. And no, I do not consider $RANDOM_ORDER a good interface.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/3] Randomize free memory
  2018-10-11 11:52             ` Michal Hocko
@ 2018-10-11 18:03               ` Dan Williams
  2018-10-18 13:44                 ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-10-11 18:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Hansen, Kees Cook, Linux MM,
	Linux Kernel Mailing List

On Thu, Oct 11, 2018 at 4:56 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 10-10-18 17:13:14, Dan Williams wrote:
> [...]
> > On Wed, Oct 10, 2018 at 1:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> > ...and now that I've made that argument I think I've come around to
> > your point about the shuffle_page_order parameter. The only entity
> > that might have a better clue about "safer" shuffle orders than
> > MAX_ORDER is the distribution provider.
>
> And how is somebody providing a kernel for large variety of workloads
> supposed to know?

True, this would be a much easier discussion with a wider / deeper data set.

>
> [...]
>
> > Note, you can also think about this just on pure architecture terms.
> > I.e. that for a direct mapped cache anywhere in a system you can have
> > a near zero cache conflict rate on a first run of a workload and high
> > conflict rate on a second run based on how lucky you are with memory
> > allocation placement relative to the first run. Randomization keeps
> > you out of such performance troughs and provides more reliable average
> > performance.
>
> I am not disagreeing here. That reliable average might be worse than
> what you get with the non-randomized case. And that might be a fair
> deal for some workloads. You are, however, providing a functionality
> which is enabled by default without any actual numbers (well except for
> _a_java_ workload that seems to benefit) so you should really do your
> homework stop handwaving and give us some numbers and/or convincing
> arguments please.

The latest version of the patches no longer enables it by default. I'm
giving you the data I can give with respect to pre-production
hardware.

> > With the numa emulation patch I referenced an
> > administrator could constrain a workload to run in a cache-sized
> > subset of the available memory if they really know what they are doing
> > and need firmer guarantees.
>
> Then mention how and what you can achieve by that in the changelog.

The numa_emulation aspect is orthogonal to the randomization
implementation. It does not belong in the randomization changelog.

> > The risk if Linux does not have this capability is unstable hacks like
> > zonesort and rebooting, as referenced in that KNL article, which are
> > not suitable for a general purpose kernel / platform.
>
> We could have lived without those for quite some time so this doesn't
> seem to be anything super urgent to push through without a proper
> justification.

We lived without them previously because memory-side caches were
limited to niche hardware; now this is moving into general purpose
server platforms and the urgency / impact goes up accordingly.

> > > > > Many years back while at a university I was playing
> > > > > with page coloring as a method to reach a more stable performance
> > > > > results due to reduced cache conflicts. It was not always a performance
> > > > > gain but it definitely allowed for more stable run-to-run comparable
> > > > > results. I can imagine that a randomization might lead to a similar effect
> > > > > although I am not sure how much and it would be more interesting to hear
> > > > > about that effect.
> > > >
> > > > Cache coloring is effective up until your workload no longer fits in
> > > > that color.
> > >
> > > Yes, that was my observation back then more or less. But even when you
> > > do not fit into the cache a color aware strategy (I was playing with bin
> > > hoping as well) produced a more deterministic/stable results. But that
> > > is just a side note as it doesn't directly relate to your change.
> > >
> > > > Randomization helps to attenuate the cache conflict rate
> > > > when that happens.
> > >
> > > I can imagine that. Do we have any numbers to actually back that claim
> > > though?
> > >
> >
> > Yes, 2.5X cache conflict rate reduction, in the change log.
>
> Which is a single benchmark result which is not even described in detail
> to be able to reproduce that measurement. I am sorry for nagging
> here but I would expect something less obscure.

No need to apologize.

> How does this behave for
> usual workloads that we test cache sensitive workloads. I myself am not
> a benchmark person but I am pretty sure there are people who can help
> you to find proper ones to run and evaluate.

I wouldn't pick benchmarks that are cpu-cache sensitive since those
are a small number of MBs in size; a memory-side cache is on the order
of 10s of GBs.

>
> > > > For workloads that may fit in the cache, and/or
> > > > environments that need more explicit cache control we have the recent
> > > > changes to numa_emulation [1] to arrange for cache sized numa nodes.
> > >
> > > Could you point me to some more documentation. My google-fu is failing
> > > me and "5.2.27.5 Memory Side Cache Information Structure" doesn't point
> > > to anything official (except for your patch referencing it).
> >
> > http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf
>
> Thanks!
>
> [...]
>
> > > With all that being said, I think the overal idea makes sense but you
> > > should try much harder to explain _why_ we need it and back your
> > > justification by actual _data_ before I would consider my ack.
> >
> > I don't have a known CVE, I only have the ack of people more
> > knowledgeable about security than myself like Kees to say in effect,
> > "yes, this complicates attacks". If you won't take Kees' word for it,
> > I'm not sure what other justification I can present on the security
> > aspect.
>
> In general (nothing against Kees here of course), I prefer a stronger
> justification than "somebody said it will make attacks harder". At least
> my concern about fragmented memory which is not really hard to achieve
> at all should be reasonably clarified. I am fully aware there is no
> absolute measure here but making something harder under ideal conditions
> doesn't really help for common attack strategies which can prepare the
> system into an actual state to exploit allocation predictability. I am
> no expert here but if an attacker can deduce the allocation pattern then
> fragmenting the memory is one easy step to overcome what people would
> consider a security measure.
>
> So color me unconvinced for now.

Another way to attack heap randomization without fragmentation is to
just perform heap spraying and hope that it lands the data the attacker
needs in the right place. I still think that allocation entropy > 0 is a
positive benefit, but I don't know how to determine the curve of
security benefit relative to shuffle order.

> > 2.5X cache conflict reduction on a Java benchmark workload that the
> > exceeds the cache size by multiple factors is the data I can provide
> > today. Post launch it becomes easier to share more precise data, but
> > that's post 4.20. The hope of course is to have this capability
> > available in an upstream released kernel in advance of wider hardware
> > availability.
>
> I will not comment on timing but in general, any performance related
> changes should come with numbers for a wider variety of workloads.

That's fair.

> In any case, I believe the change itself is not controversial as long it
> is opt-in (potentially autotuned based on specific HW)

Do you mean disable shuffling on systems that don't have a
memory-side-cache unless / until we can devise a security benefit
curve relative to shuffle-order? The former I can do, the latter, I'm
at a loss.

> with a reasonable
> API. And no I do not consider $RANDOM_ORDER a good interface.

I think the current v4 proposal of a compile-time setting is reasonable
once we have consensus / guidance on the default shuffle order.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/3] Randomize free memory
  2018-10-11 18:03               ` Dan Williams
@ 2018-10-18 13:44                 ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2018-10-18 13:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Dave Hansen, Kees Cook, Linux MM,
	Linux Kernel Mailing List

On Thu 11-10-18 11:03:07, Dan Williams wrote:
> On Thu, Oct 11, 2018 at 4:56 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > In any case, I believe the change itself is not controversial as long it
> > is opt-in (potentially autotuned based on specific HW)
> 
> Do you mean disable shuffling on systems that don't have a
> memory-side-cache unless / until we can devise a security benefit
> curve relative to shuffle-order? The former I can do, the latter, I'm
> at a loss.

Yes, enable it when the HW requires that for whatever reason, and add a
global knob to enable it for those that might find it useful for
security reasons, with a clear cost/benefit description. Not "this is the
security thingy, enable it and feel safe(r)".
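
In sketch form, something like the following (every name here is
hypothetical; neither the detection helper nor the knob exists in the
series today):

/* Hypothetical policy sketch; arch_has_memory_side_cache() and the
 * shuffle_forced knob are placeholders, not real interfaces. */
static bool shuffle_forced;	/* set from a global opt-in knob */

static bool want_page_shuffle(void)
{
	/* HW with a direct-mapped memory-side cache wants it for performance */
	if (arch_has_memory_side_cache())
		return true;

	/* everyone else opts in explicitly, for the security aspect */
	return shuffle_forced;
}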
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-10-18 13:44 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-04  2:15 [PATCH v2 0/3] Randomize free memory Dan Williams
2018-10-04  2:15 ` [PATCH v2 1/3] mm: Shuffle initial " Dan Williams
2018-10-04  7:48   ` Michal Hocko
2018-10-04 16:51     ` Dan Williams
2018-10-09 11:12       ` Michal Hocko
2018-10-09 17:36         ` Dan Williams
2018-10-04  2:15 ` [PATCH v2 2/3] mm: Move buddy list manipulations into helpers Dan Williams
2018-10-04  2:15 ` [PATCH v2 3/3] mm: Maintain randomization of page free lists Dan Williams
2018-10-04  7:44 ` [PATCH v2 0/3] Randomize free memory Michal Hocko
2018-10-04 16:44   ` Dan Williams
2018-10-06 17:01     ` Dan Williams
2018-10-09 11:22     ` Michal Hocko
2018-10-09 17:34       ` Dan Williams
2018-10-10  8:47         ` Michal Hocko
2018-10-11  0:13           ` Dan Williams
2018-10-11 11:52             ` Michal Hocko
2018-10-11 18:03               ` Dan Williams
2018-10-18 13:44                 ` Michal Hocko
