* [PATCH v7 0/3] mm: Randomize free memory
@ 2019-01-07 23:21 Dan Williams
  2019-01-07 23:21 ` [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization Dan Williams
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Dan Williams @ 2019-01-07 23:21 UTC
  To: akpm
  Cc: Michal Hocko, Dave Hansen, Mike Rapoport, Kees Cook, mhocko,
	keith.busch, linux-mm, linux-kernel, mgorman

Changes since v6 [1]:
* Simplify the review, drop the autodetect patches from the series. That
  work simply results in a single call to page_alloc_shuffle(SHUFFLE_ENABLE)
  injected at the right location during ACPI NUMA initialization / parsing
  of the HMAT (Heterogeneous Memory Attributes Table). That is purely a
  follow-on consideration once the base shuffle implementation and
  definition of page_alloc_shuffle() is accepted. The end result for this
  series is that the command line parameter "page_alloc.shuffle" is
  required to enable the randomization.

* Fix declaration of page_alloc_shuffle() in the
  CONFIG_SHUFFLE_PAGE_ALLOCATOR=n case. (0day)

* Rebased on v5.0-rc1

[1]: https://lkml.org/lkml/2018/12/17/1116

---

Hi Andrew, please consider this series for -mm only after Michal and Mel
have had a chance to review and have their concerns addressed.

---

Quote Patch 1:

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [2].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [3], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.org/lkml/2017/8/23/195

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
"x86/numa_emulation: Introduce uniform split capability". With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

See patch 1 for more details.

[2]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[3]: https://lkml.org/lkml/2018/9/22/54

---

Dan Williams (3):
      mm: Shuffle initial free memory to improve memory-side-cache utilization
      mm: Move buddy list manipulations into helpers
      mm: Maintain randomization of page free lists


 include/linux/list.h     |   17 +++
 include/linux/mm.h       |    3 -
 include/linux/mm_types.h |    3 +
 include/linux/mmzone.h   |   65 +++++++++++++
 include/linux/shuffle.h  |   60 ++++++++++++
 init/Kconfig             |   36 +++++++
 mm/Makefile              |    7 +
 mm/compaction.c          |    4 -
 mm/memblock.c            |   10 ++
 mm/memory_hotplug.c      |    3 +
 mm/page_alloc.c          |   82 ++++++++--------
 mm/shuffle.c             |  231 ++++++++++++++++++++++++++++++++++++++++++++++
 12 files changed, 471 insertions(+), 50 deletions(-)
 create mode 100644 include/linux/shuffle.h
 create mode 100644 mm/shuffle.c


* [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2019-01-07 23:21 [PATCH v7 0/3] mm: Randomize free memory Dan Williams
@ 2019-01-07 23:21 ` Dan Williams
  2019-01-08  0:18   ` Kees Cook
                     ` (2 more replies)
  2019-01-07 23:21 ` [PATCH v7 2/3] mm: Move buddy list manipulations into helpers Dan Williams
       [not found] ` <154690328135.676627.5979130839159447106.stgit@dwillia2-desk3.amr.corp.intel.com>
  2 siblings, 3 replies; 16+ messages in thread
From: Dan Williams @ 2019-01-07 23:21 UTC
  To: akpm
  Cc: Michal Hocko, Kees Cook, Dave Hansen, Mike Rapoport, mhocko,
	keith.busch, linux-mm, linux-kernel, mgorman

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.org/lkml/2017/8/23/195

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
"x86/numa_emulation: Introduce uniform split capability". With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool saw a
3X speedup in a contrived case that tries to force cache conflicts. The
contrived case used the numa_emulation capability to force an instance
of the benchmark to be run in two of the near-memory sized numa nodes.
If both instances were placed on the same emulated node they would fit
and cause zero conflicts, while on separate emulated nodes without
randomization they underutilized the cache and conflicted unnecessarily
due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
size that exceeded cache size by 3X. The cache conflict rate was 8% for
the first run and degraded to 21% after page allocator aging. With
randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in
cache-conflict rates, but the overall throughput dropped by 7% with
randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
enabled on platforms without a memory-side-cache and saw a mix of some
improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
impact may be negligible to negative for other workloads. For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected. Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those
systems.

Outside of memory-side-cache utilization concerns there is a potential
security benefit from randomization. Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects. The kernel page allocator,
especially early in system boot, has predictable first-in-first-out
behavior for physical pages. Pages are freed in physical address order
when first onlined.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves the vast bulk of memory to be allocated predictably in
order. However, it should be noted that the concrete security benefits
are hard to quantify, and no known CVE is mitigated by this
randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
when they are initially populated with free memory at boot and at
hotplug time. Do this based on either the presence of a
page_alloc.shuffle=Y command line parameter, or autodetection of a
memory-side-cache (to be added in a follow-on patch).
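
For readers unfamiliar with the algorithm named above, here is a minimal
userspace sketch of a textbook Fisher-Yates shuffle over an array of pfns.
It is an illustration only, not the code in mm/shuffle.c: the patch swaps
entries on the 'free_area' linked lists and picks its swap partner from the
whole zone span on each step. The names fisher_yates() and is_permutation()
are hypothetical, and rand() merely stands in for get_random_long().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Textbook Fisher-Yates shuffle of an array of page frame numbers.
 * rand() is a stand-in for the kernel's random source; like the patch,
 * this sketch makes no attempt to correct for modulo bias.
 */
static void fisher_yates(unsigned long *pfns, size_t n)
{
	size_t i;

	for (i = n; i > 1; i--) {
		size_t j = (size_t)rand() % i;	/* 0 <= j < i */
		unsigned long tmp = pfns[i - 1];

		pfns[i - 1] = pfns[j];
		pfns[j] = tmp;
	}
}

/* Returns 1 if pfns[] holds each of 0..n-1 exactly once (n <= 64). */
static int is_permutation(const unsigned long *pfns, size_t n)
{
	unsigned long long seen = 0;
	size_t i;

	assert(n <= 64);
	for (i = 0; i < n; i++) {
		if (pfns[i] >= n || (seen & (1ULL << pfns[i])))
			return 0;
		seen |= 1ULL << pfns[i];
	}
	return 1;
}
```

Whatever the random source returns, the result is always a permutation of
the input, which is the property a free list shuffle must preserve: no page
is lost or duplicated, only the order changes.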

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
order 10, or 4MB. This trades off randomization granularity for time spent
shuffling.  MAX_ORDER-1 was chosen to be minimally invasive to the page
allocator while still showing memory-side cache behavior improvements,
and with the expectation that the security implications of finer granularity
randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM.
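
The granularity arithmetic can be checked with a small userspace sketch.
It assumes a 4KB base page size, which is what makes order 10 come out at
4MB; the helper names shuffle_block_bytes() and shuffle_block_count() are
hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes covered by one free block of the given buddy order. */
static size_t shuffle_block_bytes(size_t page_size, unsigned int order)
{
	assert(order < 8 * sizeof(size_t));
	return page_size << order;
}

/* Number of order-sized blocks that fit in a zone of zone_bytes. */
static size_t shuffle_block_count(size_t zone_bytes, size_t page_size,
				  unsigned int order)
{
	return zone_bytes / shuffle_block_bytes(page_size, order);
}
```

Under that assumption a 1GB zone holds 256 order-10 blocks but 262144
order-0 pages, so lowering the shuffle order multiplies the number of swap
candidates, and the time spent shuffling, by up to 1024.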

The performance impact of the shuffling appears to be in the noise
compared to other memory initialization work. Also the bulk of the work
is done in the background as a part of deferred_init_memmap().

This initial randomization can be undone over time so a follow-on patch
is introduced to inject entropy on page free decisions. It is reasonable
to ask if the page free entropy alone is sufficient, but it is not, due
to the in-order initial freeing of pages. At the start of that process
putting page1 in front of or behind page0 still keeps them close together,
and page2 is still near page1 and has a high chance of being adjacent. As
more pages are added ordering diversity improves, but there is still
high page locality for the low address pages and this leads to no
significant impact on the cache conflict rate.
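
The locality argument can be illustrated with a toy userspace model. The
sketch assumes that "entropy on page free decisions" means choosing at
random between the front and the back of the list for each freed page;
that reading, and every name below, is an assumption for illustration and
not taken from the follow-on patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_PAGES 32

/*
 * Free pages 0..NR_PAGES-1 in address order, placing each one at the
 * front or the back of the list at random. list[] gets the final order.
 */
static void free_in_order_random_ends(unsigned long list[NR_PAGES])
{
	unsigned long buf[2 * NR_PAGES];
	size_t head = NR_PAGES, tail = NR_PAGES;	/* empty list */
	unsigned long pfn;
	size_t i;

	for (pfn = 0; pfn < NR_PAGES; pfn++) {
		if (rand() & 1)
			buf[--head] = pfn;	/* add to front */
		else
			buf[tail++] = pfn;	/* add to back */
	}
	assert(tail - head == NR_PAGES);
	for (i = 0; i < NR_PAGES; i++)
		list[i] = buf[head + i];
}

/* Returns 1 if list[] strictly falls to its minimum, then strictly rises. */
static int is_valley(const unsigned long *list, size_t n)
{
	size_t i, min = 0;

	for (i = 1; i < n; i++)
		if (list[i] < list[min])
			min = i;
	for (i = 1; i <= min; i++)
		if (list[i - 1] <= list[i])
			return 0;
	for (i = min + 1; i < n; i++)
		if (list[i - 1] >= list[i])
			return 0;
	return 1;
}

/* Returns 1 if pages a and b sit next to each other in list[]. */
static int pages_adjacent(const unsigned long *list, size_t n,
			  unsigned long a, unsigned long b)
{
	size_t i;

	for (i = 1; i < n; i++)
		if ((list[i - 1] == a && list[i] == b) ||
		    (list[i - 1] == b && list[i] == a))
			return 1;
	return 0;
}
```

In this model the final list is always valley shaped, with page0 and page1
side by side at the bottom regardless of the random choices. The low
addresses stay clustered, which is why the series shuffles the initial
free lists rather than relying on free-time randomization alone.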

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.org/lkml/2018/9/22/54
[3]: https://lkml.org/lkml/2018/10/12/309

Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/list.h    |   17 ++++
 include/linux/mmzone.h  |    4 +
 include/linux/shuffle.h |   48 ++++++++++
 init/Kconfig            |   36 ++++++++
 mm/Makefile             |    7 +-
 mm/memblock.c           |   10 ++
 mm/memory_hotplug.c     |    3 +
 mm/page_alloc.c         |    3 +
 mm/shuffle.c            |  215 +++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 341 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/shuffle.h
 create mode 100644 mm/shuffle.c

diff --git a/include/linux/list.h b/include/linux/list.h
index edb7628e46ed..3dfb8953f241 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
 	INIT_LIST_HEAD(old);
 }
 
+/**
+ * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
+ * @entry1: the location to place entry2
+ * @entry2: the location to place entry1
+ */
+static inline void list_swap(struct list_head *entry1,
+			     struct list_head *entry2)
+{
+	struct list_head *pos = entry2->prev;
+
+	list_del(entry2);
+	list_replace(entry1, entry2);
+	if (pos == entry1)
+		pos = entry2;
+	list_add(entry1, pos);
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cc4a507d7ca4..8c37a023a790 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1272,6 +1272,10 @@ void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
 #define sparse_index_init(_sec, _nid)  do {} while (0)
+static inline int pfn_present(unsigned long pfn)
+{
+	return 1;
+}
 #endif /* CONFIG_SPARSEMEM */
 
 /*
diff --git a/include/linux/shuffle.h b/include/linux/shuffle.h
new file mode 100644
index 000000000000..d109161f4a62
--- /dev/null
+++ b/include/linux/shuffle.h
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+#ifndef _MM_SHUFFLE_H
+#define _MM_SHUFFLE_H
+#include <linux/jump_label.h>
+
+enum mm_shuffle_ctl {
+	SHUFFLE_ENABLE,
+	SHUFFLE_FORCE_DISABLE,
+};
+#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
+DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
+extern void __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+		unsigned long end_pfn);
+static inline void shuffle_free_memory(pg_data_t *pgdat,
+		unsigned long start_pfn, unsigned long end_pfn)
+{
+	if (!static_branch_unlikely(&page_alloc_shuffle_key))
+		return;
+	__shuffle_free_memory(pgdat, start_pfn, end_pfn);
+}
+
+extern void __shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn);
+static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	if (!static_branch_unlikely(&page_alloc_shuffle_key))
+		return;
+	__shuffle_zone(z, start_pfn, end_pfn);
+}
+#else
+static inline void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+}
+
+static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+}
+
+static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
+{
+}
+#endif
+#endif /* _MM_SHUFFLE_H */
diff --git a/init/Kconfig b/init/Kconfig
index d47cb77a220e..db7758476e7a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1714,6 +1714,42 @@ config SLAB_FREELIST_HARDENED
 	  sacrifies to harden the kernel slab allocator against common
 	  freelist exploit methods.
 
+config SHUFFLE_PAGE_ALLOCATOR
+	bool "Page allocator randomization"
+	depends on ACPI_NUMA
+	default SLAB_FREELIST_RANDOM
+	help
+	  Randomization of the page allocator improves the average
+	  utilization of a direct-mapped memory-side-cache. See section
+	  5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
+	  6.2a specification for an example of how a platform advertises
+	  the presence of a memory-side-cache. There are also incidental
+	  security benefits as it reduces the predictability of page
+	  allocations to complement SLAB_FREELIST_RANDOM, but the
+	  default granularity of shuffling on 4MB (MAX_ORDER-1) pages is
+	  selected based on cache utilization benefits.
+
+	  While the randomization improves cache utilization it may
+	  negatively impact workloads on platforms without a cache. For
+	  this reason, by default, the randomization is enabled only
+	  after runtime detection of a direct-mapped memory-side-cache.
+	  Otherwise, the randomization may be force enabled with the
+	  'page_alloc.shuffle' kernel command line parameter.
+
+	  Say Y if unsure.
+
+config SHUFFLE_PAGE_ORDER
+	depends on SHUFFLE_PAGE_ALLOCATOR
+	int "Page allocator shuffle order"
+	range 0 10
+	default 10
+	help
+	  Specify the granularity at which shuffling (randomization) is
+	  performed. By default this is set to MAX_ORDER-1 to minimize
+	  runtime impact of randomization and with the expectation that
+	  SLAB_FREELIST_RANDOM mitigates heap attacks on smaller
+	  object granularities.
+
 config SLUB_CPU_PARTIAL
 	default y
 	depends on SLUB && SMP
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..ac5e5ba78874 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@ mmu-$(CONFIG_MMU)	+= process_vm_access.o
 endif
 
 obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
-			   maccess.o page_alloc.o page-writeback.o \
+			   maccess.o page-writeback.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
@@ -41,6 +41,11 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
+# Give 'page_alloc' its own module-parameter namespace
+page-alloc-y := page_alloc.o
+page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
+
+obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 022d4cbb3618..3602f7a2eab4 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -17,6 +17,7 @@
 #include <linux/poison.h>
 #include <linux/pfn.h>
 #include <linux/debugfs.h>
+#include <linux/shuffle.h>
 #include <linux/kmemleak.h>
 #include <linux/seq_file.h>
 #include <linux/memblock.h>
@@ -1929,9 +1930,16 @@ static unsigned long __init free_low_memory_core_early(void)
 	 *  low ram will be on Node1
 	 */
 	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
-				NULL)
+				NULL) {
+		pg_data_t *pgdat;
+
 		count += __free_memory_core(start, end);
 
+		for_each_online_pgdat(pgdat)
+			shuffle_free_memory(pgdat, PHYS_PFN(start),
+					PHYS_PFN(end));
+	}
+
 	return count;
 }
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b9a667d36c55..7caffb9a91ab 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -23,6 +23,7 @@
 #include <linux/highmem.h>
 #include <linux/vmalloc.h>
 #include <linux/ioport.h>
+#include <linux/shuffle.h>
 #include <linux/delay.h>
 #include <linux/migrate.h>
 #include <linux/page-isolation.h>
@@ -895,6 +896,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	zone->zone_pgdat->node_present_pages += onlined_pages;
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
+	shuffle_zone(zone, pfn, zone_end_pfn(zone));
+
 	if (onlined_pages) {
 		node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cde5dac6229a..2adcd6da8a07 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -61,6 +61,7 @@
 #include <linux/sched/rt.h>
 #include <linux/sched/mm.h>
 #include <linux/page_owner.h>
+#include <linux/shuffle.h>
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
 #include <linux/ftrace.h>
@@ -1634,6 +1635,8 @@ static int __init deferred_init_memmap(void *data)
 	}
 	pgdat_resize_unlock(pgdat, &flags);
 
+	shuffle_zone(zone, first_init_pfn, zone_end_pfn(zone));
+
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
diff --git a/mm/shuffle.c b/mm/shuffle.c
new file mode 100644
index 000000000000..07961ff41a03
--- /dev/null
+++ b/mm/shuffle.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/mmzone.h>
+#include <linux/random.h>
+#include <linux/shuffle.h>
+#include <linux/moduleparam.h>
+#include "internal.h"
+
+DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+static unsigned long shuffle_state;
+
+/*
+ * Depending on the architecture, module parameter parsing may run
+ * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
+ * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
+ * attempts to turn on the implementation, but aborts if it finds
+ * SHUFFLE_FORCE_DISABLE already set.
+ */
+void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
+{
+	if (ctl == SHUFFLE_FORCE_DISABLE)
+		set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
+
+	if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
+		if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
+			static_branch_disable(&page_alloc_shuffle_key);
+	} else if (ctl == SHUFFLE_ENABLE
+			&& !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
+		static_branch_enable(&page_alloc_shuffle_key);
+}
+
+static bool shuffle_param;
+static int shuffle_show(char *buffer, const struct kernel_param *kp)
+{
+	return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
+			? 'Y' : 'N');
+}
+static int shuffle_store(const char *val, const struct kernel_param *kp)
+{
+	int rc = param_set_bool(val, kp);
+
+	if (rc < 0)
+		return rc;
+	if (shuffle_param)
+		page_alloc_shuffle(SHUFFLE_ENABLE);
+	else
+		page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
+	return 0;
+}
+module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
+
+/*
+ * For two pages to be swapped in the shuffle, they must be free (on a
+ * 'free_area' lru), have the same order, and have the same migratetype.
+ */
+static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
+{
+	struct page *page;
+
+	/*
+	 * Given we're dealing with randomly selected pfns in a zone we
+	 * need to ask questions like...
+	 */
+
+	/* ...is the pfn even in the memmap? */
+	if (!pfn_valid_within(pfn))
+		return NULL;
+
+	/* ...is the pfn in a present section or a hole? */
+	if (!pfn_present(pfn))
+		return NULL;
+
+	/* ...is the page free and currently on a free_area list? */
+	page = pfn_to_page(pfn);
+	if (!PageBuddy(page))
+		return NULL;
+
+	/*
+	 * ...is the page on the same list as the page we will
+	 * shuffle it with?
+	 */
+	if (page_order(page) != order)
+		return NULL;
+
+	return page;
+}
+
+/*
+ * Fisher-Yates shuffle the freelist which prescribes iterating through
+ * an array, pfns in this case, and randomly swapping each entry with
+ * another in the span, end_pfn - start_pfn.
+ *
+ * To keep the implementation simple it does not attempt to correct for
+ * sources of bias in the distribution, like modulo bias or
+ * pseudo-random number generator bias. I.e. the expectation is that
+ * this shuffling raises the bar for attacks that exploit the
+ * predictability of page allocations, but need not be a perfect
+ * shuffle.
+ *
+ * Note that we don't use @z->zone_start_pfn and zone_end_pfn(@z)
+ * directly since the caller may be aware of holes in the zone and can
+ * improve the accuracy of the random pfn selection.
+ */
+#define SHUFFLE_RETRY 10
+static void __meminit shuffle_zone_order(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn, const int order)
+{
+	unsigned long i, flags;
+	const int order_pages = 1 << order;
+
+	if (start_pfn < z->zone_start_pfn)
+		start_pfn = z->zone_start_pfn;
+	if (end_pfn > zone_end_pfn(z))
+		end_pfn = zone_end_pfn(z);
+
+	/* probably means that start/end were outside the zone */
+	if (end_pfn <= start_pfn)
+		return;
+	spin_lock_irqsave(&z->lock, flags);
+	start_pfn = ALIGN(start_pfn, order_pages);
+	for (i = start_pfn; i < end_pfn; i += order_pages) {
+		unsigned long j;
+		int migratetype, retry;
+		struct page *page_i, *page_j;
+
+		/*
+		 * We expect page_i, in the sub-range of a zone being
+		 * added (@start_pfn to @end_pfn), to more likely be
+		 * valid compared to page_j randomly selected in the
+		 * span @zone_start_pfn to @spanned_pages.
+		 */
+		page_i = shuffle_valid_page(i, order);
+		if (!page_i)
+			continue;
+
+		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
+			/*
+			 * Pick a random order aligned page from the
+			 * start of the zone. Use the *whole* zone here
+			 * so that if it is freed in tiny pieces that we
+			 * randomize in the whole zone, not just within
+			 * those fragments.
+			 *
+			 * Since page_j comes from a potentially sparse
+			 * address range we want to try a bit harder to
+			 * find a shuffle point for page_i.
+			 */
+			j = z->zone_start_pfn +
+				ALIGN_DOWN(get_random_long() % z->spanned_pages,
+						order_pages);
+			page_j = shuffle_valid_page(j, order);
+			if (page_j && page_j != page_i)
+				break;
+		}
+		if (retry >= SHUFFLE_RETRY) {
+			pr_debug("%s: failed to swap %#lx\n", __func__, i);
+			continue;
+		}
+
+		/*
+		 * Each migratetype corresponds to its own list, make
+		 * sure the types match otherwise we're moving pages to
+		 * lists where they do not belong.
+		 */
+		migratetype = get_pageblock_migratetype(page_i);
+		if (get_pageblock_migratetype(page_j) != migratetype) {
+			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
+			continue;
+		}
+
+		list_swap(&page_i->lru, &page_j->lru);
+
+		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
+
+		/* take it easy on the zone lock */
+		if ((i % (100 * order_pages)) == 0) {
+			spin_unlock_irqrestore(&z->lock, flags);
+			cond_resched();
+			spin_lock_irqsave(&z->lock, flags);
+		}
+	}
+	spin_unlock_irqrestore(&z->lock, flags);
+}
+
+void __meminit __shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	int i;
+
+	/* shuffle all the orders at the specified order and higher */
+	for (i = CONFIG_SHUFFLE_PAGE_ORDER; i < MAX_ORDER; i++)
+		shuffle_zone_order(z, start_pfn, end_pfn, i);
+}
+
+/**
+ * shuffle_free_memory - reduce the predictability of the page allocator
+ * @pgdat: node page data
+ * @start_pfn: Limit the shuffle to the greater of this value or zone start
+ * @end_pfn: Limit the shuffle to the lesser of this value or zone end
+ *
+ * While shuffle_zone() attempts to avoid holes with pfn_valid() and
+ * pfn_present() they can not report sub-section sized holes. @start_pfn
+ * and @end_pfn limit the shuffle to the exact memory pages being freed.
+ */
+void __meminit __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	struct zone *z;
+
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		shuffle_zone(z, start_pfn, end_pfn);
+}


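The list_swap() helper added to include/linux/list.h in the patch above has
to cope with the two entries being neighbours, which is what the
"pos == entry1" check is for. The following userspace sketch re-creates
just enough of the list API to exercise that case. The struct and helper
bodies are simplified stand-ins written for this illustration (no pointer
poisoning, no debug checks), and the assert() on identical entries is an
added assumption, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert @node right after @head. */
static void list_add(struct list_head *node, struct list_head *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Unlink @entry; its own pointers are left stale. */
static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Put @node where @old was; @old's pointers are left stale. */
static void list_replace(struct list_head *old, struct list_head *node)
{
	node->next = old->next;
	node->next->prev = node;
	node->prev = old->prev;
	node->prev->next = node;
}

/* Same steps as the list_swap() in the patch. */
static void list_swap(struct list_head *entry1, struct list_head *entry2)
{
	struct list_head *pos = entry2->prev;

	assert(entry1 != entry2);	/* swapping an entry with itself is not handled */
	list_del(entry2);
	list_replace(entry1, entry2);
	if (pos == entry1)	/* entry1 was directly before entry2 */
		pos = entry2;
	list_add(entry1, pos);
}
```

Without the adjacency check, list_add() would re-insert entry1 after
itself, since entry1 is no longer on the list once list_replace() has run.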

* [PATCH v7 2/3] mm: Move buddy list manipulations into helpers
  2019-01-07 23:21 [PATCH v7 0/3] mm: Randomize free memory Dan Williams
  2019-01-07 23:21 ` [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization Dan Williams
@ 2019-01-07 23:21 ` Dan Williams
  2019-01-25 14:30   ` Michal Hocko
       [not found] ` <154690328135.676627.5979130839159447106.stgit@dwillia2-desk3.amr.corp.intel.com>
  2 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2019-01-07 23:21 UTC
  To: akpm
  Cc: Michal Hocko, Dave Hansen, mhocko, keith.busch, linux-mm,
	linux-kernel, mgorman

In preparation for runtime randomization of the zone lists, take all
(well, most of) the list_*() functions in the buddy allocator and put
them in helper functions. Provide a common control point for injecting
additional behavior when freeing pages.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h       |    3 --
 include/linux/mm_types.h |    3 ++
 include/linux/mmzone.h   |   51 ++++++++++++++++++++++++++++++++++
 mm/compaction.c          |    4 +--
 mm/page_alloc.c          |   70 ++++++++++++++++++----------------------------
 5 files changed, 84 insertions(+), 47 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..1621acd10f83 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -500,9 +500,6 @@ static inline void vma_set_anonymous(struct vm_area_struct *vma)
 struct mmu_gather;
 struct inode;
 
-#define page_private(page)		((page)->private)
-#define set_page_private(page, v)	((page)->private = (v))
-
 #if !defined(__HAVE_ARCH_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline int pmd_devmap(pmd_t pmd)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..1c7dc7ffa288 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -214,6 +214,9 @@ struct page {
 #define PAGE_FRAG_CACHE_MAX_SIZE	__ALIGN_MASK(32768, ~PAGE_MASK)
 #define PAGE_FRAG_CACHE_MAX_ORDER	get_order(PAGE_FRAG_CACHE_MAX_SIZE)
 
+#define page_private(page)		((page)->private)
+#define set_page_private(page, v)	((page)->private = (v))
+
 struct page_frag_cache {
 	void * va;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8c37a023a790..b78a45e0b11c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -18,6 +18,8 @@
 #include <linux/pageblock-flags.h>
 #include <linux/page-flags-layout.h>
 #include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/page-flags.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -98,6 +100,55 @@ struct free_area {
 	unsigned long		nr_free;
 };
 
+/* Used for pages not on another list */
+static inline void add_to_free_area(struct page *page, struct free_area *area,
+			     int migratetype)
+{
+	list_add(&page->lru, &area->free_list[migratetype]);
+	area->nr_free++;
+}
+
+/* Used for pages not on another list */
+static inline void add_to_free_area_tail(struct page *page, struct free_area *area,
+				  int migratetype)
+{
+	list_add_tail(&page->lru, &area->free_list[migratetype]);
+	area->nr_free++;
+}
+
+/* Used for pages which are on another list */
+static inline void move_to_free_area(struct page *page, struct free_area *area,
+			     int migratetype)
+{
+	list_move(&page->lru, &area->free_list[migratetype]);
+}
+
+static inline struct page *get_page_from_free_area(struct free_area *area,
+					    int migratetype)
+{
+	return list_first_entry_or_null(&area->free_list[migratetype],
+					struct page, lru);
+}
+
+static inline void rmv_page_order(struct page *page)
+{
+	__ClearPageBuddy(page);
+	set_page_private(page, 0);
+}
+
+static inline void del_page_from_free_area(struct page *page,
+		struct free_area *area, int migratetype)
+{
+	list_del(&page->lru);
+	rmv_page_order(page);
+	area->nr_free--;
+}
+
+static inline bool free_area_empty(struct free_area *area, int migratetype)
+{
+	return list_empty(&area->free_list[migratetype]);
+}
+
 struct pglist_data;
 
 /*
diff --git a/mm/compaction.c b/mm/compaction.c
index ef29490b0f46..a22ac7ab65c5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1359,13 +1359,13 @@ static enum compact_result __compact_finished(struct zone *zone,
 		bool can_steal;
 
 		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&area->free_list[migratetype]))
+		if (!free_area_empty(area, migratetype))
 			return COMPACT_SUCCESS;
 
 #ifdef CONFIG_CMA
 		/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
 		if (migratetype == MIGRATE_MOVABLE &&
-			!list_empty(&area->free_list[MIGRATE_CMA]))
+			!free_area_empty(area, MIGRATE_CMA))
 			return COMPACT_SUCCESS;
 #endif
 		/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2adcd6da8a07..0b4791a2dd43 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -743,12 +743,6 @@ static inline void set_page_order(struct page *page, unsigned int order)
 	__SetPageBuddy(page);
 }
 
-static inline void rmv_page_order(struct page *page)
-{
-	__ClearPageBuddy(page);
-	set_page_private(page, 0);
-}
-
 /*
  * This function checks whether a page is free && is the buddy
  * we can coalesce a page and its buddy if
@@ -849,13 +843,11 @@ static inline void __free_one_page(struct page *page,
 		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
 		 * merge with it and move up one order.
 		 */
-		if (page_is_guard(buddy)) {
+		if (page_is_guard(buddy))
 			clear_page_guard(zone, buddy, order, migratetype);
-		} else {
-			list_del(&buddy->lru);
-			zone->free_area[order].nr_free--;
-			rmv_page_order(buddy);
-		}
+		else
+			del_page_from_free_area(buddy, &zone->free_area[order],
+					migratetype);
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -905,15 +897,13 @@ static inline void __free_one_page(struct page *page,
 		higher_buddy = higher_page + (buddy_pfn - combined_pfn);
 		if (pfn_valid_within(buddy_pfn) &&
 		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype]);
-			goto out;
+			add_to_free_area_tail(page, &zone->free_area[order],
+					      migratetype);
+			return;
 		}
 	}
 
-	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
-out:
-	zone->free_area[order].nr_free++;
+	add_to_free_area(page, &zone->free_area[order], migratetype);
 }
 
 /*
@@ -1852,7 +1842,7 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
-		list_add(&page[size].lru, &area->free_list[migratetype]);
+		add_to_free_area(&page[size], area, migratetype);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -1994,13 +1984,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
 		area = &(zone->free_area[current_order]);
-		page = list_first_entry_or_null(&area->free_list[migratetype],
-							struct page, lru);
+		page = get_page_from_free_area(area, migratetype);
 		if (!page)
 			continue;
-		list_del(&page->lru);
-		rmv_page_order(page);
-		area->nr_free--;
+		del_page_from_free_area(page, area, migratetype);
 		expand(zone, page, order, current_order, area, migratetype);
 		set_pcppage_migratetype(page, migratetype);
 		return page;
@@ -2086,8 +2073,7 @@ static int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
-		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype]);
+		move_to_free_area(page, &zone->free_area[order], migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -2263,7 +2249,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 single_page:
 	area = &zone->free_area[current_order];
-	list_move(&page->lru, &area->free_list[start_type]);
+	move_to_free_area(page, area, start_type);
 }
 
 /*
@@ -2287,7 +2273,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 		if (fallback_mt == MIGRATE_TYPES)
 			break;
 
-		if (list_empty(&area->free_list[fallback_mt]))
+		if (free_area_empty(area, fallback_mt))
 			continue;
 
 		if (can_steal_fallback(order, migratetype))
@@ -2374,9 +2360,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 		for (order = 0; order < MAX_ORDER; order++) {
 			struct free_area *area = &(zone->free_area[order]);
 
-			page = list_first_entry_or_null(
-					&area->free_list[MIGRATE_HIGHATOMIC],
-					struct page, lru);
+			page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
 			if (!page)
 				continue;
 
@@ -2499,8 +2483,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 	VM_BUG_ON(current_order == MAX_ORDER);
 
 do_steal:
-	page = list_first_entry(&area->free_list[fallback_mt],
-							struct page, lru);
+	page = get_page_from_free_area(area, fallback_mt);
 
 	steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
 								can_steal);
@@ -2937,6 +2920,7 @@ EXPORT_SYMBOL_GPL(split_page);
 
 int __isolate_free_page(struct page *page, unsigned int order)
 {
+	struct free_area *area = &page_zone(page)->free_area[order];
 	unsigned long watermark;
 	struct zone *zone;
 	int mt;
@@ -2961,9 +2945,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
-	list_del(&page->lru);
-	zone->free_area[order].nr_free--;
-	rmv_page_order(page);
+
+	del_page_from_free_area(page, area, mt);
 
 	/*
 	 * Set the pageblock if the isolated page is at least half of a
@@ -3265,13 +3248,13 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			continue;
 
 		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
-			if (!list_empty(&area->free_list[mt]))
+			if (!free_area_empty(area, mt))
 				return true;
 		}
 
 #ifdef CONFIG_CMA
 		if ((alloc_flags & ALLOC_CMA) &&
-		    !list_empty(&area->free_list[MIGRATE_CMA])) {
+		    !free_area_empty(area, MIGRATE_CMA)) {
 			return true;
 		}
 #endif
@@ -5173,7 +5156,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 
 			types[order] = 0;
 			for (type = 0; type < MIGRATE_TYPES; type++) {
-				if (!list_empty(&area->free_list[type]))
+				if (!free_area_empty(area, type))
 					types[order] |= 1 << type;
 			}
 		}
@@ -8318,6 +8301,9 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	spin_lock_irqsave(&zone->lock, flags);
 	pfn = start_pfn;
 	while (pfn < end_pfn) {
+		struct free_area *area;
+		int mt;
+
 		if (!pfn_valid(pfn)) {
 			pfn++;
 			continue;
@@ -8336,13 +8322,13 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		BUG_ON(page_count(page));
 		BUG_ON(!PageBuddy(page));
 		order = page_order(page);
+		area = &zone->free_area[order];
 #ifdef CONFIG_DEBUG_VM
 		pr_info("remove from free list %lx %d %lx\n",
 			pfn, 1 << order, end_pfn);
 #endif
-		list_del(&page->lru);
-		rmv_page_order(page);
-		zone->free_area[order].nr_free--;
+		mt = get_pageblock_migratetype(page);
+		del_page_from_free_area(page, area, mt);
 		for (i = 0; i < (1 << order); i++)
 			SetPageReserved((page+i));
 		pfn += (1 << order);
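
The helpers introduced in mmzone.h above pair every list insertion or removal with the matching nr_free update. A minimal userspace sketch of that accounting follows. The names page_model, free_area_model, area_init and NR_MIGRATETYPES are stand-ins invented for illustration, the list handling is open-coded rather than using the kernel's list.h, and the PageBuddy / page_private handling done by rmv_page_order() is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_MIGRATETYPES 3	/* hypothetical; the kernel defines more types */

struct list_head { struct list_head *next, *prev; };

struct page_model {		/* stand-in for struct page */
	struct list_head lru;
};

struct free_area_model {	/* stand-in for struct free_area */
	struct list_head free_list[NR_MIGRATETYPES];
	unsigned long nr_free;
};

static void area_init(struct free_area_model *area)
{
	for (int mt = 0; mt < NR_MIGRATETYPES; mt++) {
		area->free_list[mt].next = &area->free_list[mt];
		area->free_list[mt].prev = &area->free_list[mt];
	}
	area->nr_free = 0;
}

/* Link the page at the head of its list and account for it in one place. */
static void add_to_free_area(struct page_model *page,
			     struct free_area_model *area, int mt)
{
	struct list_head *head = &area->free_list[mt];

	page->lru.next = head->next;
	page->lru.prev = head;
	head->next->prev = &page->lru;
	head->next = &page->lru;
	area->nr_free++;
}

/* Unlink the page and drop the count; the list it was on is implicit. */
static void del_page_from_free_area(struct page_model *page,
				    struct free_area_model *area)
{
	assert(area->nr_free > 0);
	page->lru.prev->next = page->lru.next;
	page->lru.next->prev = page->lru.prev;
	area->nr_free--;
}

static bool free_area_empty(struct free_area_model *area, int mt)
{
	return area->free_list[mt].next == &area->free_list[mt];
}
```

Because add_to_free_area() increments nr_free itself, a call site converted to the helper should not keep its own area->nr_free++ next to the call, otherwise the page is counted twice.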



* Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2019-01-07 23:21 ` [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization Dan Williams
@ 2019-01-08  0:18   ` Kees Cook
  2019-01-08  1:48     ` Dan Williams
  2019-01-10 10:56   ` Mel Gorman
  2019-01-25 14:20   ` Michal Hocko
  2 siblings, 1 reply; 16+ messages in thread
From: Kees Cook @ 2019-01-08  0:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Michal Hocko, Dave Hansen, Mike Rapoport,
	Keith Busch, Linux-MM, LKML, Mel Gorman

On Mon, Jan 7, 2019 at 3:33 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Randomization of the page allocator improves the average utilization of
> a direct-mapped memory-side-cache. Memory side caching is a platform
> capability that Linux has been previously exposed to in HPC
> (high-performance computing) environments on specialty platforms. In
> that instance it was a smaller pool of high-bandwidth-memory relative to
> higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> be found on general purpose server platforms where DRAM is a cache in
> front of higher latency persistent memory [1].
>
> Robert offered an explanation of the state of the art of Linux
> interactions with memory-side-caches [2], and I copy it here:
>
>     It's been a problem in the HPC space:
>     http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
>
>     A kernel module called zonesort is available to try to help:
>     https://software.intel.com/en-us/articles/xeon-phi-software
>
>     and this abandoned patch series proposed that for the kernel:
>     https://lkml.org/lkml/2017/8/23/195
>
>     Dan's patch series doesn't attempt to ensure buffers won't conflict, but
>     also reduces the chance that the buffers will. This will make performance
>     more consistent, albeit slower than "optimal" (which is near impossible
>     to attain in a general-purpose kernel).  That's better than forcing
>     users to deploy remedies like:
>         "To eliminate this gradual degradation, we have added a Stream
>          measurement to the Node Health Check that follows each job;
>          nodes are rebooted whenever their measured memory bandwidth
>          falls below 300 GB/s."
>
> A replacement for zonesort was merged upstream in commit cc9aec03e58f
> "x86/numa_emulation: Introduce uniform split capability". With this
> numa_emulation capability, memory can be split into cache sized
> ("near-memory" sized) numa nodes. A bind operation to such a node, and
> disabling workloads on other nodes, enables full cache performance.
> However, once the workload exceeds the cache size then cache conflicts
> are unavoidable. While HPC environments might be able to tolerate
> time-scheduling of cache sized workloads, for general purpose server
> platforms, the oversubscribed cache case will be the common case.
>
> The worst case scenario is that a server system owner benchmarks a
> workload at boot with an un-contended cache only to see that performance
> degrade over time, even below the average cache performance due to
> excessive conflicts. Randomization clips the peaks and fills in the
> valleys of cache utilization to yield steady average performance.
>
> Here are some performance impact details of the patches:
>
> 1/ An Intel internal synthetic memory bandwidth measurement tool saw a
> 3X speedup in a contrived case that tries to force cache conflicts. The
> contrived case used the numa_emulation capability to force an instance
> of the benchmark to be run in two of the near-memory sized numa nodes.
> If both instances were placed on the same emulated node they would fit
> and cause zero conflicts.  While on separate emulated nodes without
> randomization they underutilized the cache and conflicted unnecessarily
> due to the in-order allocation per node.
>
> 2/ A well known Java server application benchmark was run with a heap
> size that exceeded cache size by 3X. The cache conflict rate was 8% for
> the first run and degraded to 21% after page allocator aging. With
> randomization enabled the rate levelled out at 11%.
>
> 3/ A MongoDB workload did not observe a measurable difference in
> cache-conflict rates, but the overall throughput dropped by 7% with
> randomization in one case.
>
> 4/ Mel Gorman ran his suite of performance workloads with randomization
> enabled on platforms without a memory-side-cache and saw a mix of some
> improvements and some losses [3].
>
> While there is potentially significant improvement for applications that
> depend on low latency access across a wide working-set, the performance
> impact may be negligible to negative for other workloads. For this reason the
> shuffle capability defaults to off unless a direct-mapped
> memory-side-cache is detected. Even then, the page_alloc.shuffle=0
> parameter can be specified to disable the randomization on those
> systems.
>
> Outside of memory-side-cache utilization concerns there is a potential
> security benefit from randomization. Some data exfiltration and
> return-oriented-programming attacks rely on the ability to infer the
> location of sensitive data objects. The kernel page allocator,
> especially early in system boot, has predictable first-in-first-out
> behavior for physical pages. Pages are freed in physical address order
> when first onlined.
>
> Quoting Kees:
>     "While we already have a base-address randomization
>      (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
>      memory layouts would certainly be using the predictability of
>      allocation ordering (i.e. for attacks where the base address isn't
>      important: only the relative positions between allocated memory).
>      This is common in lots of heap-style attacks. They try to gain
>      control over ordering by spraying allocations, etc.
>
>      I'd really like to see this because it gives us something similar
>      to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
>
> While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
> caches it leaves the vast bulk of memory to be predictably allocated in
> order.  However, it should be noted, the concrete security benefits
> are hard to quantify, and no known CVE is mitigated by this
> randomization.
>
> Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> when they are initially populated with free memory at boot and at
> hotplug time. Do this based on either the presence of a
> page_alloc.shuffle=Y command line parameter, or autodetection of a
> memory-side-cache (to be added in a follow-on patch).
>
> The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
> pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
> 10 (4MB). This trades off randomization granularity for time spent
> shuffling.  MAX_ORDER-1 was chosen to be minimally invasive to the page
> allocator while still showing memory-side cache behavior improvements,
> and with the expectation that the security implications of finer
> granularity randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM.
>
> The performance impact of the shuffling appears to be in the noise
> compared to other memory initialization work. Also the bulk of the work
> is done in the background as a part of deferred_init_memmap().
>
> This initial randomization can be undone over time so a follow-on patch
> is introduced to inject entropy on page free decisions. It is reasonable
> to ask if the page free entropy is sufficient, but it is not enough due
> to the in-order initial freeing of pages. At the start of that process
> putting page1 in front or behind page0 still keeps them close together,
> page2 is still near page1 and has a high chance of being adjacent. As
> more pages are added ordering diversity improves, but there is still
> high page locality for the low address pages and this leads to no
> significant impact to the cache conflict rate.
>
> [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> [2]: https://lkml.org/lkml/2018/9/22/54
> [3]: https://lkml.org/lkml/2018/10/12/309
>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Kees Cook <keescook@chromium.org>

Reviewed-by: Kees Cook <keescook@chromium.org>

With some comments below...

> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/list.h    |   17 ++++
>  include/linux/mmzone.h  |    4 +
>  include/linux/shuffle.h |   48 ++++++++++
>  init/Kconfig            |   36 ++++++++
>  mm/Makefile             |    7 +-
>  mm/memblock.c           |   10 ++
>  mm/memory_hotplug.c     |    3 +
>  mm/page_alloc.c         |    3 +
>  mm/shuffle.c            |  215 +++++++++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 341 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/shuffle.h
>  create mode 100644 mm/shuffle.c
>
> diff --git a/include/linux/list.h b/include/linux/list.h
> index edb7628e46ed..3dfb8953f241 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
>         INIT_LIST_HEAD(old);
>  }
>
> +/**
> + * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
> + * @entry1: the location to place entry2
> + * @entry2: the location to place entry1
> + */
> +static inline void list_swap(struct list_head *entry1,
> +                            struct list_head *entry2)
> +{
> +       struct list_head *pos = entry2->prev;
> +
> +       list_del(entry2);
> +       list_replace(entry1, entry2);
> +       if (pos == entry1)
> +               pos = entry2;
> +       list_add(entry1, pos);
> +}
> +
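
The pos == entry1 check in list_swap() above covers the case where entry1 sits directly in front of entry2. A small userspace sketch of the same sequence makes that easier to trace. The list primitives here are simplified re-implementations written for illustration (no pointer poisoning, no debug checks), not the kernel's list.h.

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert @entry right after @pos. */
static void list_add(struct list_head *entry, struct list_head *pos)
{
	entry->next = pos->next;
	entry->prev = pos;
	pos->next->prev = entry;
	pos->next = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Put @repl where @old is linked; @old keeps stale pointers afterwards. */
static void list_replace(struct list_head *old, struct list_head *repl)
{
	repl->next = old->next;
	repl->next->prev = repl;
	repl->prev = old->prev;
	repl->prev->next = repl;
}

static void list_swap(struct list_head *entry1, struct list_head *entry2)
{
	struct list_head *pos = entry2->prev;	/* where entry1 must end up */

	assert(entry1 != entry2);
	list_del(entry2);
	list_replace(entry1, entry2);
	/*
	 * If entry1 was entry2's predecessor, 'pos' now names a node that
	 * is no longer linked; entry2 has taken its place.
	 */
	if (pos == entry1)
		pos = entry2;
	list_add(entry1, pos);
}
```

With a list head, a, b, c, calling list_swap(&a, &c) yields head, c, b, a; swapping two neighbours in either argument order also just exchanges their positions.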
>  /**
>   * list_del_init - deletes entry from list and reinitialize it.
>   * @entry: the element to delete from the list.
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cc4a507d7ca4..8c37a023a790 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1272,6 +1272,10 @@ void sparse_init(void);
>  #else
>  #define sparse_init()  do {} while (0)
>  #define sparse_index_init(_sec, _nid)  do {} while (0)
> +static inline int pfn_present(unsigned long pfn)
> +{
> +       return 1;
> +}
>  #endif /* CONFIG_SPARSEMEM */
>
>  /*
> diff --git a/include/linux/shuffle.h b/include/linux/shuffle.h
> new file mode 100644
> index 000000000000..d109161f4a62
> --- /dev/null
> +++ b/include/linux/shuffle.h
> @@ -0,0 +1,48 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> +#ifndef _MM_SHUFFLE_H
> +#define _MM_SHUFFLE_H
> +#include <linux/jump_label.h>
> +
> +enum mm_shuffle_ctl {
> +       SHUFFLE_ENABLE,
> +       SHUFFLE_FORCE_DISABLE,
> +};
> +#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
> +DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
> +extern void __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
> +               unsigned long end_pfn);
> +static inline void shuffle_free_memory(pg_data_t *pgdat,
> +               unsigned long start_pfn, unsigned long end_pfn)
> +{
> +       if (!static_branch_unlikely(&page_alloc_shuffle_key))
> +               return;
> +       __shuffle_free_memory(pgdat, start_pfn, end_pfn);
> +}
> +
> +extern void __shuffle_zone(struct zone *z, unsigned long start_pfn,
> +               unsigned long end_pfn);
> +static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
> +               unsigned long end_pfn)
> +{
> +       if (!static_branch_unlikely(&page_alloc_shuffle_key))
> +               return;
> +       __shuffle_zone(z, start_pfn, end_pfn);
> +}
> +#else
> +static inline void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
> +               unsigned long end_pfn)
> +{
> +}
> +
> +static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
> +               unsigned long end_pfn)
> +{
> +}
> +
> +static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
> +{
> +}
> +#endif
> +#endif /* _MM_SHUFFLE_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index d47cb77a220e..db7758476e7a 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1714,6 +1714,42 @@ config SLAB_FREELIST_HARDENED
>           sacrifies to harden the kernel slab allocator against common
>           freelist exploit methods.
>
> +config SHUFFLE_PAGE_ALLOCATOR
> +       bool "Page allocator randomization"
> +       depends on ACPI_NUMA

Why does this need ACPI_NUMA? (e.g. why can't I use this on a non-ACPI
arm64 system?)

> +       default SLAB_FREELIST_RANDOM
> +       help
> +         Randomization of the page allocator improves the average
> +         utilization of a direct-mapped memory-side-cache. See section
> +         5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
> +         6.2a specification for an example of how a platform advertises
> +         the presence of a memory-side-cache. There are also incidental
> +         security benefits as it reduces the predictability of page
> +         allocations to complement SLAB_FREELIST_RANDOM, but the
> +         default granularity of shuffling on 4MB (MAX_ORDER-1) pages is
> +         selected based on cache utilization benefits.
> +
> +         While the randomization improves cache utilization it may
> +         negatively impact workloads on platforms without a cache. For
> +         this reason, by default, the randomization is enabled only
> +         after runtime detection of a direct-mapped memory-side-cache.
> +         Otherwise, the randomization may be force enabled with the
> +         'page_alloc.shuffle' kernel command line parameter.
> +
> +         Say Y if unsure.
> +
> +config SHUFFLE_PAGE_ORDER
> +       depends on SHUFFLE_PAGE_ALLOCATOR
> +       int "Page allocator shuffle order"
> +       range 0 10
> +       default 10
> +       help
> +         Specify the granularity at which shuffling (randomization) is
> +         performed. By default this is set to MAX_ORDER-1 to minimize
> +         runtime impact of randomization and with the expectation that
> +         SLAB_FREELIST_RANDOM mitigates heap attacks on smaller
> +         object granularities.
> +
>  config SLUB_CPU_PARTIAL
>         default y
>         depends on SLUB && SMP
> diff --git a/mm/Makefile b/mm/Makefile
> index d210cc9d6f80..ac5e5ba78874 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,7 +33,7 @@ mmu-$(CONFIG_MMU)     += process_vm_access.o
>  endif
>
>  obj-y                  := filemap.o mempool.o oom_kill.o fadvise.o \
> -                          maccess.o page_alloc.o page-writeback.o \
> +                          maccess.o page-writeback.o \
>                            readahead.o swap.o truncate.o vmscan.o shmem.o \
>                            util.o mmzone.o vmstat.o backing-dev.o \
>                            mm_init.o mmu_context.o percpu.o slab_common.o \
> @@ -41,6 +41,11 @@ obj-y                        := filemap.o mempool.o oom_kill.o fadvise.o \
>                            interval_tree.o list_lru.o workingset.o \
>                            debug.o $(mmu-y)
>
> +# Give 'page_alloc' its own module-parameter namespace
> +page-alloc-y := page_alloc.o
> +page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> +
> +obj-y += page-alloc.o

I'll get over it, but having both page-alloc.o and page_alloc.o hurts me. :)

>  obj-y += init-mm.o
>  obj-y += memblock.o
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 022d4cbb3618..3602f7a2eab4 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -17,6 +17,7 @@
>  #include <linux/poison.h>
>  #include <linux/pfn.h>
>  #include <linux/debugfs.h>
> +#include <linux/shuffle.h>
>  #include <linux/kmemleak.h>
>  #include <linux/seq_file.h>
>  #include <linux/memblock.h>
> @@ -1929,9 +1930,16 @@ static unsigned long __init free_low_memory_core_early(void)
>          *  low ram will be on Node1
>          */
>         for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
> -                               NULL)
> +                               NULL) {
> +               pg_data_t *pgdat;
> +
>                 count += __free_memory_core(start, end);
>
> +               for_each_online_pgdat(pgdat)
> +                       shuffle_free_memory(pgdat, PHYS_PFN(start),
> +                                       PHYS_PFN(end));
> +       }
> +
>         return count;
>  }
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b9a667d36c55..7caffb9a91ab 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -23,6 +23,7 @@
>  #include <linux/highmem.h>
>  #include <linux/vmalloc.h>
>  #include <linux/ioport.h>
> +#include <linux/shuffle.h>
>  #include <linux/delay.h>
>  #include <linux/migrate.h>
>  #include <linux/page-isolation.h>
> @@ -895,6 +896,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>         zone->zone_pgdat->node_present_pages += onlined_pages;
>         pgdat_resize_unlock(zone->zone_pgdat, &flags);
>
> +       shuffle_zone(zone, pfn, zone_end_pfn(zone));
> +
>         if (onlined_pages) {
>                 node_states_set_node(nid, &arg);
>                 if (need_zonelists_rebuild)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cde5dac6229a..2adcd6da8a07 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -61,6 +61,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/sched/mm.h>
>  #include <linux/page_owner.h>
> +#include <linux/shuffle.h>
>  #include <linux/kthread.h>
>  #include <linux/memcontrol.h>
>  #include <linux/ftrace.h>
> @@ -1634,6 +1635,8 @@ static int __init deferred_init_memmap(void *data)
>         }
>         pgdat_resize_unlock(pgdat, &flags);
>
> +       shuffle_zone(zone, first_init_pfn, zone_end_pfn(zone));
> +
>         /* Sanity check that the next zone really is unpopulated */
>         WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> new file mode 100644
> index 000000000000..07961ff41a03
> --- /dev/null
> +++ b/mm/shuffle.c
> @@ -0,0 +1,215 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> +
> +#include <linux/mm.h>
> +#include <linux/init.h>
> +#include <linux/mmzone.h>
> +#include <linux/random.h>
> +#include <linux/shuffle.h>
> +#include <linux/moduleparam.h>
> +#include "internal.h"
> +
> +DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +static unsigned long shuffle_state;
> +
> +/*
> + * Depending on the architecture, module parameter parsing may run
> + * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
> + * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
> + * attempts to turn on the implementation, but aborts if it finds
> + * SHUFFLE_FORCE_DISABLE already set.
> + */
> +void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
> +{
> +       if (ctl == SHUFFLE_FORCE_DISABLE)
> +               set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
> +
> +       if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
> +               if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
> +                       static_branch_disable(&page_alloc_shuffle_key);
> +       } else if (ctl == SHUFFLE_ENABLE
> +                       && !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
> +               static_branch_enable(&page_alloc_shuffle_key);
> +}
> +
> +static bool shuffle_param;
> +static int shuffle_show(char *buffer, const struct kernel_param *kp)
> +{
> +       return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
> +                       ? 'Y' : 'N');
> +}
> +static int shuffle_store(const char *val, const struct kernel_param *kp)
> +{
> +       int rc = param_set_bool(val, kp);
> +
> +       if (rc < 0)
> +               return rc;
> +       if (shuffle_param)
> +               page_alloc_shuffle(SHUFFLE_ENABLE);
> +       else
> +               page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
> +       return 0;
> +}
> +module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);

If this is 0400, you don't intend it to be changed after boot. If it's
supposed to be immutable, why not make these __init calls?
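
For reference, the ordering behaviour described in the comment above page_alloc_shuffle() can be modelled in userspace. This is only a sketch: the struct shuffle_model and the plain bool standing in for the static key are invented for illustration, and the atomic bit operations are replaced by ordinary bit arithmetic.

```c
#include <assert.h>
#include <stdbool.h>

enum mm_shuffle_ctl {
	SHUFFLE_ENABLE,
	SHUFFLE_FORCE_DISABLE,
};

struct shuffle_model {		/* hypothetical container for the state */
	unsigned long state;	/* models shuffle_state */
	bool key;		/* models page_alloc_shuffle_key */
};

static void model_page_alloc_shuffle(struct shuffle_model *m,
				     enum mm_shuffle_ctl ctl)
{
	const unsigned long enable = 1UL << SHUFFLE_ENABLE;
	const unsigned long force_off = 1UL << SHUFFLE_FORCE_DISABLE;

	assert(ctl == SHUFFLE_ENABLE || ctl == SHUFFLE_FORCE_DISABLE);

	if (ctl == SHUFFLE_FORCE_DISABLE)
		m->state |= force_off;

	if (m->state & force_off) {
		/* A force-disable reverts an earlier enable... */
		if (m->state & enable) {
			m->state &= ~enable;
			m->key = false;
		}
	} else if (ctl == SHUFFLE_ENABLE && !(m->state & enable)) {
		/* ...and an enable only sticks if nothing forced it off. */
		m->state |= enable;
		m->key = true;
	}
}
```

Whichever of the two requests arrives first, the key ends up off once SHUFFLE_FORCE_DISABLE has been seen, which is what lets the parameter parsing and a later autodetect run in either order.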

> +
> +/*
> + * For two pages to be swapped in the shuffle, they must be free (on a
> + * 'free_area' lru), have the same order, and have the same migratetype.
> + */
> +static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
> +{
> +       struct page *page;
> +
> +       /*
> +        * Given we're dealing with randomly selected pfns in a zone we
> +        * need to ask questions like...
> +        */
> +
> +       /* ...is the pfn even in the memmap? */
> +       if (!pfn_valid_within(pfn))
> +               return NULL;
> +
> +       /* ...is the pfn in a present section or a hole? */
> +       if (!pfn_present(pfn))
> +               return NULL;
> +
> +       /* ...is the page free and currently on a free_area list? */
> +       page = pfn_to_page(pfn);
> +       if (!PageBuddy(page))
> +               return NULL;
> +
> +       /*
> +        * ...is the page on the same list as the page we will
> +        * shuffle it with?
> +        */
> +       if (page_order(page) != order)
> +               return NULL;
> +
> +       return page;
> +}
> +
> +/*
> + * Fisher-Yates shuffle the freelist which prescribes iterating through
> + * an array, pfns in this case, and randomly swapping each entry with
> + * another in the span, end_pfn - start_pfn.
> + *
> + * To keep the implementation simple it does not attempt to correct for
> + * sources of bias in the distribution, like modulo bias or
> + * pseudo-random number generator bias. I.e. the expectation is that
> + * this shuffling raises the bar for attacks that exploit the
> + * predictability of page allocations, but need not be a perfect
> + * shuffle.
> + *
> + * Note that we don't use @z->zone_start_pfn and zone_end_pfn(@z)
> + * directly since the caller may be aware of holes in the zone and can
> + * improve the accuracy of the random pfn selection.
> + */
> +#define SHUFFLE_RETRY 10
> +static void __meminit shuffle_zone_order(struct zone *z, unsigned long start_pfn,
> +               unsigned long end_pfn, const int order)
> +{
> +       unsigned long i, flags;
> +       const int order_pages = 1 << order;
> +
> +       if (start_pfn < z->zone_start_pfn)
> +               start_pfn = z->zone_start_pfn;
> +       if (end_pfn > zone_end_pfn(z))
> +               end_pfn = zone_end_pfn(z);
> +
> +       /* probably means that start/end were outside the zone */
> +       if (end_pfn <= start_pfn)
> +               return;
> +       spin_lock_irqsave(&z->lock, flags);
> +       start_pfn = ALIGN(start_pfn, order_pages);
> +       for (i = start_pfn; i < end_pfn; i += order_pages) {
> +               unsigned long j;
> +               int migratetype, retry;
> +               struct page *page_i, *page_j;
> +
> +               /*
> +                * We expect page_i, in the sub-range of a zone being
> +                * added (@start_pfn to @end_pfn), to more likely be
> +                * valid compared to page_j randomly selected in the
> +                * span @zone_start_pfn to @spanned_pages.
> +                */
> +               page_i = shuffle_valid_page(i, order);
> +               if (!page_i)
> +                       continue;
> +
> +               for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
> +                       /*
> +                        * Pick a random order aligned page from the
> +                        * start of the zone. Use the *whole* zone here
> +                        * so that if it is freed in tiny pieces that we
> +                        * randomize in the whole zone, not just within
> +                        * those fragments.
> +                        *
> +                        * Since page_j comes from a potentially sparse
> +                        * address range we want to try a bit harder to
> +                        * find a shuffle point for page_i.
> +                        */
> +                       j = z->zone_start_pfn +
> +                               ALIGN_DOWN(get_random_long() % z->spanned_pages,
> +                                               order_pages);

How late in the boot process does this happen, btw? Do we get warnings
from the RNG about early usage?

> +                       page_j = shuffle_valid_page(j, order);
> +                       if (page_j && page_j != page_i)
> +                               break;
> +               }
> +               if (retry >= SHUFFLE_RETRY) {
> +                       pr_debug("%s: failed to swap %#lx\n", __func__, i);
> +                       continue;
> +               }
> +
> +               /*
> +                * Each migratetype corresponds to its own list, make
> +                * sure the types match otherwise we're moving pages to
> +                * lists where they do not belong.
> +                */
> +               migratetype = get_pageblock_migratetype(page_i);
> +               if (get_pageblock_migratetype(page_j) != migratetype) {
> +                       pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
> +                       continue;
> +               }
> +
> +               list_swap(&page_i->lru, &page_j->lru);
> +
> +               pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
> +
> +               /* take it easy on the zone lock */
> +               if ((i % (100 * order_pages)) == 0) {
> +                       spin_unlock_irqrestore(&z->lock, flags);
> +                       cond_resched();
> +                       spin_lock_irqsave(&z->lock, flags);
> +               }
> +       }
> +       spin_unlock_irqrestore(&z->lock, flags);
> +}
> +
> +void __meminit __shuffle_zone(struct zone *z, unsigned long start_pfn,
> +               unsigned long end_pfn)
> +{
> +       int i;
> +
> +       /* shuffle all the orders at the specified order and higher */
> +       for (i = CONFIG_SHUFFLE_PAGE_ORDER; i < MAX_ORDER; i++)
> +               shuffle_zone_order(z, start_pfn, end_pfn, i);
> +}
> +
> +/**
> + * shuffle_free_memory - reduce the predictability of the page allocator
> + * @pgdat: node page data
> + * @start_pfn: Limit the shuffle to the greater of this value or zone start
> + * @end_pfn: Limit the shuffle to the lesser of this value or zone end
> + *
> + * While shuffle_zone() attempts to avoid holes with pfn_valid() and
> + * pfn_present() they can not report sub-section sized holes. @start_pfn
> + * and @end_pfn limit the shuffle to the exact memory pages being freed.
> + */
> +void __meminit __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
> +               unsigned long end_pfn)
> +{
> +       struct zone *z;
> +
> +       for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> +               shuffle_zone(z, start_pfn, end_pfn);
> +}
>


-- 
Kees Cook


* Re: [PATCH v7 3/3] mm: Maintain randomization of page free lists
       [not found] ` <154690328135.676627.5979130839159447106.stgit@dwillia2-desk3.amr.corp.intel.com>
@ 2019-01-08  0:19   ` Kees Cook
  2019-01-25 14:32   ` Michal Hocko
  1 sibling, 0 replies; 16+ messages in thread
From: Kees Cook @ 2019-01-08  0:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Michal Hocko, Dave Hansen, Keith Busch, Linux-MM,
	LKML, Mel Gorman

On Mon, Jan 7, 2019 at 3:34 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> When freeing a page with an order >= shuffle_page_order randomly select
> the front or back of the list for insertion.
>
> While the mm tries to defragment physical pages into huge pages this can
> tend to make the page allocator more predictable over time. Inject the
> front-back randomness to preserve the initial randomness established by
> shuffle_free_memory() when the kernel was booted.
>
> The overhead of this manipulation is constrained by only being applied
> for MAX_ORDER-1 sized pages by default.
>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  include/linux/mmzone.h  |   10 ++++++++++
>  include/linux/shuffle.h |   12 ++++++++++++
>  mm/page_alloc.c         |   11 +++++++++--
>  mm/shuffle.c            |   16 ++++++++++++++++
>  4 files changed, 47 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b78a45e0b11c..c15f7f703be0 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -98,6 +98,8 @@ extern int page_group_by_mobility_disabled;
>  struct free_area {
>         struct list_head        free_list[MIGRATE_TYPES];
>         unsigned long           nr_free;
> +       u64                     rand;
> +       u8                      rand_bits;
>  };
>
>  /* Used for pages not on another list */
> @@ -116,6 +118,14 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar
>         area->nr_free++;
>  }
>
> +#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
> +/* Used to preserve page allocation order entropy */
> +void add_to_free_area_random(struct page *page, struct free_area *area,
> +               int migratetype);
> +#else
> +#define add_to_free_area_random add_to_free_area
> +#endif
> +
>  /* Used for pages which are on another list */
>  static inline void move_to_free_area(struct page *page, struct free_area *area,
>                              int migratetype)
> diff --git a/include/linux/shuffle.h b/include/linux/shuffle.h
> index d109161f4a62..85b7f5f32867 100644
> --- a/include/linux/shuffle.h
> +++ b/include/linux/shuffle.h
> @@ -30,6 +30,13 @@ static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
>                 return;
>         __shuffle_zone(z, start_pfn, end_pfn);
>  }
> +
> +static inline bool is_shuffle_order(int order)
> +{
> +       if (!static_branch_unlikely(&page_alloc_shuffle_key))
> +                return false;
> +       return order >= CONFIG_SHUFFLE_PAGE_ORDER;
> +}
>  #else
>  static inline void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
>                 unsigned long end_pfn)
> @@ -44,5 +51,10 @@ static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
>  static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
>  {
>  }
> +
> +static inline bool is_shuffle_order(int order)
> +{
> +       return false;
> +}
>  #endif
>  #endif /* _MM_SHUFFLE_H */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0b4791a2dd43..f3a859b66d70 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -43,6 +43,7 @@
>  #include <linux/mempolicy.h>
>  #include <linux/memremap.h>
>  #include <linux/stop_machine.h>
> +#include <linux/random.h>
>  #include <linux/sort.h>
>  #include <linux/pfn.h>
>  #include <linux/backing-dev.h>
> @@ -889,7 +890,8 @@ static inline void __free_one_page(struct page *page,
>          * so it's less likely to be used soon and more likely to be merged
>          * as a higher order page
>          */
> -       if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) {
> +       if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)
> +                       && !is_shuffle_order(order)) {
>                 struct page *higher_page, *higher_buddy;
>                 combined_pfn = buddy_pfn & pfn;
>                 higher_page = page + (combined_pfn - pfn);
> @@ -903,7 +905,12 @@ static inline void __free_one_page(struct page *page,
>                 }
>         }
>
> -       add_to_free_area(page, &zone->free_area[order], migratetype);
> +       if (is_shuffle_order(order))
> +               add_to_free_area_random(page, &zone->free_area[order],
> +                               migratetype);
> +       else
> +               add_to_free_area(page, &zone->free_area[order], migratetype);
> +
>  }
>
>  /*
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index 07961ff41a03..4cadf51c9b40 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -213,3 +213,19 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
>         for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
>                 shuffle_zone(z, start_pfn, end_pfn);
>  }
> +
> +void add_to_free_area_random(struct page *page, struct free_area *area,
> +               int migratetype)
> +{
> +       if (area->rand_bits == 0) {
> +               area->rand_bits = 64;
> +               area->rand = get_random_u64();
> +       }
> +
> +       if (area->rand & 1)
> +               add_to_free_area(page, area, migratetype);
> +       else
> +               add_to_free_area_tail(page, area, migratetype);
> +       area->rand_bits--;
> +       area->rand >>= 1;
> +}
>
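
The bit-at-a-time consumption in add_to_free_area_random() above can be
modelled in userspace. This is a minimal sketch under stated assumptions,
not the kernel code: struct rand_pool, pool_next_bit() and the toy
xorshift64 generator standing in for get_random_u64() are hypothetical
names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the rand/rand_bits pair added to struct free_area. */
struct rand_pool {
	uint64_t rand;       /* cached random bits, consumed from the low end */
	uint8_t  rand_bits;  /* number of unconsumed bits left in @rand */
	uint64_t seed;       /* state of the toy generator below (assumption) */
	unsigned refills;    /* how many times the pool has been refilled */
};

/* Toy xorshift64 generator; stands in for get_random_u64() and is NOT secure. */
static uint64_t toy_random_u64(uint64_t *state)
{
	uint64_t x = *state;

	assert(x != 0); /* xorshift must never be seeded with zero */
	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/* Return true for "insert at the head", false for "insert at the tail". */
static bool pool_next_bit(struct rand_pool *p)
{
	bool head;

	/* Refill only when all 64 cached bits have been used up. */
	if (p->rand_bits == 0) {
		p->rand_bits = 64;
		p->rand = toy_random_u64(&p->seed);
		p->refills++;
	}

	head = p->rand & 1;
	p->rand_bits--;
	p->rand >>= 1;
	return head;
}
```

In this model one refill covers 64 free operations, so the random number
generator is consulted once per 64 head-or-tail decisions rather than on
every free.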


-- 
Kees Cook


* Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2019-01-08  0:18   ` Kees Cook
@ 2019-01-08  1:48     ` Dan Williams
  2019-01-08 23:24       ` Kees Cook
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2019-01-08  1:48 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Michal Hocko, Dave Hansen, Mike Rapoport,
	Keith Busch, Linux-MM, LKML, Mel Gorman

On Mon, Jan 7, 2019 at 4:19 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Mon, Jan 7, 2019 at 3:33 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Randomization of the page allocator improves the average utilization of
> > a direct-mapped memory-side-cache. Memory side caching is a platform
> > capability that Linux has been previously exposed to in HPC
> > (high-performance computing) environments on specialty platforms. In
> > that instance it was a smaller pool of high-bandwidth-memory relative to
> > higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> > be found on general purpose server platforms where DRAM is a cache in
> > front of higher latency persistent memory [1].
> >
> > Robert offered an explanation of the state of the art of Linux
> > interactions with memory-side-caches [2], and I copy it here:
> >
> >     It's been a problem in the HPC space:
> >     http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
> >
> >     A kernel module called zonesort is available to try to help:
> >     https://software.intel.com/en-us/articles/xeon-phi-software
> >
> >     and this abandoned patch series proposed that for the kernel:
> >     https://lkml.org/lkml/2017/8/23/195
> >
> >     Dan's patch series doesn't attempt to ensure buffers won't conflict, but
> >     also reduces the chance that the buffers will. This will make performance
> >     more consistent, albeit slower than "optimal" (which is near impossible
> >     to attain in a general-purpose kernel).  That's better than forcing
> >     users to deploy remedies like:
> >         "To eliminate this gradual degradation, we have added a Stream
> >          measurement to the Node Health Check that follows each job;
> >          nodes are rebooted whenever their measured memory bandwidth
> >          falls below 300 GB/s."
> >
> > A replacement for zonesort was merged upstream in commit cc9aec03e58f
> > "x86/numa_emulation: Introduce uniform split capability". With this
> > numa_emulation capability, memory can be split into cache sized
> > ("near-memory" sized) numa nodes. A bind operation to such a node, and
> > disabling workloads on other nodes, enables full cache performance.
> > However, once the workload exceeds the cache size then cache conflicts
> > are unavoidable. While HPC environments might be able to tolerate
> > time-scheduling of cache sized workloads, for general purpose server
> > platforms, the oversubscribed cache case will be the common case.
> >
> > The worst case scenario is that a server system owner benchmarks a
> > workload at boot with an un-contended cache only to see that performance
> > degrade over time, even below the average cache performance due to
> > excessive conflicts. Randomization clips the peaks and fills in the
> > valleys of cache utilization to yield steady average performance.
> >
> > Here are some performance impact details of the patches:
> >
> > > 1/ An Intel internal synthetic memory bandwidth measurement tool saw a
> > > 3X speedup in a contrived case that tries to force cache conflicts. The
> > > contrived case used the numa_emulation capability to force an instance
> > > of the benchmark to be run in two of the near-memory sized numa nodes.
> > > If both instances were placed on the same emulated node they would fit
> > > and cause zero conflicts.  While on separate emulated nodes without
> > randomization they underutilized the cache and conflicted unnecessarily
> > due to the in-order allocation per node.
> >
> > 2/ A well known Java server application benchmark was run with a heap
> > size that exceeded cache size by 3X. The cache conflict rate was 8% for
> > the first run and degraded to 21% after page allocator aging. With
> > randomization enabled the rate levelled out at 11%.
> >
> > > 3/ A MongoDB workload did not observe a measurable difference in
> > cache-conflict rates, but the overall throughput dropped by 7% with
> > randomization in one case.
> >
> > 4/ Mel Gorman ran his suite of performance workloads with randomization
> > enabled on platforms without a memory-side-cache and saw a mix of some
> > improvements and some losses [3].
> >
> > While there is potentially significant improvement for applications that
> > depend on low latency access across a wide working-set, the performance
> > may be negligible to negative for other workloads. For this reason the
> > shuffle capability defaults to off unless a direct-mapped
> > memory-side-cache is detected. Even then, the page_alloc.shuffle=0
> > parameter can be specified to disable the randomization on those
> > systems.
> >
> > > Outside of memory-side-cache utilization concerns there is a potential
> > > security benefit from randomization. Some data exfiltration and
> > > return-oriented-programming attacks rely on the ability to infer the
> > > location of sensitive data objects. The kernel page allocator,
> > > especially early in system boot, has predictable first-in-first-out
> > behavior for physical pages. Pages are freed in physical address order
> > when first onlined.
> >
> > Quoting Kees:
> >     "While we already have a base-address randomization
> >      (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
> >      memory layouts would certainly be using the predictability of
> >      allocation ordering (i.e. for attacks where the base address isn't
> >      important: only the relative positions between allocated memory).
> >      This is common in lots of heap-style attacks. They try to gain
> >      control over ordering by spraying allocations, etc.
> >
> >      I'd really like to see this because it gives us something similar
> >      to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
> >
> > > While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
> > > caches it leaves the vast bulk of memory to be predictably allocated in
> > > order.  However, it should be noted, the concrete security benefits
> > are hard to quantify, and no known CVE is mitigated by this
> > randomization.
> >
> > Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> > perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> > when they are initially populated with free memory at boot and at
> > hotplug time. Do this based on either the presence of a
> > page_alloc.shuffle=Y command line parameter, or autodetection of a
> > memory-side-cache (to be added in a follow-on patch).
> >
> > > The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
> > > pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
> > > 10 (4MB). This trades off randomization granularity for time spent
> > > shuffling.  MAX_ORDER-1 was chosen to be minimally invasive to the page
> > > allocator while still showing memory-side cache behavior improvements,
> > > and with the expectation that the security implications of finer
> > > granularity randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM.
> >
> > The performance impact of the shuffling appears to be in the noise
> > compared to other memory initialization work. Also the bulk of the work
> > is done in the background as a part of deferred_init_memmap().
> >
> > This initial randomization can be undone over time so a follow-on patch
> > > is introduced to inject entropy on page free decisions. It is reasonable
> > > to ask if the page free entropy is sufficient on its own, but it is not,
> > > due to the in-order initial freeing of pages. At the start of that process
> > putting page1 in front or behind page0 still keeps them close together,
> > page2 is still near page1 and has a high chance of being adjacent. As
> > more pages are added ordering diversity improves, but there is still
> > high page locality for the low address pages and this leads to no
> > significant impact to the cache conflict rate.
> >
> > [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> > [2]: https://lkml.org/lkml/2018/9/22/54
> > [3]: https://lkml.org/lkml/2018/10/12/309
> >
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Kees Cook <keescook@chromium.org>
>
> Reviewed-by: Kees Cook <keescook@chromium.org>

Thanks.

> With some comments below...
[..]
> > diff --git a/init/Kconfig b/init/Kconfig
> > index d47cb77a220e..db7758476e7a 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1714,6 +1714,42 @@ config SLAB_FREELIST_HARDENED
> >           sacrifies to harden the kernel slab allocator against common
> >           freelist exploit methods.
> >
> > +config SHUFFLE_PAGE_ALLOCATOR
> > +       bool "Page allocator randomization"
> > +       depends on ACPI_NUMA
>
> Why does this need ACPI_NUMA? (e.g. why can't I use this on a non-ACPI
> arm64 system?)

I was thinking this would be expanded for each platform-type that will
implement the auto-detect capability. However, there really is no
direct dependency and if you wanted to just use the command line
switch that should be allowed on any platform.

I'll delete this dependency for v8, but I'll hold off on that posting
awaiting feedback from mm folks.

>
> > +       default SLAB_FREELIST_RANDOM
> > +       help
> > +         Randomization of the page allocator improves the average
> > +         utilization of a direct-mapped memory-side-cache. See section
> > +         5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
> > +         6.2a specification for an example of how a platform advertises
> > +         the presence of a memory-side-cache. There are also incidental
> > +         security benefits as it reduces the predictability of page
> > > +         allocations to complement SLAB_FREELIST_RANDOM, but the
> > > +         default granularity of shuffling on 4MB (MAX_ORDER-1) pages is
> > +         selected based on cache utilization benefits.
> > +
> > +         While the randomization improves cache utilization it may
> > +         negatively impact workloads on platforms without a cache. For
> > +         this reason, by default, the randomization is enabled only
> > +         after runtime detection of a direct-mapped memory-side-cache.
> > +         Otherwise, the randomization may be force enabled with the
> > +         'page_alloc.shuffle' kernel command line parameter.
> > +
> > +         Say Y if unsure.
> > +
> > +config SHUFFLE_PAGE_ORDER
> > +       depends on SHUFFLE_PAGE_ALLOCATOR
> > +       int "Page allocator shuffle order"
> > +       range 0 10
> > +       default 10
> > +       help
> > +         Specify the granularity at which shuffling (randomization) is
> > +         performed. By default this is set to MAX_ORDER-1 to minimize
> > +         runtime impact of randomization and with the expectation that
> > +         SLAB_FREELIST_RANDOM mitigates heap attacks on smaller
> > +         object granularities.
> > +
> >  config SLUB_CPU_PARTIAL
> >         default y
> >         depends on SLUB && SMP
> > diff --git a/mm/Makefile b/mm/Makefile
> > index d210cc9d6f80..ac5e5ba78874 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -33,7 +33,7 @@ mmu-$(CONFIG_MMU)     += process_vm_access.o
> >  endif
> >
> >  obj-y                  := filemap.o mempool.o oom_kill.o fadvise.o \
> > -                          maccess.o page_alloc.o page-writeback.o \
> > +                          maccess.o page-writeback.o \
> >                            readahead.o swap.o truncate.o vmscan.o shmem.o \
> >                            util.o mmzone.o vmstat.o backing-dev.o \
> >                            mm_init.o mmu_context.o percpu.o slab_common.o \
> > @@ -41,6 +41,11 @@ obj-y                        := filemap.o mempool.o oom_kill.o fadvise.o \
> >                            interval_tree.o list_lru.o workingset.o \
> >                            debug.o $(mmu-y)
> >
> > +# Give 'page_alloc' its own module-parameter namespace
> > +page-alloc-y := page_alloc.o
> > +page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> > +
> > +obj-y += page-alloc.o
>
> I'll get over it, but having both page-alloc.o and page_alloc.o hurts me. :)

It's a cheeky hack, if it doesn't survive review I won't lose any
sleep over it. I was just tempted by the siren call of the
infrastructure built-up around module_param_call().

>
> >  obj-y += init-mm.o
> >  obj-y += memblock.o
> >
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 022d4cbb3618..3602f7a2eab4 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/poison.h>
> >  #include <linux/pfn.h>
> >  #include <linux/debugfs.h>
> > +#include <linux/shuffle.h>
> >  #include <linux/kmemleak.h>
> >  #include <linux/seq_file.h>
> >  #include <linux/memblock.h>
> > @@ -1929,9 +1930,16 @@ static unsigned long __init free_low_memory_core_early(void)
> >          *  low ram will be on Node1
> >          */
> >         for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
> > -                               NULL)
> > +                               NULL) {
> > +               pg_data_t *pgdat;
> > +
> >                 count += __free_memory_core(start, end);
> >
> > +               for_each_online_pgdat(pgdat)
> > +                       shuffle_free_memory(pgdat, PHYS_PFN(start),
> > +                                       PHYS_PFN(end));
> > +       }
> > +
> >         return count;
> >  }
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index b9a667d36c55..7caffb9a91ab 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -23,6 +23,7 @@
> >  #include <linux/highmem.h>
> >  #include <linux/vmalloc.h>
> >  #include <linux/ioport.h>
> > +#include <linux/shuffle.h>
> >  #include <linux/delay.h>
> >  #include <linux/migrate.h>
> >  #include <linux/page-isolation.h>
> > @@ -895,6 +896,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
> >         zone->zone_pgdat->node_present_pages += onlined_pages;
> >         pgdat_resize_unlock(zone->zone_pgdat, &flags);
> >
> > +       shuffle_zone(zone, pfn, zone_end_pfn(zone));
> > +
> >         if (onlined_pages) {
> >                 node_states_set_node(nid, &arg);
> >                 if (need_zonelists_rebuild)
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index cde5dac6229a..2adcd6da8a07 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -61,6 +61,7 @@
> >  #include <linux/sched/rt.h>
> >  #include <linux/sched/mm.h>
> >  #include <linux/page_owner.h>
> > +#include <linux/shuffle.h>
> >  #include <linux/kthread.h>
> >  #include <linux/memcontrol.h>
> >  #include <linux/ftrace.h>
> > @@ -1634,6 +1635,8 @@ static int __init deferred_init_memmap(void *data)
> >         }
> >         pgdat_resize_unlock(pgdat, &flags);
> >
> > +       shuffle_zone(zone, first_init_pfn, zone_end_pfn(zone));
> > +
> >         /* Sanity check that the next zone really is unpopulated */
> >         WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
> >
> > diff --git a/mm/shuffle.c b/mm/shuffle.c
> > new file mode 100644
> > index 000000000000..07961ff41a03
> > --- /dev/null
> > +++ b/mm/shuffle.c
> > @@ -0,0 +1,215 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> > +
> > +#include <linux/mm.h>
> > +#include <linux/init.h>
> > +#include <linux/mmzone.h>
> > +#include <linux/random.h>
> > +#include <linux/shuffle.h>
> > +#include <linux/moduleparam.h>
> > +#include "internal.h"
> > +
> > +DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> > +static unsigned long shuffle_state;
> > +
> > +/*
> > + * Depending on the architecture, module parameter parsing may run
> > + * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
> > + * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
> > + * attempts to turn on the implementation, but aborts if it finds
> > + * SHUFFLE_FORCE_DISABLE already set.
> > + */
> > +void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
> > +{
> > +       if (ctl == SHUFFLE_FORCE_DISABLE)
> > +               set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
> > +
> > +       if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
> > +               if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
> > +                       static_branch_disable(&page_alloc_shuffle_key);
> > +       } else if (ctl == SHUFFLE_ENABLE
> > +                       && !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
> > +               static_branch_enable(&page_alloc_shuffle_key);
> > +}
> > +
> > +static bool shuffle_param;
> > +extern int shuffle_show(char *buffer, const struct kernel_param *kp)
> > +{
> > +       return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
> > +                       ? 'Y' : 'N');
> > +}
> > +static int shuffle_store(const char *val, const struct kernel_param *kp)
> > +{
> > +       int rc = param_set_bool(val, kp);
> > +
> > +       if (rc < 0)
> > +               return rc;
> > +       if (shuffle_param)
> > +               page_alloc_shuffle(SHUFFLE_ENABLE);
> > +       else
> > +               page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
> > +       return 0;
> > +}
> > +module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
>
> If this is 0400, you don't intend it to be changed after boot. If it's
> supposed to be immutable, why not make these __init calls?

It's not changeable after boot, but it's still readable after boot.
This is there to allow interrogation of whether shuffling is in-effect
at runtime.
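
The state that this read-only parameter reports follows from the
page_alloc_shuffle() logic quoted earlier, where a force-disable is
sticky regardless of call order. A rough userspace model of that logic,
assuming plain booleans in place of the shuffle_state bits and the static
key (struct shuffle_model and the model_* / TOY_* names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two control requests used in the patch. */
enum toy_shuffle_ctl {
	TOY_SHUFFLE_ENABLE,
	TOY_SHUFFLE_FORCE_DISABLE,
};

struct shuffle_model {
	bool enabled;        /* models the SHUFFLE_ENABLE bit and the static key */
	bool force_disabled; /* models the SHUFFLE_FORCE_DISABLE bit */
};

/* Models page_alloc_shuffle(): once force-disabled, enabling is refused. */
static void model_page_alloc_shuffle(struct shuffle_model *s,
				     enum toy_shuffle_ctl ctl)
{
	assert(s != NULL);

	if (ctl == TOY_SHUFFLE_FORCE_DISABLE)
		s->force_disabled = true;

	if (s->force_disabled)
		s->enabled = false; /* revert any earlier enable */
	else if (ctl == TOY_SHUFFLE_ENABLE)
		s->enabled = true;
}

/* Models shuffle_show(): what a read of the parameter would report. */
static char model_shuffle_show(const struct shuffle_model *s)
{
	assert(s != NULL);
	return s->enabled ? 'Y' : 'N';
}
```

In this model an enable request followed by a force-disable reads back as
'N', and a later enable request does not change that.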

> > + * For two pages to be swapped in the shuffle, they must be free (on a
> > + * 'free_area' lru), have the same order, and have the same migratetype.
> > + */
> > +static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
> > +{
> > +       struct page *page;
> > +
> > +       /*
> > +        * Given we're dealing with randomly selected pfns in a zone we
> > +        * need to ask questions like...
> > +        */
> > +
> > +       /* ...is the pfn even in the memmap? */
> > +       if (!pfn_valid_within(pfn))
> > +               return NULL;
> > +
> > +       /* ...is the pfn in a present section or a hole? */
> > +       if (!pfn_present(pfn))
> > +               return NULL;
> > +
> > +       /* ...is the page free and currently on a free_area list? */
> > +       page = pfn_to_page(pfn);
> > +       if (!PageBuddy(page))
> > +               return NULL;
> > +
> > +       /*
> > +        * ...is the page on the same list as the page we will
> > +        * shuffle it with?
> > +        */
> > +       if (page_order(page) != order)
> > +               return NULL;
> > +
> > +       return page;
> > +}
> > +
> > +/*
> > + * Fisher-Yates shuffle the freelist which prescribes iterating through
> > + * an array, pfns in this case, and randomly swapping each entry with
> > + * another in the span, end_pfn - start_pfn.
> > + *
> > + * To keep the implementation simple it does not attempt to correct for
> > + * sources of bias in the distribution, like modulo bias or
> > + * pseudo-random number generator bias. I.e. the expectation is that
> > + * this shuffling raises the bar for attacks that exploit the
> > + * predictability of page allocations, but need not be a perfect
> > + * shuffle.
> > + *
> > + * Note that we don't use @z->zone_start_pfn and zone_end_pfn(@z)
> > + * directly since the caller may be aware of holes in the zone and can
> > + * improve the accuracy of the random pfn selection.
> > + */
> > +#define SHUFFLE_RETRY 10
> > +static void __meminit shuffle_zone_order(struct zone *z, unsigned long start_pfn,
> > +               unsigned long end_pfn, const int order)
> > +{
> > +       unsigned long i, flags;
> > +       const int order_pages = 1 << order;
> > +
> > +       if (start_pfn < z->zone_start_pfn)
> > +               start_pfn = z->zone_start_pfn;
> > +       if (end_pfn > zone_end_pfn(z))
> > +               end_pfn = zone_end_pfn(z);
> > +
> > +       /* probably means that start/end were outside the zone */
> > +       if (end_pfn <= start_pfn)
> > +               return;
> > +       spin_lock_irqsave(&z->lock, flags);
> > +       start_pfn = ALIGN(start_pfn, order_pages);
> > +       for (i = start_pfn; i < end_pfn; i += order_pages) {
> > +               unsigned long j;
> > +               int migratetype, retry;
> > +               struct page *page_i, *page_j;
> > +
> > +               /*
> > +                * We expect page_i, in the sub-range of a zone being
> > +                * added (@start_pfn to @end_pfn), to more likely be
> > +                * valid compared to page_j randomly selected in the
> > +                * span @zone_start_pfn to @spanned_pages.
> > +                */
> > +               page_i = shuffle_valid_page(i, order);
> > +               if (!page_i)
> > +                       continue;
> > +
> > +               for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
> > +                       /*
> > +                        * Pick a random order aligned page from the
> > +                        * start of the zone. Use the *whole* zone here
> > +                        * so that if it is freed in tiny pieces that we
> > +                        * randomize in the whole zone, not just within
> > +                        * those fragments.
> > +                        *
> > +                        * Since page_j comes from a potentially sparse
> > +                        * address range we want to try a bit harder to
> > +                        * find a shuffle point for page_i.
> > +                        */
> > +                       j = z->zone_start_pfn +
> > +                               ALIGN_DOWN(get_random_long() % z->spanned_pages,
> > +                                               order_pages);
>
> How late in the boot process does this happen, btw?

This happens early at mem_init() before the software rng is initialized.

> Do we get warnings
> from the RNG about early usage?

Yes, it would trigger on some platforms. It does not on my test system
because I'm running on an arch_get_random_long() enabled system.
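
For reference, the arithmetic that turns one raw random value into a
candidate swap pfn, plus the bounded retry around it, can be sketched in
userspace. This is only an illustration under assumptions: struct
toy_zone, pick_pfn() and pick_swap_target() are hypothetical, the caller
supplies the random values instead of get_random_long(), and a boolean
array stands in for shuffle_valid_page().

```c
#include <assert.h>
#include <stdbool.h>

#define SHUFFLE_RETRY 10

/* Round @x down to a multiple of @a; @a must be a power of two. */
static unsigned long align_down(unsigned long x, unsigned long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return x & ~(a - 1);
}

/* Hypothetical zone model: only the two fields the selection needs. */
struct toy_zone {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

/* Map one raw random value to an order-aligned offset, then to a pfn. */
static unsigned long pick_pfn(const struct toy_zone *z, unsigned long rnd,
			      int order)
{
	unsigned long order_pages = 1UL << order;

	assert(z->spanned_pages != 0);
	return z->zone_start_pfn + align_down(rnd % z->spanned_pages, order_pages);
}

/*
 * Try up to SHUFFLE_RETRY candidates for a swap partner of pfn @i.
 * @rnd must hold SHUFFLE_RETRY values; @valid is indexed by the offset
 * from zone_start_pfn and must hold spanned_pages entries.
 * Returns true and stores the chosen pfn in @out on success.
 */
static bool pick_swap_target(const struct toy_zone *z, unsigned long i,
			     int order, const unsigned long *rnd,
			     const bool *valid, unsigned long *out)
{
	for (int retry = 0; retry < SHUFFLE_RETRY; retry++) {
		unsigned long j = pick_pfn(z, rnd[retry], order);

		if (valid[j - z->zone_start_pfn] && j != i) {
			*out = j;
			return true;
		}
	}
	return false; /* caller skips this page, as the patch does */
}
```

Note that with this arithmetic the offset is order-aligned, so the
resulting pfn is order-aligned only when zone_start_pfn itself is.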


* Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2019-01-08  1:48     ` Dan Williams
@ 2019-01-08 23:24       ` Kees Cook
  0 siblings, 0 replies; 16+ messages in thread
From: Kees Cook @ 2019-01-08 23:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Michal Hocko, Dave Hansen, Mike Rapoport,
	Keith Busch, Linux-MM, LKML, Mel Gorman

On Mon, Jan 7, 2019 at 5:48 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Mon, Jan 7, 2019 at 4:19 PM Kees Cook <keescook@chromium.org> wrote:
> > Why does this need ACPI_NUMA? (e.g. why can't I use this on a non-ACPI
> > arm64 system?)
>
> I was thinking this would be expanded for each platform-type that will
> implement the auto-detect capability. However, there really is no
> direct dependency and if you wanted to just use the command line
> switch that should be allowed on any platform.
>
> I'll delete this dependency for v8, but I'll hold off on that posting
> awaiting feedback from mm folks.

Okay, cool. I'm glad there wasn't a real dep. :)

> > > +static bool shuffle_param;
> > > +extern int shuffle_show(char *buffer, const struct kernel_param *kp)
> > > +{
> > > +       return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
> > > +                       ? 'Y' : 'N');
> > > +}
> > > +static int shuffle_store(const char *val, const struct kernel_param *kp)
> > > +{
> > > +       int rc = param_set_bool(val, kp);
> > > +
> > > +       if (rc < 0)
> > > +               return rc;
> > > +       if (shuffle_param)
> > > +               page_alloc_shuffle(SHUFFLE_ENABLE);
> > > +       else
> > > +               page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
> > > +       return 0;
> > > +}
> > > +module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
> >
> > If this is 0400, you don't intend it to be changed after boot. If it's
> > supposed to be immutable, why not make these __init calls?
>
> It's not changeable after boot, but it's still readable after boot.
> This is there to allow interrogation of whether shuffling is in-effect
> at runtime.

In that case, can you make all the runtime-immutable things __ro_after_init?

> > > +                               ALIGN_DOWN(get_random_long() % z->spanned_pages,
> > > +                                               order_pages);
> >
> > How late in the boot process does this happen, btw?
>
> This happens early at mem_init() before the software rng is initialized.
>
> > Do we get warnings
> > from the RNG about early usage?
>
> Yes, it would trigger on some platforms. It does not on my test system
> because I'm running on an arch_get_random_long() enabled system.

Okay, cool. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2019-01-07 23:21 ` [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization Dan Williams
  2019-01-08  0:18   ` Kees Cook
@ 2019-01-10 10:56   ` Mel Gorman
  2019-01-10 21:29     ` Dan Williams
  2019-01-25 14:20   ` Michal Hocko
  2 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2019-01-10 10:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Michal Hocko, Kees Cook, Dave Hansen, Mike Rapoport,
	keith.busch, linux-mm, linux-kernel

On Mon, Jan 07, 2019 at 03:21:10PM -0800, Dan Williams wrote:
> Randomization of the page allocator improves the average utilization of
> a direct-mapped memory-side-cache. Memory side caching is a platform
> capability that Linux has been previously exposed to in HPC
> (high-performance computing) environments on specialty platforms. In
> that instance it was a smaller pool of high-bandwidth-memory relative to
> higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> be found on general purpose server platforms where DRAM is a cache in
> front of higher latency persistent memory [1].
> 

So I glanced through the series and while I won't nak it, I'm not a
major fan either so I won't ack it either. While there are merits to
randomisation in terms of cache coloring, it may not be robust. IIRC, the
main strength of randomisation vs being smart was "it's simple and usually
doesn't fall apart completely". In particular I'd worry that compaction
will undo all the randomisation work by moving related pages into the same
direct-mapped lines. Furthermore, the runtime list management of "randomly
place at head or tail of list" will have variable and non-deterministic
outcomes and may also be undone by either high-order merging or compaction.

As bad as it is, an ideal world would have a proper cache-coloring
allocation algorithm but they previously failed as the runtime overhead
exceeded the actual benefit, particularly as fully associative caches
became more popular and there was no universal "one solution fits all". One
hatchet job around it may be to have per-task free-lists that put free
pages into buckets with the obvious caveat that those lists would need
draining and secondary locking. A caveat of that is that there may need
to be arch and/or driver hooks to detect how the colors are managed which
could also turn into a mess.

The big plus of the series is that it's relatively simple and appears to
be isolated enough that it only has an impact when the necessary hardware
is in place. It will deal with some cases but I'm not sure it'll survive
long-term, particularly if HPC continues to report in the field that
reboots are necessary to reshuffle the lists (taken from your linked
documents). That workaround of running STREAM before a job starts and
rebooting the machine if the performance SLAs are not met is horrid.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2019-01-10 10:56   ` Mel Gorman
@ 2019-01-10 21:29     ` Dan Williams
  2019-01-10 22:52       ` Kees Cook
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2019-01-10 21:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Kees Cook, Dave Hansen,
	Mike Rapoport, Keith Busch, Linux MM, Linux Kernel Mailing List

On Thu, Jan 10, 2019 at 2:57 AM Mel Gorman <mgorman@suse.de> wrote:
>
> On Mon, Jan 07, 2019 at 03:21:10PM -0800, Dan Williams wrote:
> > Randomization of the page allocator improves the average utilization of
> > a direct-mapped memory-side-cache. Memory side caching is a platform
> > capability that Linux has been previously exposed to in HPC
> > (high-performance computing) environments on specialty platforms. In
> > that instance it was a smaller pool of high-bandwidth-memory relative to
> > higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> > be found on general purpose server platforms where DRAM is a cache in
> > front of higher latency persistent memory [1].
> >
>
> So I glanced through the series and while I won't nak it, I'm not a
> major fan either so I won't ack it either.

Thanks for taking a look, some more comments / advocacy below...
because I'm not sure what Andrew will do with a "meh" response
compared to an ack.

> While there are merits to
> randomisation in terms of cache coloring, it may not be robust. IIRC, the
> main strength of randomisation vs being smart was "it's simple and usually
> doesn't fall apart completely". In particular I'd worry that compaction
> will undo all the randomisation work by moving related pages into the same
> direct-mapped lines. Furthermore, the runtime list management of "randomly
> place at head or tail of list" will have variable and non-deterministic
> outcomes and may also be undone by either high-order merging or compaction.

It's a fair point. To date we have not been able to measure the
average performance degrading over time (pages becoming more ordered)
but that said I think it would take more resources and time than I
have available for that trend to present. If it did present that would
only speak to a need to be more aggressive on the runtime
re-randomization. I think there's a case to be made to start simple
and only get more aggressive with evidence.

Note that higher order merging is not a current concern since the
implementation is already randomizing on MAX_ORDER sized pages. Since
memory side caches are so large there's no worry about a 4MB
randomization boundary.
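
As a quick check of the 4MB figure: the block size is the base page size
shifted left by the order. The sketch below assumes a 4 KiB base page (the
common x86 configuration, an assumption not stated in this mail) and uses
order 10, the MAX_ORDER-1 value given in the patch description.
order_to_bytes() is a hypothetical helper name.

```c
#include <assert.h>

/* Bytes covered by one buddy-allocator block of the given order. */
static unsigned long long order_to_bytes(unsigned int order,
					 unsigned long long page_size)
{
	assert(order < 32);	/* keep the shift well inside 64 bits */
	return page_size << order;
}
```

With order 10 and 4096-byte pages this gives 1024 pages per block, i.e.
4 MiB, which is the randomization boundary referred to above.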

However, for the (unproven) security use case where folks want to
experiment with randomizing on smaller granularity, they should be
wary of this (/me nudges Kees).

> As bad as it is, an ideal world would have a proper cache-coloring
> allocation algorithm but they previously failed as the runtime overhead
> exceeded the actual benefit, particularly as fully associative caches
> became more popular and there was no universal "one solution fits all". One
> hatchet job around it may be to have per-task free-lists that put free
> pages into buckets with the obvious caveat that those lists would need
> draining and secondary locking. A caveat of that is that there may need
> to be arch and/or driver hooks to detect how the colors are managed which
> could also turn into a mess.

We (Dave, I and others that took a look at this) started here, and the
"mess" looked daunting compared to randomization. Also a mess without
much more incremental benefit.

We also settled on a numa_emulation based approach for the cases where
an administrator knows they have a workload that can fit in the
cache... more on that below:

> The big plus of the series is that it's relatively simple and appears to
> be isolated enough that it only has an impact when the necessary hardware
> is in place. It will deal with some cases but I'm not sure it'll survive
> long-term, particularly if HPC continues to report in the field that
> reboots are necessary to reshuffle the lists (taken from your linked
> documents). That workaround of running STREAM before a job starts and
> rebooting the machine if the performance SLAs are not met is horrid.

That workaround is horrid, and we have a separate solution for it
merged in commit cc9aec03e58f "x86/numa_emulation: Introduce uniform
split capability". When an administrator knows in advance that a
workload will fit in cache they can use this capability to run the
workload in a numa node that is guaranteed to not have cache conflicts
with itself.

Whereas randomization benefits the general cache-overcommit case. The
uniform numa split case addresses those niche users that can manually
time schedule jobs with different working set sizes... without needing
to reboot.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2019-01-10 21:29     ` Dan Williams
@ 2019-01-10 22:52       ` Kees Cook
  0 siblings, 0 replies; 16+ messages in thread
From: Kees Cook @ 2019-01-10 22:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Mel Gorman, Andrew Morton, Michal Hocko, Dave Hansen,
	Mike Rapoport, Keith Busch, Linux MM, Linux Kernel Mailing List

On Thu, Jan 10, 2019 at 1:29 PM Dan Williams <dan.j.williams@intel.com> wrote:
> Note that higher order merging is not a current concern since the
> implementation is already randomizing on MAX_ORDER sized pages. Since
> memory side caches are so large there's no worry about a 4MB
> randomization boundary.
>
> However, for the (unproven) security use case where folks want to
> experiment with randomizing on smaller granularity, they should be
> wary of this (/me nudges Kees).

Yup. And I think this is well noted in the Kconfig help already. I
view this as slightly finer-grained randomization than what we
effectively get from just the base address randomization that
CONFIG_RANDOMIZE_MEMORY performs.

I remain a fan of this series. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2019-01-07 23:21 ` [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization Dan Williams
  2019-01-08  0:18   ` Kees Cook
  2019-01-10 10:56   ` Mel Gorman
@ 2019-01-25 14:20   ` Michal Hocko
  2019-01-29 19:26     ` Dan Williams
  2019-01-29 20:04     ` Dan Williams
  2 siblings, 2 replies; 16+ messages in thread
From: Michal Hocko @ 2019-01-25 14:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Kees Cook, Dave Hansen, Mike Rapoport, keith.busch,
	linux-mm, linux-kernel, mgorman

On Mon 07-01-19 15:21:10, Dan Williams wrote:
[...]

Thanks a lot for the additional information. And...

> Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> when they are initially populated with free memory at boot and at
> hotplug time. Do this based on either the presence of a
> page_alloc.shuffle=Y command line parameter, or autodetection of a
> memory-side-cache (to be added in a follow-on patch).

... to make it opt-in and also provide an opt-out to override for the
auto-detected case.
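
To make the opt-in / opt-out interaction concrete, here is a small userspace
model. It is a sketch only: the SHUFFLE_ENABLE and SHUFFLE_FORCE_DISABLE names
come from the quoted patch, while struct shuffle_model, shuffle_model_ctl()
and shuffle_model_show() are hypothetical, and the "force-disable is sticky
and always wins" rule is an assumption about the intended precedence.

```c
#include <assert.h>
#include <stdbool.h>

/* Control requests, named after the identifiers quoted in the thread. */
enum shuffle_ctl { SHUFFLE_ENABLE, SHUFFLE_FORCE_DISABLE };

/* Hypothetical userspace model of the state; not the kernel's layout. */
struct shuffle_model {
	bool enabled;		/* what a show() callback would report */
	bool force_disabled;	/* sticky opt-out, e.g. from the command line */
};

static void shuffle_model_ctl(struct shuffle_model *s, enum shuffle_ctl ctl)
{
	assert(s);
	if (ctl == SHUFFLE_FORCE_DISABLE)
		s->force_disabled = true;

	if (s->force_disabled)
		s->enabled = false;	/* assumed: the opt-out always wins */
	else if (ctl == SHUFFLE_ENABLE)
		s->enabled = true;	/* command line or auto-detect opt-in */
}

/* Report the state the same way the quoted shuffle_show() formats it. */
static char shuffle_model_show(const struct shuffle_model *s)
{
	assert(s);
	return s->enabled ? 'Y' : 'N';
}
```

Under this model an auto-detect call with SHUFFLE_ENABLE that arrives after a
command line force-disable is ignored, which is the override behaviour asked
for here.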

> The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
> pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e.
> 10, 4MB this trades off randomization granularity for time spent
> shuffling.

But I do not really think we want to make this a config option. Who do
you expect will tune this? I would rather wait for those usecases to be
called out and we can give them a command line parameter to do so rather
than something hardcoded during compile time and as such really unusable
for any consumer of the pre-built kernels.

I do not have a problem with the default section though.

> MAX_ORDER-1 was chosen to be minimally invasive to the page
> allocator while still showing memory-side cache behavior improvements,
> and the expectation that the security implications of finer granularity
> randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM.
> 
> The performance impact of the shuffling appears to be in the noise
> compared to other memory initialization work. Also the bulk of the work
> is done in the background as a part of deferred_init_memmap().
> 
> This initial randomization can be undone over time so a follow-on patch
> is introduced to inject entropy on page free decisions. It is reasonable
> to ask if the page free entropy is sufficient, but it is not enough due
> to the in-order initial freeing of pages. At the start of that process
> putting page1 in front or behind page0 still keeps them close together,
> page2 is still near page1 and has a high chance of being adjacent. As
> more pages are added ordering diversity improves, but there is still
> high page locality for the low address pages and this leads to no
> significant impact to the cache conflict rate.
> 
> [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> [2]: https://lkml.org/lkml/2018/9/22/54
> [3]: https://lkml.org/lkml/2018/10/12/309

Please turn lkml.org links into http://lkml.kernel.org/r/$msg_id

[....]
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cc4a507d7ca4..8c37a023a790 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1272,6 +1272,10 @@ void sparse_init(void);
>  #else
>  #define sparse_init()	do {} while (0)
>  #define sparse_index_init(_sec, _nid)  do {} while (0)
> +static inline int pfn_present(unsigned long pfn)
> +{
> +	return 1;
> +}

Does this really make sense? Shouldn't this default to pfn_valid on
!sparsemem?

[...]
> +config SHUFFLE_PAGE_ALLOCATOR
> +	bool "Page allocator randomization"
> +	depends on ACPI_NUMA
> +	default SLAB_FREELIST_RANDOM
> +	help
> +	  Randomization of the page allocator improves the average
> +	  utilization of a direct-mapped memory-side-cache. See section
> +	  5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
> +	  6.2a specification for an example of how a platform advertises
> +	  the presence of a memory-side-cache. There are also incidental
> +	  security benefits as it reduces the predictability of page
> +	  allocations to complement SLAB_FREELIST_RANDOM, but the
> +	  default granularity of shuffling on 4MB (MAX_ORDER) pages is
> +	  selected based on cache utilization benefits.
> +
> +	  While the randomization improves cache utilization it may
> +	  negatively impact workloads on platforms without a cache. For
> +	  this reason, by default, the randomization is enabled only
> +	  after runtime detection of a direct-mapped memory-side-cache.
> +	  Otherwise, the randomization may be force enabled with the
> +	  'page_alloc.shuffle' kernel command line parameter.
> +
> +	  Say Y if unsure.

Do we really need to make this a choice? Are any of the tiny systems
going to be NUMA? Why cannot we just make it depend on ACPI_NUMA?

> +config SHUFFLE_PAGE_ORDER
> +	depends on SHUFFLE_PAGE_ALLOCATOR
> +	int "Page allocator shuffle order"
> +	range 0 10
> +	default 10
> +	help
> +	  Specify the granularity at which shuffling (randomization) is
> +	  performed. By default this is set to MAX_ORDER-1 to minimize
> +	  runtime impact of randomization and with the expectation that
> +	  SLAB_FREELIST_RANDOM mitigates heap attacks on smaller
> +	  object granularities.
> +

and no, do not make this configurable here as already mentioned.
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 022d4cbb3618..3602f7a2eab4 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -17,6 +17,7 @@
>  #include <linux/poison.h>
>  #include <linux/pfn.h>
>  #include <linux/debugfs.h>
> +#include <linux/shuffle.h>
>  #include <linux/kmemleak.h>
>  #include <linux/seq_file.h>
>  #include <linux/memblock.h>
> @@ -1929,9 +1930,16 @@ static unsigned long __init free_low_memory_core_early(void)
>  	 *  low ram will be on Node1
>  	 */
>  	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
> -				NULL)
> +				NULL) {
> +		pg_data_t *pgdat;
> +
>  		count += __free_memory_core(start, end);
>  
> +		for_each_online_pgdat(pgdat)
> +			shuffle_free_memory(pgdat, PHYS_PFN(start),
> +					PHYS_PFN(end));
> +	}
> +
>  	return count;
>  }
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b9a667d36c55..7caffb9a91ab 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -23,6 +23,7 @@
>  #include <linux/highmem.h>
>  #include <linux/vmalloc.h>
>  #include <linux/ioport.h>
> +#include <linux/shuffle.h>
>  #include <linux/delay.h>
>  #include <linux/migrate.h>
>  #include <linux/page-isolation.h>
> @@ -895,6 +896,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	zone->zone_pgdat->node_present_pages += onlined_pages;
>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>  
> +	shuffle_zone(zone, pfn, zone_end_pfn(zone));
> +
>  	if (onlined_pages) {
>  		node_states_set_node(nid, &arg);
>  		if (need_zonelists_rebuild)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cde5dac6229a..2adcd6da8a07 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -61,6 +61,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/sched/mm.h>
>  #include <linux/page_owner.h>
> +#include <linux/shuffle.h>
>  #include <linux/kthread.h>
>  #include <linux/memcontrol.h>
>  #include <linux/ftrace.h>
> @@ -1634,6 +1635,8 @@ static int __init deferred_init_memmap(void *data)
>  	}
>  	pgdat_resize_unlock(pgdat, &flags);
>  
> +	shuffle_zone(zone, first_init_pfn, zone_end_pfn(zone));
> +
>  	/* Sanity check that the next zone really is unpopulated */
>  	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));

I would prefer if we had fewer places to place the shuffling. Why
cannot we have a single place for the bootup and one for the onlining part?
page_alloc_init_late sounds like a good place for the later. You can
miss some early allocations but are those of a big interest?

I haven't checked the actual shuffling algorithm, I will trust you on
that part ;)
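
For reference, the textbook Fisher-Yates shuffle that the changelog names
looks roughly like the sketch below. This is a plain array version with
assumptions called out: fisher_yates() and xorshift64() are illustrative
names, the PRNG is a small deterministic xorshift so the result is
reproducible, and the modulo introduces a slight bias. The patch itself
works on free list pages and picks the swap partner from the whole zone span,
so it is an adaptation rather than this exact loop.

```c
#include <stddef.h>

/*
 * Tiny deterministic PRNG (xorshift64) so the sketch is reproducible;
 * the kernel patch uses get_random_long() instead.
 */
static unsigned long long xorshift64(unsigned long long *state)
{
	unsigned long long x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/* Classic Fisher-Yates: walk from the end, swap slot i-1 with a random j < i. */
static void fisher_yates(unsigned long *pfns, size_t n,
			 unsigned long long seed)
{
	unsigned long long state = seed ? seed : 1; /* xorshift needs non-zero */
	size_t i;

	for (i = n; i > 1; i--) {
		size_t j = (size_t)(xorshift64(&state) % i);
		unsigned long tmp = pfns[i - 1];

		pfns[i - 1] = pfns[j];
		pfns[j] = tmp;
	}
}
```

Each element is swapped at most once per pass, so the cost is linear in the
number of order-sized blocks being shuffled.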
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 2/3] mm: Move buddy list manipulations into helpers
  2019-01-07 23:21 ` [PATCH v7 2/3] mm: Move buddy list manipulations into helpers Dan Williams
@ 2019-01-25 14:30   ` Michal Hocko
  2019-01-29 19:27     ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2019-01-25 14:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Dave Hansen, keith.busch, linux-mm, linux-kernel, mgorman

On Mon 07-01-19 15:21:16, Dan Williams wrote:
> In preparation for runtime randomization of the zone lists, take all
> (well, most of) the list_*() functions in the buddy allocator and put
> them in helper functions. Provide a common control point for injecting
> additional behavior when freeing pages.

Looks good in general and it actually makes the code more readable.
One nit below

[...]
> +static inline void rmv_page_order(struct page *page)
> +{
> +	__ClearPageBuddy(page);
> +	set_page_private(page, 0);
> +}
> +

I guess we do not really need this helper and can simply squash it into
its only user.

Acked-by: Michal Hocko <mhocko@suse.com>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 3/3] mm: Maintain randomization of page free lists
       [not found] ` <154690328135.676627.5979130839159447106.stgit@dwillia2-desk3.amr.corp.intel.com>
  2019-01-08  0:19   ` [PATCH v7 3/3] mm: Maintain randomization of page free lists Kees Cook
@ 2019-01-25 14:32   ` Michal Hocko
  1 sibling, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2019-01-25 14:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Kees Cook, Dave Hansen, keith.busch, linux-mm,
	linux-kernel, mgorman

On Mon 07-01-19 15:21:21, Dan Williams wrote:
> When freeing a page with an order >= shuffle_page_order randomly select
> the front or back of the list for insertion.
> 
> While the mm tries to defragment physical pages into huge pages this can
> tend to make the page allocator more predictable over time. Inject the
> front-back randomness to preserve the initial randomness established by
> shuffle_free_memory() when the kernel was booted.
> 
> The overhead of this manipulation is constrained by only being applied
> for MAX_ORDER sized pages by default.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/mmzone.h  |   10 ++++++++++
>  include/linux/shuffle.h |   12 ++++++++++++
>  mm/page_alloc.c         |   11 +++++++++--
>  mm/shuffle.c            |   16 ++++++++++++++++
>  4 files changed, 47 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b78a45e0b11c..c15f7f703be0 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -98,6 +98,8 @@ extern int page_group_by_mobility_disabled;
>  struct free_area {
>  	struct list_head	free_list[MIGRATE_TYPES];
>  	unsigned long		nr_free;
> +	u64			rand;
> +	u8			rand_bits;
>  };

Do we really need per order randomness? Why is a global one not
sufficient?
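
To illustrate what a cached-bits scheme built on those two fields could look
like, here is a userspace sketch. It is a guess at the usage based only on
the field names: struct rand_cache, pick_front() and demo_refill() are
hypothetical, and the entropy source is a caller-supplied callback standing
in for a kernel RNG call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two fields added to struct free_area. */
struct rand_cache {
	uint64_t rand;		/* cached random bits */
	uint8_t rand_bits;	/* how many bits of 'rand' are still unused */
};

/* Caller-supplied entropy source, standing in for a kernel RNG call. */
typedef uint64_t (*rand64_fn)(void);

/* Fixed bit pattern (binary ...0101), for demonstration only. */
static uint64_t demo_refill(void)
{
	return 0x5ULL;
}

/* Returns true for "insert at the front", false for "insert at the back". */
static bool pick_front(struct rand_cache *rc, rand64_fn refill)
{
	bool front;

	assert(rc && refill);
	if (rc->rand_bits == 0) {
		rc->rand = refill();	/* one RNG call per 64 decisions */
		rc->rand_bits = 64;
	}
	front = rc->rand & 1;
	rc->rand >>= 1;
	rc->rand_bits--;
	return front;
}
```

A single global cache would run the same logic on one shared struct instead
of one per free_area; whether that is sufficient is the question raised
above.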
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2019-01-25 14:20   ` Michal Hocko
@ 2019-01-29 19:26     ` Dan Williams
  2019-01-29 20:04     ` Dan Williams
  1 sibling, 0 replies; 16+ messages in thread
From: Dan Williams @ 2019-01-29 19:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kees Cook, Dave Hansen, Mike Rapoport,
	Keith Busch, Linux MM, Linux Kernel Mailing List, Mel Gorman

On Fri, Jan 25, 2019 at 6:21 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 07-01-19 15:21:10, Dan Williams wrote:
> [...]
>
> Thanks a lot for the additional information. And...

Hi Michal,

Thanks for the review!

> > Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> > perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> > when they are initially populated with free memory at boot and at
> > hotplug time. Do this based on either the presence of a
> > page_alloc.shuffle=Y command line parameter, or autodetection of a
> > memory-side-cache (to be added in a follow-on patch).
>
> ... to make it opt-in and also provide an opt-out to override for the
> auto-detected case.
>
> > The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
> > pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e.
> > 10, 4MB this trades off randomization granularity for time spent
> > shuffling.
>
> But I do not really think we want to make this a config option. Who do
> you expect will tune this? I would rather wait for those usecases to be
> called out and we can give them a command line parameter to do so rather
> than something hardcoded during compile time and as such really unusable
> for any consumer of the pre-built kernels.

True. I have no problem removing it. If people want to play with
randomizing different orders they can change the compile-time constant
manually. If it turns out that there is a use case for it to be
dynamically set from the command line, then that can be added when the
demand / user is clarified.

> I do not have a problem with the default section though.

Ok.

> > MAX_ORDER-1 was chosen to be minimally invasive to the page
> > allocator while still showing memory-side cache behavior improvements,
> > and the expectation that the security implications of finer granularity
> > randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM.
> >
> > The performance impact of the shuffling appears to be in the noise
> > compared to other memory initialization work. Also the bulk of the work
> > is done in the background as a part of deferred_init_memmap().
> >
> > This initial randomization can be undone over time so a follow-on patch
> > is introduced to inject entropy on page free decisions. It is reasonable
> > to ask if the page free entropy is sufficient, but it is not enough due
> > to the in-order initial freeing of pages. At the start of that process
> > putting page1 in front or behind page0 still keeps them close together,
> > page2 is still near page1 and has a high chance of being adjacent. As
> > more pages are added ordering diversity improves, but there is still
> > high page locality for the low address pages and this leads to no
> > significant impact to the cache conflict rate.
> >
> > [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> > [2]: https://lkml.org/lkml/2018/9/22/54
> > [3]: https://lkml.org/lkml/2018/10/12/309
>
> Please turn lkml.org links into http://lkml.kernel.org/r/$msg_id

Will do.


>
> [....]
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index cc4a507d7ca4..8c37a023a790 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1272,6 +1272,10 @@ void sparse_init(void);
> >  #else
> >  #define sparse_init()        do {} while (0)
> >  #define sparse_index_init(_sec, _nid)  do {} while (0)
> > +static inline int pfn_present(unsigned long pfn)
> > +{
> > +     return 1;
> > +}
>
> Does this really make sense? Shouldn't this default to pfn_valid on
> !sparsemem?
>
> [...]
> > +config SHUFFLE_PAGE_ALLOCATOR
> > +     bool "Page allocator randomization"
> > +     depends on ACPI_NUMA
> > +     default SLAB_FREELIST_RANDOM
> > +     help
> > +       Randomization of the page allocator improves the average
> > +       utilization of a direct-mapped memory-side-cache. See section
> > +       5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
> > +       6.2a specification for an example of how a platform advertises
> > +       the presence of a memory-side-cache. There are also incidental
> > +       security benefits as it reduces the predictability of page
> > +       allocations to complement SLAB_FREELIST_RANDOM, but the
> > +       default granularity of shuffling on 4MB (MAX_ORDER) pages is
> > +       selected based on cache utilization benefits.
> > +
> > +       While the randomization improves cache utilization it may
> > +       negatively impact workloads on platforms without a cache. For
> > +       this reason, by default, the randomization is enabled only
> > +       after runtime detection of a direct-mapped memory-side-cache.
> > +       Otherwise, the randomization may be force enabled with the
> > +       'page_alloc.shuffle' kernel command line parameter.
> > +
> > +       Say Y if unsure.
>
> Do we really need to make this a choice? Are any of the tiny systems
> going to be NUMA? Why cannot we just make it depend on ACPI_NUMA?
>
> > +config SHUFFLE_PAGE_ORDER
> > +     depends on SHUFFLE_PAGE_ALLOCATOR
> > +     int "Page allocator shuffle order"
> > +     range 0 10
> > +     default 10
> > +     help
> > +       Specify the granularity at which shuffling (randomization) is
> > +       performed. By default this is set to MAX_ORDER-1 to minimize
> > +       runtime impact of randomization and with the expectation that
> > +       SLAB_FREELIST_RANDOM mitigates heap attacks on smaller
> > +       object granularities.
> > +
>
> and no, do not make this configurable here as already mentioned.
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 022d4cbb3618..3602f7a2eab4 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/poison.h>
> >  #include <linux/pfn.h>
> >  #include <linux/debugfs.h>
> > +#include <linux/shuffle.h>
> >  #include <linux/kmemleak.h>
> >  #include <linux/seq_file.h>
> >  #include <linux/memblock.h>
> > @@ -1929,9 +1930,16 @@ static unsigned long __init free_low_memory_core_early(void)
> >        *  low ram will be on Node1
> >        */
> >       for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
> > -                             NULL)
> > +                             NULL) {
> > +             pg_data_t *pgdat;
> > +
> >               count += __free_memory_core(start, end);
> >
> > +             for_each_online_pgdat(pgdat)
> > +                     shuffle_free_memory(pgdat, PHYS_PFN(start),
> > +                                     PHYS_PFN(end));
> > +     }
> > +
> >       return count;
> >  }
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index b9a667d36c55..7caffb9a91ab 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -23,6 +23,7 @@
> >  #include <linux/highmem.h>
> >  #include <linux/vmalloc.h>
> >  #include <linux/ioport.h>
> > +#include <linux/shuffle.h>
> >  #include <linux/delay.h>
> >  #include <linux/migrate.h>
> >  #include <linux/page-isolation.h>
> > @@ -895,6 +896,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
> >       zone->zone_pgdat->node_present_pages += onlined_pages;
> >       pgdat_resize_unlock(zone->zone_pgdat, &flags);
> >
> > +     shuffle_zone(zone, pfn, zone_end_pfn(zone));
> > +
> >       if (onlined_pages) {
> >               node_states_set_node(nid, &arg);
> >               if (need_zonelists_rebuild)
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index cde5dac6229a..2adcd6da8a07 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -61,6 +61,7 @@
> >  #include <linux/sched/rt.h>
> >  #include <linux/sched/mm.h>
> >  #include <linux/page_owner.h>
> > +#include <linux/shuffle.h>
> >  #include <linux/kthread.h>
> >  #include <linux/memcontrol.h>
> >  #include <linux/ftrace.h>
> > @@ -1634,6 +1635,8 @@ static int __init deferred_init_memmap(void *data)
> >       }
> >       pgdat_resize_unlock(pgdat, &flags);
> >
> > +     shuffle_zone(zone, first_init_pfn, zone_end_pfn(zone));
> > +
> >       /* Sanity check that the next zone really is unpopulated */
> >       WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>
> I would prefer if we had fewer places to place the shuffling. Why
> cannot we have a single place for the bootup and one for the onlining part?
> page_alloc_init_late sounds like a good place for the later. You can
> miss some early allocations but are those of a big interest?
>
> I haven't checked the actual shuffling algorithm, I will trust you on
> that part ;)
> --
> Michal Hocko
> SUSE Labs

* Re: [PATCH v7 2/3] mm: Move buddy list manipulations into helpers
  2019-01-25 14:30   ` Michal Hocko
@ 2019-01-29 19:27     ` Dan Williams
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2019-01-29 19:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Hansen, Keith Busch, Linux MM,
	Linux Kernel Mailing List, Mel Gorman

On Fri, Jan 25, 2019 at 6:31 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 07-01-19 15:21:16, Dan Williams wrote:
> > In preparation for runtime randomization of the zone lists, take all
> > (well, most of) the list_*() functions in the buddy allocator and put
> > them in helper functions. Provide a common control point for injecting
> > additional behavior when freeing pages.
>
> Looks good in general and it actually makes the code more readable.
> One nit below
>
> [...]
> > +static inline void rmv_page_order(struct page *page)
> > +{
> > +     __ClearPageBuddy(page);
> > +     set_page_private(page, 0);
> > +}
> > +
>
> I guess we do not really need this helper and can simply squash it into
> its only user.

Ok.

>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks.

>
> --
> Michal Hocko
> SUSE Labs

* Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2019-01-25 14:20   ` Michal Hocko
  2019-01-29 19:26     ` Dan Williams
@ 2019-01-29 20:04     ` Dan Williams
  1 sibling, 0 replies; 16+ messages in thread
From: Dan Williams @ 2019-01-29 20:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kees Cook, Dave Hansen, Mike Rapoport,
	Keith Busch, Linux MM, Linux Kernel Mailing List, Mel Gorman

Whoops, did not reply to all your feedback, see below:

On Fri, Jan 25, 2019 at 6:21 AM Michal Hocko <mhocko@kernel.org> wrote:
[..]
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index cc4a507d7ca4..8c37a023a790 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1272,6 +1272,10 @@ void sparse_init(void);
> >  #else
> >  #define sparse_init()        do {} while (0)
> >  #define sparse_index_init(_sec, _nid)  do {} while (0)
> > +static inline int pfn_present(unsigned long pfn)
> > +{
> > +     return 1;
> > +}
>
> Does this really make sense? Shouldn't this default to pfn_valid on
> !sparsemem?

Yes, I think it should be pfn_valid()
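
A minimal userspace sketch of what that !SPARSEMEM fallback might look
like. The pfn_valid() below is a hypothetical stand-in that assumes a flat
memory map with a made-up MAX_PFN bound; it is not the kernel's definition:

```c
/*
 * Hypothetical stand-in for the kernel's pfn_valid(): assume a flat
 * memory map covering PFNs [0, MAX_PFN). MAX_PFN is an invented bound
 * for illustration only.
 */
#define MAX_PFN 1024UL

static inline int pfn_valid(unsigned long pfn)
{
	return pfn < MAX_PFN;
}

/*
 * !SPARSEMEM fallback: defer to pfn_valid() rather than unconditionally
 * returning 1, so PFNs outside the memory map are not reported present.
 */
static inline int pfn_present(unsigned long pfn)
{
	return pfn_valid(pfn);
}
```

With the unconditional "return 1" stub, a PFN past the end of the memory
map would be reported as present; deferring to pfn_valid() avoids that.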

>
> [...]
> > +config SHUFFLE_PAGE_ALLOCATOR
> > +     bool "Page allocator randomization"
> > +     depends on ACPI_NUMA
> > +     default SLAB_FREELIST_RANDOM
> > +     help
> > +       Randomization of the page allocator improves the average
> > +       utilization of a direct-mapped memory-side-cache. See section
> > +       5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
> > +       6.2a specification for an example of how a platform advertises
> > +       the presence of a memory-side-cache. There are also incidental
> > +       security benefits as it reduces the predictability of page
> > +       allocations to complement SLAB_FREELIST_RANDOM, but the
> > +       default granularity of shuffling on 4MB (MAX_ORDER) pages is
> > +       selected based on cache utilization benefits.
> > +
> > +       While the randomization improves cache utilization it may
> > +       negatively impact workloads on platforms without a cache. For
> > +       this reason, by default, the randomization is enabled only
> > +       after runtime detection of a direct-mapped memory-side-cache.
> > +       Otherwise, the randomization may be force enabled with the
> > +       'page_alloc.shuffle' kernel command line parameter.
> > +
> > +       Say Y if unsure.
>
> Do we really need to make this a choice? Are any of the tiny systems
> going to be NUMA? Why cannot we just make it depend on ACPI_NUMA?

Kees wants to use this on ARM and I removed the ACPI_NUMA dependency
in v8 (you happened to review v7).

Given the setting has a performance impact, I believe it should be possible
to hard-disable it at compile time, but I'll update the default to:

    default SLAB_FREELIST_RANDOM && ACPI_NUMA

>
> > +config SHUFFLE_PAGE_ORDER
> > +     depends on SHUFFLE_PAGE_ALLOCATOR
> > +     int "Page allocator shuffle order"
> > +     range 0 10
> > +     default 10
> > +     help
> > +       Specify the granularity at which shuffling (randomization) is
> > +       performed. By default this is set to MAX_ORDER-1 to minimize
> > +       runtime impact of randomization and with the expectation that
> > +       SLAB_FREELIST_RANDOM mitigates heap attacks on smaller
> > +       object granularities.
> > +
>
> and no, do not make this configurable here as already mentioned.

Will remove.

> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 022d4cbb3618..3602f7a2eab4 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/poison.h>
> >  #include <linux/pfn.h>
> >  #include <linux/debugfs.h>
> > +#include <linux/shuffle.h>
> >  #include <linux/kmemleak.h>
> >  #include <linux/seq_file.h>
> >  #include <linux/memblock.h>
> > @@ -1929,9 +1930,16 @@ static unsigned long __init free_low_memory_core_early(void)
> >        *  low ram will be on Node1
> >        */
> >       for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
> > -                             NULL)
> > +                             NULL) {
> > +             pg_data_t *pgdat;
> > +
> >               count += __free_memory_core(start, end);
> >
> > +             for_each_online_pgdat(pgdat)
> > +                     shuffle_free_memory(pgdat, PHYS_PFN(start),
> > +                                     PHYS_PFN(end));
> > +     }
> > +
> >       return count;
> >  }
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index b9a667d36c55..7caffb9a91ab 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -23,6 +23,7 @@
> >  #include <linux/highmem.h>
> >  #include <linux/vmalloc.h>
> >  #include <linux/ioport.h>
> > +#include <linux/shuffle.h>
> >  #include <linux/delay.h>
> >  #include <linux/migrate.h>
> >  #include <linux/page-isolation.h>
> > @@ -895,6 +896,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
> >       zone->zone_pgdat->node_present_pages += onlined_pages;
> >       pgdat_resize_unlock(zone->zone_pgdat, &flags);
> >
> > +     shuffle_zone(zone, pfn, zone_end_pfn(zone));
> > +
> >       if (onlined_pages) {
> >               node_states_set_node(nid, &arg);
> >               if (need_zonelists_rebuild)
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index cde5dac6229a..2adcd6da8a07 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -61,6 +61,7 @@
> >  #include <linux/sched/rt.h>
> >  #include <linux/sched/mm.h>
> >  #include <linux/page_owner.h>
> > +#include <linux/shuffle.h>
> >  #include <linux/kthread.h>
> >  #include <linux/memcontrol.h>
> >  #include <linux/ftrace.h>
> > @@ -1634,6 +1635,8 @@ static int __init deferred_init_memmap(void *data)
> >       }
> >       pgdat_resize_unlock(pgdat, &flags);
> >
> > +     shuffle_zone(zone, first_init_pfn, zone_end_pfn(zone));
> > +
> >       /* Sanity check that the next zone really is unpopulated */
> >       WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>
> I would prefer if we had fewer places to do the shuffling. Why
> cannot we have a single place for the bootup and one for onlining part?
> page_alloc_init_late sounds like a good place for the later. You can
> miss some early allocations but are those of a big interest?

Ok, so you mean reduce the 3 callsites to 2. Replace the
free_low_memory_core_early() and deferred_init_memmap() sites with a
single shuffle call in page_alloc_init_late() after waiting for
deferred_init_memmap() work to complete? I don't see any red flags
with that, I'll give it a try.

> I haven't checked the actual shuffling algorithm, I will trust you on
> that part ;)

The algorithm has proved reliable. The breakage has only arisen from
missing call sites that free large amounts of memory to the allocator,
and from failing to re-randomize the whole zone rather than just the
pages that were being hot-added.
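
To illustrate that failure mode, here is a userspace sketch, not the kernel
implementation. The names shuffle_range(), prng_next() and
count_new_in_prefix() are hypothetical, the free list is modelled as a plain
array of block ids, and a Fisher-Yates shuffle with a toy PRNG stands in for
the swap logic in the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Small deterministic PRNG (xorshift32) so the sketch is reproducible.
 * This is an assumption for illustration; the kernel has its own
 * random source.
 */
static uint32_t prng_state = 0x9e3779b9u;

static uint32_t prng_next(void)
{
	uint32_t x = prng_state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	prng_state = x;
	return x;
}

/* Fisher-Yates shuffle of blocks[start, end); slots outside are untouched. */
static void shuffle_range(unsigned long *blocks, size_t start, size_t end)
{
	assert(start <= end);
	for (size_t i = end; i > start + 1; i--) {
		/* Pick j in [start, i) and swap it into slot i - 1. */
		size_t j = start + prng_next() % (i - start);
		unsigned long tmp = blocks[i - 1];

		blocks[i - 1] = blocks[j];
		blocks[j] = tmp;
	}
}

/* Count hot-added blocks (id >= first_new_id) among the first n slots. */
static size_t count_new_in_prefix(const unsigned long *blocks, size_t n,
				  unsigned long first_new_id)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (blocks[i] >= first_new_id)
			count++;
	return count;
}
```

If a zone holds blocks 0..3 and blocks 4..7 are hot-added at the tail, then
shuffle_range(zone, 4, 8) leaves count_new_in_prefix(zone, 4, 4) at zero:
the new memory stays clustered behind the old. Calling
shuffle_range(zone, 0, 8) is the only variant in which hot-added blocks can
land anywhere in the zone's list.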

Thread overview: 16+ messages
2019-01-07 23:21 [PATCH v7 0/3] mm: Randomize free memory Dan Williams
2019-01-07 23:21 ` [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization Dan Williams
2019-01-08  0:18   ` Kees Cook
2019-01-08  1:48     ` Dan Williams
2019-01-08 23:24       ` Kees Cook
2019-01-10 10:56   ` Mel Gorman
2019-01-10 21:29     ` Dan Williams
2019-01-10 22:52       ` Kees Cook
2019-01-25 14:20   ` Michal Hocko
2019-01-29 19:26     ` Dan Williams
2019-01-29 20:04     ` Dan Williams
2019-01-07 23:21 ` [PATCH v7 2/3] mm: Move buddy list manipulations into helpers Dan Williams
2019-01-25 14:30   ` Michal Hocko
2019-01-29 19:27     ` Dan Williams
     [not found] ` <154690328135.676627.5979130839159447106.stgit@dwillia2-desk3.amr.corp.intel.com>
2019-01-08  0:19   ` [PATCH v7 3/3] mm: Maintain randomization of page free lists Kees Cook
2019-01-25 14:32   ` Michal Hocko
