Date: Sun, 16 Dec 2018 14:43:35 +0200
From: Mike Rapoport
To: Dan Williams
Cc: akpm@linux-foundation.org, Michal Hocko, Kees Cook, Dave Hansen,
        peterz@infradead.org, linux-mm@kvack.org, x86@kernel.org,
        linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 3/5] mm: Shuffle initial free memory to improve memory-side-cache utilization
Message-Id: <20181216124335.GB30212@rapoport-lnx>
In-Reply-To: <154483852617.1672629.2068988045031389440.stgit@dwillia2-desk3.amr.corp.intel.com>
References: <154483851047.1672629.15001135860756738866.stgit@dwillia2-desk3.amr.corp.intel.com>
 <154483852617.1672629.2068988045031389440.stgit@dwillia2-desk3.amr.corp.intel.com>

On Fri, Dec 14, 2018 at 05:48:46PM -0800, Dan Williams wrote:
> Randomization of the page allocator improves the average utilization of
> a direct-mapped memory-side-cache. Memory side caching is a platform
> capability that Linux has been previously exposed to in HPC
> (high-performance computing) environments on specialty platforms. In
> that instance it was a smaller pool of high-bandwidth-memory relative to
> higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> be found on general purpose server platforms where DRAM is a cache in
> front of higher latency persistent memory [1].
>
> Robert offered an explanation of the state of the art of Linux
> interactions with memory-side-caches [2], and I copy it here:
>
> It's been a problem in the HPC space:
> http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
>
> A kernel module called zonesort is available to try to help:
> https://software.intel.com/en-us/articles/xeon-phi-software
>
> and this abandoned patch series proposed that for the kernel:
> https://lkml.org/lkml/2017/8/23/195
>
> Dan's patch series doesn't attempt to ensure buffers won't conflict, but
> also reduces the chance that the buffers will. This will make performance
> more consistent, albeit slower than "optimal" (which is near impossible
> to attain in a general-purpose kernel). That's better than forcing
> users to deploy remedies like:
> "To eliminate this gradual degradation, we have added a Stream
> measurement to the Node Health Check that follows each job;
> nodes are rebooted whenever their measured memory bandwidth
> falls below 300 GB/s."
>
> A replacement for zonesort was merged upstream in commit cc9aec03e58f
> "x86/numa_emulation: Introduce uniform split capability". With this
> numa_emulation capability, memory can be split into cache sized
> ("near-memory" sized) numa nodes. A bind operation to such a node, and
> disabling workloads on other nodes, enables full cache performance.
> However, once the workload exceeds the cache size then cache conflicts
> are unavoidable. While HPC environments might be able to tolerate
> time-scheduling of cache sized workloads, for general purpose server
> platforms, the oversubscribed cache case will be the common case.
>
> The worst case scenario is that a server system owner benchmarks a
> workload at boot with an un-contended cache only to see that performance
> degrade over time, even below the average cache performance due to
> excessive conflicts. Randomization clips the peaks and fills in the
> valleys of cache utilization to yield steady average performance.
>
> Here are some performance impact details of the patches:
>
> 1/ An Intel internal synthetic memory bandwidth measurement tool saw a
> 3X speedup in a contrived case that tries to force cache conflicts. The
> contrived case used the numa_emulation capability to force an instance
> of the benchmark to be run in two of the near-memory sized numa nodes.
> If both instances were placed on the same emulated node they would fit
> and cause zero conflicts. While on separate emulated nodes without
> randomization they underutilized the cache and conflicted unnecessarily
> due to the in-order allocation per node.
>
> 2/ A well known Java server application benchmark was run with a heap
> size that exceeded cache size by 3X. The cache conflict rate was 8% for
> the first run and degraded to 21% after page allocator aging. With
> randomization enabled the rate levelled out at 11%.
>
> 3/ A MongoDB workload did not observe measurable difference in
> cache-conflict rates, but the overall throughput dropped by 7% with
> randomization in one case.
>
> 4/ Mel Gorman ran his suite of performance workloads with randomization
> enabled on platforms without a memory-side-cache and saw a mix of some
> improvements and some losses [3].
>
> While there is potentially significant improvement for applications that
> depend on low latency access across a wide working-set, the performance
> may be negligible to negative for other workloads.
> For this reason the shuffle capability defaults to off unless a
> direct-mapped memory-side-cache is detected. Even then, the
> page_alloc.shuffle=0 parameter can be specified to disable the
> randomization on those systems.
>
> Outside of memory-side-cache utilization concerns there is potentially
> security benefit from randomization. Some data exfiltration and
> return-oriented-programming attacks rely on the ability to infer the
> location of sensitive data objects. The kernel page allocator,
> especially early in system boot, has predictable first-in-first-out
> behavior for physical pages. Pages are freed in physical address order
> when first onlined.
>
> Quoting Kees:
> "While we already have a base-address randomization
> (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
> memory layouts would certainly be using the predictability of
> allocation ordering (i.e. for attacks where the base address isn't
> important: only the relative positions between allocated memory).
> This is common in lots of heap-style attacks. They try to gain
> control over ordering by spraying allocations, etc.
>
> I'd really like to see this because it gives us something similar
> to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
>
> While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
> caches, it leaves the vast bulk of memory to be predictably allocated in
> order. However, it should be noted, the concrete security benefits are
> hard to quantify, and no known CVE is mitigated by this randomization.
>
> Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> when they are initially populated with free memory at boot and at
> hotplug time.
>
> The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
> pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
> order 10 (4MB); this trades off randomization granularity for time
> spent shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the
> page allocator while still showing memory-side cache behavior
> improvements, and the expectation that the security implications of
> finer granularity randomization are mitigated by
> CONFIG_SLAB_FREELIST_RANDOM.
>
> The performance impact of the shuffling appears to be in the noise
> compared to other memory initialization work. Also the bulk of the work
> is done in the background as a part of deferred_init_memmap().
>
> This initial randomization can be undone over time so a follow-on patch
> is introduced to inject entropy on page free decisions. It is reasonable
> to ask if the page free entropy is sufficient, but it is not enough due
> to the in-order initial freeing of pages. At the start of that process
> putting page1 in front or behind page0 still keeps them close together,
> page2 is still near page1 and has a high chance of being adjacent. As
> more pages are added ordering diversity improves, but there is still
> high page locality for the low address pages and this leads to no
> significant impact to the cache conflict rate.
>
> [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> [2]: https://lkml.org/lkml/2018/9/22/54
> [3]: https://lkml.org/lkml/2018/10/12/309
>
> Cc: Michal Hocko
> Cc: Kees Cook
> Cc: Dave Hansen
> Signed-off-by: Dan Williams
> ---
>  include/linux/list.h   |   17 ++++
>  include/linux/mm.h     |   38 +++++++++
>  include/linux/mmzone.h |    4 +
>  init/Kconfig           |   36 ++++++++
>  mm/Makefile            |    7 +-
>  mm/memblock.c          |   21 ++++-
>  mm/memory_hotplug.c    |    2
>  mm/page_alloc.c        |    2
>  mm/shuffle.c           |  206 ++++++++++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 329 insertions(+), 4 deletions(-)
>  create mode 100644 mm/shuffle.c
>
> diff --git a/include/linux/list.h b/include/linux/list.h
> index edb7628e46ed..3dfb8953f241 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
>  	INIT_LIST_HEAD(old);
>  }
>
> +/**
> + * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
> + * @entry1: the location to place entry2
> + * @entry2: the location to place entry1
> + */
> +static inline void list_swap(struct list_head *entry1,
> +		struct list_head *entry2)
> +{
> +	struct list_head *pos = entry2->prev;
> +
> +	list_del(entry2);
> +	list_replace(entry1, entry2);
> +	if (pos == entry1)
> +		pos = entry2;
> +	list_add(entry1, pos);
> +}
> +
>  /**
>   * list_del_init - deletes entry from list and reinitialize it.
>   * @entry: the element to delete from the list.
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5411de93a363..f9647779e82b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2069,6 +2069,44 @@ extern void mem_init_print_info(const char *str);
>
>  extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
>
> +#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
> +DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +extern void page_alloc_shuffle_enable(void);
> +extern void __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
> +		unsigned long end_pfn);
> +static inline void shuffle_free_memory(pg_data_t *pgdat,
> +		unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	if (!static_branch_unlikely(&page_alloc_shuffle_key))
> +		return;
> +	__shuffle_free_memory(pgdat, start_pfn, end_pfn);
> +}
> +
> +extern void __shuffle_zone(struct zone *z, unsigned long start_pfn,
> +		unsigned long end_pfn);
> +static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
> +		unsigned long end_pfn)
> +{
> +	if (!static_branch_unlikely(&page_alloc_shuffle_key))
> +		return;
> +	__shuffle_zone(z, start_pfn, end_pfn);
> +}
> +#else
> +static inline void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
> +		unsigned long end_pfn)
> +{
> +}
> +
> +static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
> +		unsigned long end_pfn)
> +{
> +}
> +
> +static inline void page_alloc_shuffle_enable(void)
> +{
> +}
> +#endif
> +
>  /* Free the reserved page into the buddy system, so it gets managed. */
>  static inline void __free_reserved_page(struct page *page)
>  {
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 847705a6d0ec..eafa66d66232 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1266,6 +1266,10 @@ void sparse_init(void);
>  #else
>  #define sparse_init()	do {} while (0)
>  #define sparse_index_init(_sec, _nid)  do {} while (0)
> +static inline int pfn_present(unsigned long pfn)
> +{
> +	return 1;
> +}
>  #endif /* CONFIG_SPARSEMEM */
>
>  /*
> diff --git a/init/Kconfig b/init/Kconfig
> index cf5b5a0dcbc2..fa6812d995ec 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1720,6 +1720,42 @@ config SLAB_FREELIST_HARDENED
>  	  sacrifices to harden the kernel slab allocator against common
>  	  freelist exploit methods.
>
> +config SHUFFLE_PAGE_ALLOCATOR
> +	bool "Page allocator randomization"
> +	depends on HAVE_MEMBLOCK_CACHE_INFO
> +	default SLAB_FREELIST_RANDOM
> +	help
> +	  Randomization of the page allocator improves the average
> +	  utilization of a direct-mapped memory-side-cache. See section
> +	  5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
> +	  6.2a specification for an example of how a platform advertises
> +	  the presence of a memory-side-cache. There are also incidental
> +	  security benefits as it reduces the predictability of page
> +	  allocations to compliment SLAB_FREELIST_RANDOM, but the
> +	  default granularity of shuffling on 4MB (MAX_ORDER) pages is
> +	  selected based on cache utilization benefits.
> +
> +	  While the randomization improves cache utilization it may
> +	  negatively impact workloads on platforms without a cache. For
> +	  this reason, by default, the randomization is enabled only
> +	  after runtime detection of a direct-mapped memory-side-cache.
> +	  Otherwise, the randomization may be force enabled with the
> +	  'page_alloc.shuffle' kernel command line parameter.
> +
> +	  Say Y if unsure.
> +
> +config SHUFFLE_PAGE_ORDER
> +	depends on SHUFFLE_PAGE_ALLOCATOR
> +	int "Page allocator shuffle order"
> +	range 0 10
> +	default 10
> +	help
> +	  Specify the granularity at which shuffling (randomization) is
> +	  performed. By default this is set to MAX_ORDER-1 to minimize
> +	  runtime impact of randomization and with the expectation that
> +	  SLAB_FREELIST_RANDOM mitigates heap attacks on smaller
> +	  object granularities.
> +
>  config SLUB_CPU_PARTIAL
>  	default y
>  	depends on SLUB && SMP
> diff --git a/mm/Makefile b/mm/Makefile
> index d210cc9d6f80..ac5e5ba78874 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,7 +33,7 @@ mmu-$(CONFIG_MMU)	+= process_vm_access.o
>  endif
>
>  obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
> -			   maccess.o page_alloc.o page-writeback.o \
> +			   maccess.o page-writeback.o \
>  			   readahead.o swap.o truncate.o vmscan.o shmem.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
> @@ -41,6 +41,11 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   interval_tree.o list_lru.o workingset.o \
>  			   debug.o $(mmu-y)
>
> +# Give 'page_alloc' its own module-parameter namespace
> +page-alloc-y := page_alloc.o
> +page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> +
> +obj-y += page-alloc.o
>  obj-y += init-mm.o
>  obj-y += memblock.o
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 185bfd4e87bb..fd617928ccc1 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -834,8 +834,16 @@ int __init_memblock memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
>  		return ret;
>
>  	for (i = start_rgn; i < end_rgn; i++) {
> -		type->regions[i].cache_size = cache_size;
> -		type->regions[i].direct_mapped = direct_mapped;
> +		struct memblock_region *r = &type->regions[i];
> +
> +		r->cache_size = cache_size;
> +		r->direct_mapped = direct_mapped;

I think this change can be merged into the previous patch.

> +		/*
> +		 * Enable randomization for amortizing direct-mapped
> +		 * memory-side-cache conflicts.
> +		 */
> +		if (r->size > r->cache_size && r->direct_mapped)
> +			page_alloc_shuffle_enable();

It seems that this is the only use for ->direct_mapped in the memblock
code. Wouldn't cache_size != 0 suffice? I.e., in the code that sets the
memblock region attributes, cache_size can be set to 0 for the non
direct-mapped caches, can't it? (See the P.S. at the end of this mail
for a rough sketch of what I mean.)

>  	}
>
>  	return 0;
> @@ -1957,9 +1965,16 @@ static unsigned long __init free_low_memory_core_early(void)
>  	 *  low ram will be on Node1
>  	 */
>  	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
> -				NULL)
> +				NULL) {
> +		pg_data_t *pgdat;
> +
>  		count += __free_memory_core(start, end);
>
> +		for_each_online_pgdat(pgdat)
> +			shuffle_free_memory(pgdat, PHYS_PFN(start),
> +					PHYS_PFN(end));
> +	}
> +
>  	return count;
>  }
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2b2b3ccbbfb5..fb17c54d47b0 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -895,6 +895,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	zone->zone_pgdat->node_present_pages += onlined_pages;
>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>
> +	shuffle_zone(zone, pfn, zone_end_pfn(zone));
> +
>  	if (onlined_pages) {
>  		node_states_set_node(nid, &arg);
>  		if (need_zonelists_rebuild)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2ec9cc407216..a51031cf16fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1595,6 +1595,8 @@ static int __init deferred_init_memmap(void *data)
>  	}
>  	pgdat_resize_unlock(pgdat, &flags);
>
> +	shuffle_zone(zone, first_init_pfn, zone_end_pfn(zone));
> +
>  	/* Sanity check that the next zone really is unpopulated */
>  	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> new file mode 100644
> index 000000000000..621fde268d01
> --- /dev/null
> +++ b/mm/shuffle.c
> @@ -0,0 +1,206 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> +
> +#include <linux/mm.h>
> +#include <linux/init.h>
> +#include <linux/mmzone.h>
> +#include <linux/random.h>
> +#include <linux/moduleparam.h>
> +#include "internal.h"
> +
> +DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +static unsigned long shuffle_state;
> +enum {
> +	SHUFFLE_ENABLE,
> +	SHUFFLE_FORCE_DISABLE,
> +};
> +
> +void page_alloc_shuffle_enable(void)
> +{
> +	if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
> +		/*
> +		 * If platform cache auto-detect runs before module
> +		 * parameter parsing, disable the shuffle implementation
> +		 */
> +		if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
> +			static_branch_disable(&page_alloc_shuffle_key);
> +	} else if (!test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
> +		static_branch_enable(&page_alloc_shuffle_key);
> +}
> +
> +static bool shuffle_param;
> +static int force_shuffle(const char *val, const struct kernel_param *kp)
> +{
> +	int rc = param_set_bool(val, kp);
> +
> +	if (rc < 0)
> +		return rc;
> +	if (shuffle_param)
> +		page_alloc_shuffle_enable();
> +	else
> +		set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
> +	return 0;
> +}
> +module_param_call(shuffle, force_shuffle, param_get_bool, &shuffle_param, 0400);
> +
> +/*
> + * For two pages to be swapped in the shuffle, they must be free (on a
> + * 'free_area' lru), have the same order, and have the same migratetype.
> + */
> +static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
> +{
> +	struct page *page;
> +
> +	/*
> +	 * Given we're dealing with randomly selected pfns in a zone we
> +	 * need to ask questions like...
> +	 */
> +
> +	/* ...is the pfn even in the memmap? */
> +	if (!pfn_valid_within(pfn))
> +		return NULL;
> +
> +	/* ...is the pfn in a present section or a hole? */
> +	if (!pfn_present(pfn))
> +		return NULL;
> +
> +	/* ...is the page free and currently on a free_area list? */
> +	page = pfn_to_page(pfn);
> +	if (!PageBuddy(page))
> +		return NULL;
> +
> +	/*
> +	 * ...is the page on the same list as the page we will
> +	 * shuffle it with?
> +	 */
> +	if (page_order(page) != order)
> +		return NULL;
> +
> +	return page;
> +}
> +
> +/*
> + * Fisher-Yates shuffle the freelist which prescribes iterating through
> + * an array, pfns in this case, and randomly swapping each entry with
> + * another in the span, end_pfn - start_pfn.
> + *
> + * To keep the implementation simple it does not attempt to correct for
> + * sources of bias in the distribution, like modulo bias or
> + * pseudo-random number generator bias. I.e. the expectation is that
> + * this shuffling raises the bar for attacks that exploit the
> + * predictability of page allocations, but need not be a perfect
> + * shuffle.
> + *
> + * Note that we don't use @z->zone_start_pfn and zone_end_pfn(@z)
> + * directly since the caller may be aware of holes in the zone and can
> + * improve the accuracy of the random pfn selection.
> + */
> +#define SHUFFLE_RETRY 10
> +static void __meminit shuffle_zone_order(struct zone *z, unsigned long start_pfn,
> +		unsigned long end_pfn, const int order)
> +{
> +	unsigned long i, flags;
> +	const int order_pages = 1 << order;
> +
> +	if (start_pfn < z->zone_start_pfn)
> +		start_pfn = z->zone_start_pfn;
> +	if (end_pfn > zone_end_pfn(z))
> +		end_pfn = zone_end_pfn(z);
> +
> +	/* probably means that start/end were outside the zone */
> +	if (end_pfn <= start_pfn)
> +		return;
> +	spin_lock_irqsave(&z->lock, flags);
> +	start_pfn = ALIGN(start_pfn, order_pages);
> +	for (i = start_pfn; i < end_pfn; i += order_pages) {
> +		unsigned long j;
> +		int migratetype, retry;
> +		struct page *page_i, *page_j;
> +
> +		/*
> +		 * We expect page_i, in the sub-range of a zone being
> +		 * added (@start_pfn to @end_pfn), to more likely be
> +		 * valid compared to page_j randomly selected in the
> +		 * span @zone_start_pfn to @spanned_pages.
> +		 */
> +		page_i = shuffle_valid_page(i, order);
> +		if (!page_i)
> +			continue;
> +
> +		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
> +			/*
> +			 * Pick a random order aligned page from the
> +			 * start of the zone. Use the *whole* zone here
> +			 * so that if it is freed in tiny pieces that we
> +			 * randomize in the whole zone, not just within
> +			 * those fragments.
> +			 *
> +			 * Since page_j comes from a potentially sparse
> +			 * address range we want to try a bit harder to
> +			 * find a shuffle point for page_i.
> +			 */
> +			j = z->zone_start_pfn +
> +				ALIGN_DOWN(get_random_long() % z->spanned_pages,
> +						order_pages);
> +			page_j = shuffle_valid_page(j, order);
> +			if (page_j && page_j != page_i)
> +				break;
> +		}
> +		if (retry >= SHUFFLE_RETRY) {
> +			pr_debug("%s: failed to swap %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		/*
> +		 * Each migratetype corresponds to its own list, make
> +		 * sure the types match otherwise we're moving pages to
> +		 * lists where they do not belong.
> +		 */
> +		migratetype = get_pageblock_migratetype(page_i);
> +		if (get_pageblock_migratetype(page_j) != migratetype) {
> +			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		list_swap(&page_i->lru, &page_j->lru);
> +
> +		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
> +
> +		/* take it easy on the zone lock */
> +		if ((i % (100 * order_pages)) == 0) {
> +			spin_unlock_irqrestore(&z->lock, flags);
> +			cond_resched();
> +			spin_lock_irqsave(&z->lock, flags);
> +		}
> +	}
> +	spin_unlock_irqrestore(&z->lock, flags);
> +}
> +
> +void __meminit __shuffle_zone(struct zone *z, unsigned long start_pfn,
> +		unsigned long end_pfn)
> +{
> +	int i;
> +
> +	/* shuffle all the orders at the specified order and higher */
> +	for (i = CONFIG_SHUFFLE_PAGE_ORDER; i < MAX_ORDER; i++)
> +		shuffle_zone_order(z, start_pfn, end_pfn, i);
> +}
> +
> +/**
> + * shuffle_free_memory - reduce the predictability of the page allocator
> + * @pgdat: node page data
> + * @start_pfn: Limit the shuffle to the greater of this value or zone start
> + * @end_pfn: Limit the shuffle to the less of this value or zone end
> + *
> + * While shuffle_zone() attempts to avoid holes with pfn_valid() and
> + * pfn_present() they can not report sub-section sized holes. @start_pfn
> + * and @end_pfn limit the shuffle to the exact memory pages being freed.
> + */
> +void __meminit __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
> +		unsigned long end_pfn)
> +{
> +	struct zone *z;
> +
> +	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> +		shuffle_zone(z, start_pfn, end_pfn);
> +}
> --

Sincerely yours,
Mike.
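P.S. A rough, untested sketch of the cache_size-only check I have in
mind, assuming the code from the previous patch that calls
memblock_set_sidecache() passes cache_size == 0 whenever the cache is
not direct-mapped:

	for (i = start_rgn; i < end_rgn; i++) {
		struct memblock_region *r = &type->regions[i];

		r->cache_size = cache_size;

		/*
		 * A non-zero cache size now implies a direct-mapped
		 * memory-side-cache, so no separate flag is needed to
		 * decide whether to enable shuffling.
		 */
		if (r->cache_size && r->size > r->cache_size)
			page_alloc_shuffle_enable();
	}

Then ->direct_mapped could be dropped from struct memblock_region
entirely.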