linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/5] mm: Randomize free memory
@ 2018-12-15  1:48 Dan Williams
  2018-12-15  1:48 ` [PATCH v5 1/5] acpi: Create subtable parsing infrastructure Dan Williams
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Dan Williams @ 2018-12-15  1:48 UTC (permalink / raw)
  To: akpm
  Cc: Rafael J. Wysocki, Keith Busch, Mike Rapoport, Kees Cook, x86,
	Michal Hocko, Dave Hansen, Peter Zijlstra, Rafael J. Wysocki,
	Andy Lutomirski, linux-mm, x86, linux-kernel

Changes since v4: [1]
* Default the randomization to off and enable it dynamically based on
  the detection of a memory side cache advertised by platform firmware.
  In the case of x86 this enumeration comes from the ACPI HMAT. (Michal
  and Mel)
* Improve the changelog of the patch that introduces the shuffling to
  clarify the motivation and better explain the tradeoffs. (Michal and
  Mel)
* Include the required HMAT enabling in the series.

[1]: https://lkml.kernel.org/r/153922180166.838512.8260339805733812034.stgit@dwillia2-desk3.amr.corp.intel.com

---

Quote patch 3:

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [2].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [3], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.org/lkml/2017/8/23/195

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
"x86/numa_emulation: Introduce uniform split capability". With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

See patch 3 for more details.

[2]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[3]: https://lkml.org/lkml/2018/9/22/54

---                                                                                        
                                                                                           
Dan Williams (3):
      mm: Shuffle initial free memory to improve memory-side-cache utilization
      mm: Move buddy list manipulations into helpers
      mm: Maintain randomization of page free lists

Keith Busch (2):
      acpi: Create subtable parsing infrastructure
      acpi/numa: Set the memory-side-cache size in memblocks


 arch/ia64/kernel/acpi.c                       |   12 +
 arch/x86/Kconfig                              |    1 
 arch/x86/kernel/acpi/boot.c                   |   36 ++--
 drivers/acpi/numa.c                           |   48 +++++
 drivers/acpi/scan.c                           |    4 
 drivers/acpi/tables.c                         |   67 ++++++--
 drivers/irqchip/irq-gic-v2m.c                 |    2 
 drivers/irqchip/irq-gic-v3-its-pci-msi.c      |    2 
 drivers/irqchip/irq-gic-v3-its-platform-msi.c |    2 
 drivers/irqchip/irq-gic-v3-its.c              |    6 -
 drivers/irqchip/irq-gic-v3.c                  |    8 -
 drivers/irqchip/irq-gic.c                     |    4 
 drivers/mailbox/pcc.c                         |    2 
 include/linux/acpi.h                          |    5 -
 include/linux/list.h                          |   17 ++
 include/linux/memblock.h                      |   36 ++++
 include/linux/mm.h                            |   53 ++++++
 include/linux/mm_types.h                      |    3 
 include/linux/mmzone.h                        |   65 +++++++
 init/Kconfig                                  |   36 ++++
 mm/Kconfig                                    |    3 
 mm/Makefile                                   |    7 +
 mm/compaction.c                               |    4 
 mm/memblock.c                                 |   37 ++++
 mm/memory_hotplug.c                           |    2 
 mm/page_alloc.c                               |   81 ++++-----
 mm/shuffle.c                                  |  222 +++++++++++++++++++++++++
 27 files changed, 657 insertions(+), 108 deletions(-)
 create mode 100644 mm/shuffle.c

--
Signature

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v5 1/5] acpi: Create subtable parsing infrastructure
  2018-12-15  1:48 [PATCH v5 0/5] mm: Randomize free memory Dan Williams
@ 2018-12-15  1:48 ` Dan Williams
  2018-12-15  1:48 ` [PATCH v5 2/5] acpi/numa: Set the memory-side-cache size in memblocks Dan Williams
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Dan Williams @ 2018-12-15  1:48 UTC (permalink / raw)
  To: akpm
  Cc: Rafael J. Wysocki, Keith Busch, peterz, dave.hansen, linux-mm,
	x86, linux-kernel

From: Keith Busch <keith.busch@intel.com>

Parsing entries in an ACPI table had assumed a generic header
structure. There is no standard ACPI header, though, so less common
layouts with different field sizes required custom parsers to go through
their subtable entry list.

Create the infrastructure for adding different table types so parsing
the entries array may be more reused for all ACPI system tables so that
the common code doesn't need to be duplicated.

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/ia64/kernel/acpi.c                       |   12 ++--
 arch/x86/kernel/acpi/boot.c                   |   36 +++++++------
 drivers/acpi/numa.c                           |   16 +++---
 drivers/acpi/scan.c                           |    4 +
 drivers/acpi/tables.c                         |   67 +++++++++++++++++++++----
 drivers/irqchip/irq-gic-v2m.c                 |    2 -
 drivers/irqchip/irq-gic-v3-its-pci-msi.c      |    2 -
 drivers/irqchip/irq-gic-v3-its-platform-msi.c |    2 -
 drivers/irqchip/irq-gic-v3-its.c              |    6 +-
 drivers/irqchip/irq-gic-v3.c                  |    8 +--
 drivers/irqchip/irq-gic.c                     |    4 +
 drivers/mailbox/pcc.c                         |    2 -
 include/linux/acpi.h                          |    5 +-
 13 files changed, 108 insertions(+), 58 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index 41eb281709da..3973d2c2a9b0 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -177,7 +177,7 @@ struct acpi_table_madt *acpi_madt __initdata;
 static u8 has_8259;
 
 static int __init
-acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
+acpi_parse_lapic_addr_ovr(union acpi_subtable_headers * header,
 			  const unsigned long end)
 {
 	struct acpi_madt_local_apic_override *lapic;
@@ -216,7 +216,7 @@ acpi_parse_lsapic(struct acpi_subtable_header * header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_local_apic_nmi *lacpi_nmi;
 
@@ -230,7 +230,7 @@ acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long e
 }
 
 static int __init
-acpi_parse_iosapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_iosapic(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_io_sapic *iosapic;
 
@@ -245,7 +245,7 @@ acpi_parse_iosapic(struct acpi_subtable_header * header, const unsigned long end
 static unsigned int __initdata acpi_madt_rev;
 
 static int __init
-acpi_parse_plat_int_src(struct acpi_subtable_header * header,
+acpi_parse_plat_int_src(union acpi_subtable_headers * header,
 			const unsigned long end)
 {
 	struct acpi_madt_interrupt_source *plintsrc;
@@ -329,7 +329,7 @@ unsigned int get_cpei_target_cpu(void)
 }
 
 static int __init
-acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
+acpi_parse_int_src_ovr(union acpi_subtable_headers * header,
 		       const unsigned long end)
 {
 	struct acpi_madt_interrupt_override *p;
@@ -350,7 +350,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
 }
 
 static int __init
-acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_nmi_src(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_nmi_source *nmi_src;
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 06635fbca81c..58561b4df09d 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -197,7 +197,7 @@ static int acpi_register_lapic(int id, u32 acpiid, u8 enabled)
 }
 
 static int __init
-acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
+acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end)
 {
 	struct acpi_madt_local_x2apic *processor = NULL;
 #ifdef CONFIG_X86_X2APIC
@@ -210,7 +210,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
 	if (BAD_MADT_ENTRY(processor, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 #ifdef CONFIG_X86_X2APIC
 	apic_id = processor->local_apic_id;
@@ -242,7 +242,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_local_apic *processor = NULL;
 
@@ -251,7 +251,7 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
 	if (BAD_MADT_ENTRY(processor, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	/* Ignore invalid ID */
 	if (processor->id == 0xff)
@@ -272,7 +272,7 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
+acpi_parse_sapic(union acpi_subtable_headers *header, const unsigned long end)
 {
 	struct acpi_madt_local_sapic *processor = NULL;
 
@@ -281,7 +281,7 @@ acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
 	if (BAD_MADT_ENTRY(processor, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	acpi_register_lapic((processor->id << 8) | processor->eid,/* APIC ID */
 			    processor->processor_id, /* ACPI ID */
@@ -291,7 +291,7 @@ acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
+acpi_parse_lapic_addr_ovr(union acpi_subtable_headers * header,
 			  const unsigned long end)
 {
 	struct acpi_madt_local_apic_override *lapic_addr_ovr = NULL;
@@ -301,7 +301,7 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
 	if (BAD_MADT_ENTRY(lapic_addr_ovr, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	acpi_lapic_addr = lapic_addr_ovr->address;
 
@@ -309,7 +309,7 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
 }
 
 static int __init
-acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
+acpi_parse_x2apic_nmi(union acpi_subtable_headers *header,
 		      const unsigned long end)
 {
 	struct acpi_madt_local_x2apic_nmi *x2apic_nmi = NULL;
@@ -319,7 +319,7 @@ acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
 	if (BAD_MADT_ENTRY(x2apic_nmi, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	if (x2apic_nmi->lint != 1)
 		printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n");
@@ -328,7 +328,7 @@ acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
 }
 
 static int __init
-acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_local_apic_nmi *lapic_nmi = NULL;
 
@@ -337,7 +337,7 @@ acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long e
 	if (BAD_MADT_ENTRY(lapic_nmi, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	if (lapic_nmi->lint != 1)
 		printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n");
@@ -449,7 +449,7 @@ static int __init mp_register_ioapic_irq(u8 bus_irq, u8 polarity,
 }
 
 static int __init
-acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_ioapic(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_io_apic *ioapic = NULL;
 	struct ioapic_domain_cfg cfg = {
@@ -462,7 +462,7 @@ acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end)
 	if (BAD_MADT_ENTRY(ioapic, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	/* Statically assign IRQ numbers for IOAPICs hosting legacy IRQs */
 	if (ioapic->global_irq_base < nr_legacy_irqs())
@@ -508,7 +508,7 @@ static void __init acpi_sci_ioapic_setup(u8 bus_irq, u16 polarity, u16 trigger,
 }
 
 static int __init
-acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
+acpi_parse_int_src_ovr(union acpi_subtable_headers * header,
 		       const unsigned long end)
 {
 	struct acpi_madt_interrupt_override *intsrc = NULL;
@@ -518,7 +518,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
 	if (BAD_MADT_ENTRY(intsrc, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	if (intsrc->source_irq == acpi_gbl_FADT.sci_interrupt) {
 		acpi_sci_ioapic_setup(intsrc->source_irq,
@@ -550,7 +550,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
 }
 
 static int __init
-acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_nmi_src(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_nmi_source *nmi_src = NULL;
 
@@ -559,7 +559,7 @@ acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end
 	if (BAD_MADT_ENTRY(nmi_src, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	/* TBD: Support nimsrc entries? */
 
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 274699463b4f..f5e09c39ff22 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -338,7 +338,7 @@ acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
 }
 
 static int __init
-acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
+acpi_parse_x2apic_affinity(union acpi_subtable_headers *header,
 			   const unsigned long end)
 {
 	struct acpi_srat_x2apic_cpu_affinity *processor_affinity;
@@ -347,7 +347,7 @@ acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
 	if (!processor_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	acpi_numa_x2apic_affinity_init(processor_affinity);
@@ -356,7 +356,7 @@ acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
 }
 
 static int __init
-acpi_parse_processor_affinity(struct acpi_subtable_header *header,
+acpi_parse_processor_affinity(union acpi_subtable_headers *header,
 			      const unsigned long end)
 {
 	struct acpi_srat_cpu_affinity *processor_affinity;
@@ -365,7 +365,7 @@ acpi_parse_processor_affinity(struct acpi_subtable_header *header,
 	if (!processor_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	acpi_numa_processor_affinity_init(processor_affinity);
@@ -374,7 +374,7 @@ acpi_parse_processor_affinity(struct acpi_subtable_header *header,
 }
 
 static int __init
-acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
+acpi_parse_gicc_affinity(union acpi_subtable_headers *header,
 			 const unsigned long end)
 {
 	struct acpi_srat_gicc_affinity *processor_affinity;
@@ -383,7 +383,7 @@ acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
 	if (!processor_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	acpi_numa_gicc_affinity_init(processor_affinity);
@@ -394,7 +394,7 @@ acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
 static int __initdata parsed_numa_memblks;
 
 static int __init
-acpi_parse_memory_affinity(struct acpi_subtable_header * header,
+acpi_parse_memory_affinity(union acpi_subtable_headers * header,
 			   const unsigned long end)
 {
 	struct acpi_srat_mem_affinity *memory_affinity;
@@ -403,7 +403,7 @@ acpi_parse_memory_affinity(struct acpi_subtable_header * header,
 	if (!memory_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	if (!acpi_numa_memory_affinity_init(memory_affinity))
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index bd1c59fb0e17..d98d5da6a279 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -2234,10 +2234,10 @@ static struct acpi_probe_entry *ape;
 static int acpi_probe_count;
 static DEFINE_MUTEX(acpi_probe_mutex);
 
-static int __init acpi_match_madt(struct acpi_subtable_header *header,
+static int __init acpi_match_madt(union acpi_subtable_headers *header,
 				  const unsigned long end)
 {
-	if (!ape->subtable_valid || ape->subtable_valid(header, ape))
+	if (!ape->subtable_valid || ape->subtable_valid(&header->common, ape))
 		if (!ape->probe_subtbl(header, end))
 			acpi_probe_count++;
 
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 61203eebf3a1..e9643b4267c7 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -49,6 +49,15 @@ static struct acpi_table_desc initial_tables[ACPI_MAX_TABLES] __initdata;
 
 static int acpi_apic_instance __initdata;
 
+enum acpi_subtable_type {
+	ACPI_SUBTABLE_COMMON,
+};
+
+struct acpi_subtable_entry {
+	union acpi_subtable_headers *hdr;
+	enum acpi_subtable_type type;
+};
+
 /*
  * Disable table checksum verification for the early stage due to the size
  * limitation of the current x86 early mapping implementation.
@@ -217,6 +226,42 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
 	}
 }
 
+static unsigned long __init
+acpi_get_entry_type(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return entry->hdr->common.type;
+	}
+	return 0;
+}
+
+static unsigned long __init
+acpi_get_entry_length(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return entry->hdr->common.length;
+	}
+	return 0;
+}
+
+static unsigned long __init
+acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return sizeof(entry->hdr->common);
+	}
+	return 0;
+}
+
+static enum acpi_subtable_type __init
+acpi_get_subtable_type(char *id)
+{
+	return ACPI_SUBTABLE_COMMON;
+}
+
 /**
  * acpi_parse_entries_array - for each proc_num find a suitable subtable
  *
@@ -246,8 +291,8 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 		struct acpi_subtable_proc *proc, int proc_num,
 		unsigned int max_entries)
 {
-	struct acpi_subtable_header *entry;
-	unsigned long table_end;
+	struct acpi_subtable_entry entry;
+	unsigned long table_end, subtable_len, entry_len;
 	int count = 0;
 	int errs = 0;
 	int i;
@@ -270,19 +315,20 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 
 	/* Parse all entries looking for a match. */
 
-	entry = (struct acpi_subtable_header *)
+	entry.type = acpi_get_subtable_type(id);
+	entry.hdr = (union acpi_subtable_headers *)
 	    ((unsigned long)table_header + table_size);
+	subtable_len = acpi_get_subtable_header_length(&entry);
 
-	while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) <
-	       table_end) {
+	while (((unsigned long)entry.hdr) + subtable_len  < table_end) {
 		if (max_entries && count >= max_entries)
 			break;
 
 		for (i = 0; i < proc_num; i++) {
-			if (entry->type != proc[i].id)
+			if (acpi_get_entry_type(&entry) != proc[i].id)
 				continue;
 			if (!proc[i].handler ||
-			     (!errs && proc[i].handler(entry, table_end))) {
+			     (!errs && proc[i].handler(entry.hdr, table_end))) {
 				errs++;
 				continue;
 			}
@@ -297,13 +343,14 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 		 * If entry->length is 0, break from this loop to avoid
 		 * infinite loop.
 		 */
-		if (entry->length == 0) {
+		entry_len = acpi_get_entry_length(&entry);
+		if (entry_len == 0) {
 			pr_err("[%4.4s:0x%02x] Invalid zero length\n", id, proc->id);
 			return -EINVAL;
 		}
 
-		entry = (struct acpi_subtable_header *)
-		    ((unsigned long)entry + entry->length);
+		entry.hdr = (union acpi_subtable_headers *)
+		    ((unsigned long)entry.hdr + entry_len);
 	}
 
 	if (max_entries && count > max_entries) {
diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c
index f5fe0100f9ff..de14e06fd9ec 100644
--- a/drivers/irqchip/irq-gic-v2m.c
+++ b/drivers/irqchip/irq-gic-v2m.c
@@ -446,7 +446,7 @@ static struct fwnode_handle *gicv2m_get_fwnode(struct device *dev)
 }
 
 static int __init
-acpi_parse_madt_msi(struct acpi_subtable_header *header,
+acpi_parse_madt_msi(union acpi_subtable_headers *header,
 		    const unsigned long end)
 {
 	int ret;
diff --git a/drivers/irqchip/irq-gic-v3-its-pci-msi.c b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
index 8d6d009d1d58..c81d5b81da56 100644
--- a/drivers/irqchip/irq-gic-v3-its-pci-msi.c
+++ b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
@@ -159,7 +159,7 @@ static int __init its_pci_of_msi_init(void)
 #ifdef CONFIG_ACPI
 
 static int __init
-its_pci_msi_parse_madt(struct acpi_subtable_header *header,
+its_pci_msi_parse_madt(union acpi_subtable_headers *header,
 		       const unsigned long end)
 {
 	struct acpi_madt_generic_translator *its_entry;
diff --git a/drivers/irqchip/irq-gic-v3-its-platform-msi.c b/drivers/irqchip/irq-gic-v3-its-platform-msi.c
index 7b8e87b493fe..9cdcda5bb3bd 100644
--- a/drivers/irqchip/irq-gic-v3-its-platform-msi.c
+++ b/drivers/irqchip/irq-gic-v3-its-platform-msi.c
@@ -117,7 +117,7 @@ static int __init its_pmsi_init_one(struct fwnode_handle *fwnode,
 
 #ifdef CONFIG_ACPI
 static int __init
-its_pmsi_parse_madt(struct acpi_subtable_header *header,
+its_pmsi_parse_madt(union acpi_subtable_headers *header,
 			const unsigned long end)
 {
 	struct acpi_madt_generic_translator *its_entry;
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index db20e992a40f..d6677075d68f 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -3764,13 +3764,13 @@ static int __init acpi_get_its_numa_node(u32 its_id)
 	return NUMA_NO_NODE;
 }
 
-static int __init gic_acpi_match_srat_its(struct acpi_subtable_header *header,
+static int __init gic_acpi_match_srat_its(union acpi_subtable_headers *header,
 					  const unsigned long end)
 {
 	return 0;
 }
 
-static int __init gic_acpi_parse_srat_its(struct acpi_subtable_header *header,
+static int __init gic_acpi_parse_srat_its(union acpi_subtable_headers *header,
 			 const unsigned long end)
 {
 	int node;
@@ -3837,7 +3837,7 @@ static int __init acpi_get_its_numa_node(u32 its_id) { return NUMA_NO_NODE; }
 static void __init acpi_its_srat_maps_free(void) { }
 #endif
 
-static int __init gic_acpi_parse_madt_its(struct acpi_subtable_header *header,
+static int __init gic_acpi_parse_madt_its(union acpi_subtable_headers *header,
 					  const unsigned long end)
 {
 	struct acpi_madt_generic_translator *its_entry;
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 8f87f40c9460..1729514a0578 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1365,7 +1365,7 @@ gic_acpi_register_redist(phys_addr_t phys_base, void __iomem *redist_base)
 }
 
 static int __init
-gic_acpi_parse_madt_redist(struct acpi_subtable_header *header,
+gic_acpi_parse_madt_redist(union acpi_subtable_headers *header,
 			   const unsigned long end)
 {
 	struct acpi_madt_generic_redistributor *redist =
@@ -1383,7 +1383,7 @@ gic_acpi_parse_madt_redist(struct acpi_subtable_header *header,
 }
 
 static int __init
-gic_acpi_parse_madt_gicc(struct acpi_subtable_header *header,
+gic_acpi_parse_madt_gicc(union acpi_subtable_headers *header,
 			 const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *gicc =
@@ -1425,14 +1425,14 @@ static int __init gic_acpi_collect_gicr_base(void)
 	return -ENODEV;
 }
 
-static int __init gic_acpi_match_gicr(struct acpi_subtable_header *header,
+static int __init gic_acpi_match_gicr(union acpi_subtable_headers *header,
 				  const unsigned long end)
 {
 	/* Subtable presence means that redist exists, that's it */
 	return 0;
 }
 
-static int __init gic_acpi_match_gicc(struct acpi_subtable_header *header,
+static int __init gic_acpi_match_gicc(union acpi_subtable_headers *header,
 				      const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *gicc =
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index ced10c44b68a..8d2750b835da 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -1508,7 +1508,7 @@ static struct
 } acpi_data __initdata;
 
 static int __init
-gic_acpi_parse_madt_cpu(struct acpi_subtable_header *header,
+gic_acpi_parse_madt_cpu(union acpi_subtable_headers *header,
 			const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *processor;
@@ -1540,7 +1540,7 @@ gic_acpi_parse_madt_cpu(struct acpi_subtable_header *header,
 }
 
 /* The things you have to do to just *count* something... */
-static int __init acpi_dummy_func(struct acpi_subtable_header *header,
+static int __init acpi_dummy_func(union acpi_subtable_headers *header,
 				  const unsigned long end)
 {
 	return 0;
diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
index 256f18b67e8a..08a0a3517138 100644
--- a/drivers/mailbox/pcc.c
+++ b/drivers/mailbox/pcc.c
@@ -382,7 +382,7 @@ static const struct mbox_chan_ops pcc_chan_ops = {
  *
  * This gets called for each entry in the PCC table.
  */
-static int parse_pcc_subspace(struct acpi_subtable_header *header,
+static int parse_pcc_subspace(union acpi_subtable_headers *header,
 		const unsigned long end)
 {
 	struct acpi_pcct_subspace *ss = (struct acpi_pcct_subspace *) header;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index ed80f147bd50..18805a967c70 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -141,10 +141,13 @@ enum acpi_address_range_id {
 
 
 /* Table Handlers */
+union acpi_subtable_headers {
+	struct acpi_subtable_header common;
+};
 
 typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
 
-typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
+typedef int (*acpi_tbl_entry_handler)(union acpi_subtable_headers *header,
 				      const unsigned long end);
 
 /* Debugger support */


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v5 2/5] acpi/numa: Set the memory-side-cache size in memblocks
  2018-12-15  1:48 [PATCH v5 0/5] mm: Randomize free memory Dan Williams
  2018-12-15  1:48 ` [PATCH v5 1/5] acpi: Create subtable parsing infrastructure Dan Williams
@ 2018-12-15  1:48 ` Dan Williams
  2018-12-16 12:34   ` Mike Rapoport
  2018-12-15  1:48 ` [PATCH v5 3/5] mm: Shuffle initial free memory to improve memory-side-cache utilization Dan Williams
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Dan Williams @ 2018-12-15  1:48 UTC (permalink / raw)
  To: akpm
  Cc: x86, Rafael J. Wysocki, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Mike Rapoport, Keith Busch, linux-mm, x86,
	linux-kernel

From: Keith Busch <keith.busch@intel.com>

Add memblock based enumeration of memory-side-cache of System RAM.
Detect the capability in early init through HMAT tables, and set the
size in the address range memblocks if a direct mapped side cache is
present.

Cc: <x86@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/Kconfig         |    1 +
 drivers/acpi/numa.c      |   32 ++++++++++++++++++++++++++++++++
 include/linux/memblock.h |   36 ++++++++++++++++++++++++++++++++++++
 mm/Kconfig               |    3 +++
 mm/memblock.c            |   20 ++++++++++++++++++++
 5 files changed, 92 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8689e794a43c..3f9c413d8eb5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -171,6 +171,7 @@ config X86
 	select HAVE_KVM
 	select HAVE_LIVEPATCH			if X86_64
 	select HAVE_MEMBLOCK_NODE_MAP
+	select HAVE_MEMBLOCK_CACHE_INFO		if ACPI_NUMA
 	select HAVE_MIXED_BREAKPOINTS_REGS
 	select HAVE_MOD_ARCH_SPECIFIC
 	select HAVE_NMI
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index f5e09c39ff22..ec7e849f1c19 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -40,6 +40,12 @@ static int pxm_to_node_map[MAX_PXM_DOMAINS]
 static int node_to_pxm_map[MAX_NUMNODES]
 			= { [0 ... MAX_NUMNODES - 1] = PXM_INVAL };
 
+struct mem_cacheinfo {
+	phys_addr_t size;
+	bool direct_mapped;
+};
+static struct mem_cacheinfo side_cached_pxms[MAX_PXM_DOMAINS] __initdata;
+
 unsigned char acpi_srat_revision __initdata;
 int acpi_numa __initdata;
 
@@ -262,6 +268,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 	u64 start, end;
 	u32 hotpluggable;
 	int node, pxm;
+	u64 cache_size;
+	bool direct;
 
 	if (srat_disabled())
 		goto out_err;
@@ -308,6 +316,13 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 		pr_warn("SRAT: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
 			(unsigned long long)start, (unsigned long long)end - 1);
 
+	cache_size = side_cached_pxms[pxm].size;
+	direct = side_cached_pxms[pxm].direct_mapped;
+	if (cache_size &&
+	    memblock_set_sidecache(start, ma->length, cache_size, direct))
+		pr_warn("SRAT: Failed to mark side cached range [mem %#010Lx-%#010Lx] in memblock\n",
+			(unsigned long long)start, (unsigned long long)end - 1);
+
 	max_possible_pfn = max(max_possible_pfn, PFN_UP(end - 1));
 
 	return 0;
@@ -411,6 +426,18 @@ acpi_parse_memory_affinity(union acpi_subtable_headers * header,
 	return 0;
 }
 
+static int __init
+acpi_parse_cache(union acpi_subtable_headers *header, const unsigned long end)
+{
+	struct acpi_hmat_cache *c = (void *)header;
+	u32 attrs = (c->cache_attributes & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8;
+
+	if (attrs == ACPI_HMAT_CA_DIRECT_MAPPED)
+		side_cached_pxms[c->memory_PD].direct_mapped = true;
+	side_cached_pxms[c->memory_PD].size += c->cache_size;
+	return 0;
+}
+
 static int __init acpi_parse_srat(struct acpi_table_header *table)
 {
 	struct acpi_table_srat *srat = (struct acpi_table_srat *)table;
@@ -460,6 +487,11 @@ int __init acpi_numa_init(void)
 					sizeof(struct acpi_table_srat),
 					srat_proc, ARRAY_SIZE(srat_proc), 0);
 
+		acpi_table_parse_entries(ACPI_SIG_HMAT,
+					 sizeof(struct acpi_table_hmat),
+					 ACPI_HMAT_TYPE_CACHE,
+					 acpi_parse_cache, 0);
+
 		cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
 					    acpi_parse_memory_affinity, 0);
 	}
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index aee299a6aa76..169ed3dd456d 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -60,6 +60,10 @@ struct memblock_region {
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	int nid;
 #endif
+#ifdef CONFIG_HAVE_MEMBLOCK_CACHE_INFO
+	phys_addr_t cache_size;
+	bool direct_mapped;
+#endif
 };
 
 /**
@@ -317,6 +321,38 @@ static inline int memblock_get_region_node(const struct memblock_region *r)
 }
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
+#ifdef CONFIG_HAVE_MEMBLOCK_CACHE_INFO
+int memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
+			   phys_addr_t cache_size, bool direct_mapped);
+
+static inline bool memblock_sidecache_direct_mapped(struct memblock_region *m)
+{
+	return m->direct_mapped;
+}
+
+static inline phys_addr_t memblock_sidecache_size(struct memblock_region *m)
+{
+	return m->cache_size;
+}
+#else
+static inline int memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
+					 phys_addr_t cache_size,
+					 bool direct_mapped)
+{
+	return 0;
+}
+
+static inline phys_addr_t memblock_sidecache_size(struct memblock_region *m)
+{
+	return 0;
+}
+
+static inline bool memblock_sidecache_direct_mapped(struct memblock_region *m)
+{
+	return false;
+}
+#endif /* CONFIG_HAVE_MEMBLOCK_CACHE_INFO */
+
 /* Flags for memblock allocation APIs */
 #define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
 #define MEMBLOCK_ALLOC_ACCESSIBLE	0
diff --git a/mm/Kconfig b/mm/Kconfig
index d85e39da47ae..c7944299a89e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -142,6 +142,9 @@ config ARCH_DISCARD_MEMBLOCK
 config MEMORY_ISOLATION
 	bool
 
+config HAVE_MEMBLOCK_CACHE_INFO
+	bool
+
 #
 # Only be set on architectures that have completely implemented memory hotplug
 # feature. If you are not sure, don't touch it.
diff --git a/mm/memblock.c b/mm/memblock.c
index 9a2d5ae81ae1..185bfd4e87bb 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -822,6 +822,26 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 	return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK_CACHE_INFO
+int __init_memblock memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
+			   phys_addr_t cache_size, bool direct_mapped)
+{
+	struct memblock_type *type = &memblock.memory;
+	int i, ret, start_rgn, end_rgn;
+
+	ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+	if (ret)
+		return ret;
+
+	for (i = start_rgn; i < end_rgn; i++) {
+		type->regions[i].cache_size = cache_size;
+		type->regions[i].direct_mapped = direct_mapped;
+	}
+
+	return 0;
+}
+#endif
+
 /**
  * memblock_setclr_flag - set or clear flag for a memory region
  * @base: base address of the region


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v5 3/5] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2018-12-15  1:48 [PATCH v5 0/5] mm: Randomize free memory Dan Williams
  2018-12-15  1:48 ` [PATCH v5 1/5] acpi: Create subtable parsing infrastructure Dan Williams
  2018-12-15  1:48 ` [PATCH v5 2/5] acpi/numa: Set the memory-side-cache size in memblocks Dan Williams
@ 2018-12-15  1:48 ` Dan Williams
  2018-12-16 12:43   ` Mike Rapoport
  2018-12-15  1:48 ` [PATCH v5 4/5] mm: Move buddy list manipulations into helpers Dan Williams
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Dan Williams @ 2018-12-15  1:48 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Kees Cook, Dave Hansen, peterz, linux-mm, x86,
	linux-kernel

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.org/lkml/2017/8/23/195

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
"x86/numa_emulation: Introduce uniform split capability". With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
3X speedup in a contrived case that tries to force cache conflicts. The
contrived cased used the numa_emulation capability to force an instance
of the benchmark to be run in two of the near-memory sized numa nodes.
If both instances were placed on the same emulated they would fit and
cause zero conflicts.  While on separate emulated nodes without
randomization they underutilized the cache and conflicted unnecessarily
due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
size that exceeded cache size by 3X. The cache conflict rate was 8% for
the first run and degraded to 21% after page allocator aging. With
randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in
cache-conflict rates, but the overall throughput dropped by 7% with
randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
enabled on platforms without a memory-side-cache and saw a mix of some
improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
may be negligible to negative for other workloads. For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected. Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those
systems.

Outside of memory-side-cache utilization concerns there is potentially
security benefit from randomization. Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects. The kernel page allocator,
especially early in system boot, has predictable first-in-first out
behavior for physical pages. Pages are freed in physical address order
when first onlined.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves vast bulk of memory to be predictably in order
allocated.  However, it should be noted, the concrete security benefits
are hard to quantify, and no known CVE is mitigated by this
randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
when they are initially populated with free memory at boot and at
hotplug time.

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e.
10, 4MB this trades off randomization granularity for time spent
shuffling.  MAX_ORDER-1 was chosen to be minimally invasive to the page
allocator while still showing memory-side cache behavior improvements,
and the expectation that the security implications of finer granularity
randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM.

The performance impact of the shuffling appears to be in the noise
compared to other memory initialization work. Also the bulk of the work
is done in the background as a part of deferred_init_memmap().

This initial randomization can be undone over time so a follow-on patch
is introduced to inject entropy on page free decisions. It is reasonable
to ask if the page free entropy is sufficient, but it is not enough due
to the in-order initial freeing of pages. At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent. As
more pages are added ordering diversity improves, but there is still
high page locality for the low address pages and this leads to no
significant impact to the cache conflict rate.

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.org/lkml/2018/9/22/54
[3]: https://lkml.org/lkml/2018/10/12/309

Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/list.h   |   17 ++++
 include/linux/mm.h     |   38 +++++++++
 include/linux/mmzone.h |    4 +
 init/Kconfig           |   36 ++++++++
 mm/Makefile            |    7 +-
 mm/memblock.c          |   21 ++++-
 mm/memory_hotplug.c    |    2 
 mm/page_alloc.c        |    2 
 mm/shuffle.c           |  206 ++++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 329 insertions(+), 4 deletions(-)
 create mode 100644 mm/shuffle.c

diff --git a/include/linux/list.h b/include/linux/list.h
index edb7628e46ed..3dfb8953f241 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
 	INIT_LIST_HEAD(old);
 }
 
+/**
+ * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
+ * @entry1: the location to place entry2
+ * @entry2: the location to place entry1
+ */
+static inline void list_swap(struct list_head *entry1,
+			     struct list_head *entry2)
+{
+	struct list_head *pos = entry2->prev;
+
+	list_del(entry2);
+	list_replace(entry1, entry2);
+	if (pos == entry1)
+		pos = entry2;
+	list_add(entry1, pos);
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..f9647779e82b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2069,6 +2069,44 @@ extern void mem_init_print_info(const char *str);
 
 extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
 
+#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
+DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+extern void page_alloc_shuffle_enable(void);
+extern void __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+		unsigned long end_pfn);
+static inline void shuffle_free_memory(pg_data_t *pgdat,
+		unsigned long start_pfn, unsigned long end_pfn)
+{
+	if (!static_branch_unlikely(&page_alloc_shuffle_key))
+		return;
+	__shuffle_free_memory(pgdat, start_pfn, end_pfn);
+}
+
+extern void __shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn);
+static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	if (!static_branch_unlikely(&page_alloc_shuffle_key))
+		return;
+	__shuffle_zone(z, start_pfn, end_pfn);
+}
+#else
+static inline void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+}
+
+static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+}
+
+static inline void page_alloc_shuffle_enable(void)
+{
+}
+#endif
+
 /* Free the reserved page into the buddy system, so it gets managed. */
 static inline void __free_reserved_page(struct page *page)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 847705a6d0ec..eafa66d66232 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1266,6 +1266,10 @@ void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
 #define sparse_index_init(_sec, _nid)  do {} while (0)
+static inline int pfn_present(unsigned long pfn)
+{
+	return 1;
+}
 #endif /* CONFIG_SPARSEMEM */
 
 /*
diff --git a/init/Kconfig b/init/Kconfig
index cf5b5a0dcbc2..fa6812d995ec 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1720,6 +1720,42 @@ config SLAB_FREELIST_HARDENED
 	  sacrifies to harden the kernel slab allocator against common
 	  freelist exploit methods.
 
+config SHUFFLE_PAGE_ALLOCATOR
+	bool "Page allocator randomization"
+	depends on HAVE_MEMBLOCK_CACHE_INFO
+	default SLAB_FREELIST_RANDOM
+	help
+	  Randomization of the page allocator improves the average
+	  utilization of a direct-mapped memory-side-cache. See section
+	  5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
+	  6.2a specification for an example of how a platform advertises
+	  the presence of a memory-side-cache. There are also incidental
+	  security benefits as it reduces the predictability of page
+	  allocations to compliment SLAB_FREELIST_RANDOM, but the
+	  default granularity of shuffling on 4MB (MAX_ORDER) pages is
+	  selected based on cache utilization benefits.
+
+	  While the randomization improves cache utilization it may
+	  negatively impact workloads on platforms without a cache. For
+	  this reason, by default, the randomization is enabled only
+	  after runtime detection of a direct-mapped memory-side-cache.
+	  Otherwise, the randomization may be force enabled with the
+	  'page_alloc.shuffle' kernel command line parameter.
+
+	  Say Y if unsure.
+
+config SHUFFLE_PAGE_ORDER
+	depends on SHUFFLE_PAGE_ALLOCATOR
+	int "Page allocator shuffle order"
+	range 0 10
+	default 10
+	help
+	  Specify the granularity at which shuffling (randomization) is
+	  performed. By default this is set to MAX_ORDER-1 to minimize
+	  runtime impact of randomization and with the expectation that
+	  SLAB_FREELIST_RANDOM mitigates heap attacks on smaller
+	  object granularities.
+
 config SLUB_CPU_PARTIAL
 	default y
 	depends on SLUB && SMP
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..ac5e5ba78874 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@ mmu-$(CONFIG_MMU)	+= process_vm_access.o
 endif
 
 obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
-			   maccess.o page_alloc.o page-writeback.o \
+			   maccess.o page-writeback.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
@@ -41,6 +41,11 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
+# Give 'page_alloc' its own module-parameter namespace
+page-alloc-y := page_alloc.o
+page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
+
+obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 185bfd4e87bb..fd617928ccc1 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -834,8 +834,16 @@ int __init_memblock memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
 		return ret;
 
 	for (i = start_rgn; i < end_rgn; i++) {
-		type->regions[i].cache_size = cache_size;
-		type->regions[i].direct_mapped = direct_mapped;
+		struct memblock_region *r = &type->regions[i];
+
+		r->cache_size = cache_size;
+		r->direct_mapped = direct_mapped;
+		/*
+		 * Enable randomization for amortizing direct-mapped
+		 * memory-side-cache conflicts.
+		 */
+		if (r->size > r->cache_size && r->direct_mapped)
+			page_alloc_shuffle_enable();
 	}
 
 	return 0;
@@ -1957,9 +1965,16 @@ static unsigned long __init free_low_memory_core_early(void)
 	 *  low ram will be on Node1
 	 */
 	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
-				NULL)
+				NULL) {
+		pg_data_t *pgdat;
+
 		count += __free_memory_core(start, end);
 
+		for_each_online_pgdat(pgdat)
+			shuffle_free_memory(pgdat, PHYS_PFN(start),
+					PHYS_PFN(end));
+	}
+
 	return count;
 }
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2b2b3ccbbfb5..fb17c54d47b0 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -895,6 +895,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	zone->zone_pgdat->node_present_pages += onlined_pages;
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
+	shuffle_zone(zone, pfn, zone_end_pfn(zone));
+
 	if (onlined_pages) {
 		node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ec9cc407216..a51031cf16fe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1595,6 +1595,8 @@ static int __init deferred_init_memmap(void *data)
 	}
 	pgdat_resize_unlock(pgdat, &flags);
 
+	shuffle_zone(zone, first_init_pfn, zone_end_pfn(zone));
+
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
diff --git a/mm/shuffle.c b/mm/shuffle.c
new file mode 100644
index 000000000000..621fde268d01
--- /dev/null
+++ b/mm/shuffle.c
@@ -0,0 +1,206 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/mmzone.h>
+#include <linux/random.h>
+#include <linux/moduleparam.h>
+#include "internal.h"
+
+DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+static unsigned long shuffle_state;
+enum {
+	SHUFFLE_ENABLE,
+	SHUFFLE_FORCE_DISABLE,
+};
+
+void page_alloc_shuffle_enable(void)
+{
+	if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
+		/*
+		 * If platform cache auto-detect runs before module
+		 * parameter parsing, disable the shuffle implementation
+		 */
+		if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
+			static_branch_disable(&page_alloc_shuffle_key);
+	} else if (!test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
+		static_branch_enable(&page_alloc_shuffle_key);
+}
+
+static bool shuffle_param;
+static int force_shuffle(const char *val, const struct kernel_param *kp)
+{
+	int rc = param_set_bool(val, kp);
+
+	if (rc < 0)
+		return rc;
+	if (shuffle_param)
+		page_alloc_shuffle_enable();
+	else
+		set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
+	return 0;
+}
+module_param_call(shuffle, force_shuffle, param_get_bool, &shuffle_param, 0400);
+
+/*
+ * For two pages to be swapped in the shuffle, they must be free (on a
+ * 'free_area' lru), have the same order, and have the same migratetype.
+ */
+static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
+{
+	struct page *page;
+
+	/*
+	 * Given we're dealing with randomly selected pfns in a zone we
+	 * need to ask questions like...
+	 */
+
+	/* ...is the pfn even in the memmap? */
+	if (!pfn_valid_within(pfn))
+		return NULL;
+
+	/* ...is the pfn in a present section or a hole? */
+	if (!pfn_present(pfn))
+		return NULL;
+
+	/* ...is the page free and currently on a free_area list? */
+	page = pfn_to_page(pfn);
+	if (!PageBuddy(page))
+		return NULL;
+
+	/*
+	 * ...is the page on the same list as the page we will
+	 * shuffle it with?
+	 */
+	if (page_order(page) != order)
+		return NULL;
+
+	return page;
+}
+
+/*
+ * Fisher-Yates shuffle the freelist which prescribes iterating through
+ * an array, pfns in this case, and randomly swapping each entry with
+ * another in the span, end_pfn - start_pfn.
+ *
+ * To keep the implementation simple it does not attempt to correct for
+ * sources of bias in the distribution, like modulo bias or
+ * pseudo-random number generator bias. I.e. the expectation is that
+ * this shuffling raises the bar for attacks that exploit the
+ * predictability of page allocations, but need not be a perfect
+ * shuffle.
+ *
+ * Note that we don't use @z->zone_start_pfn and zone_end_pfn(@z)
+ * directly since the caller may be aware of holes in the zone and can
+ * improve the accuracy of the random pfn selection.
+ */
+#define SHUFFLE_RETRY 10
+static void __meminit shuffle_zone_order(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn, const int order)
+{
+	unsigned long i, flags;
+	const int order_pages = 1 << order;
+
+	if (start_pfn < z->zone_start_pfn)
+		start_pfn = z->zone_start_pfn;
+	if (end_pfn > zone_end_pfn(z))
+		end_pfn = zone_end_pfn(z);
+
+	/* probably means that start/end were outside the zone */
+	if (end_pfn <= start_pfn)
+		return;
+	spin_lock_irqsave(&z->lock, flags);
+	start_pfn = ALIGN(start_pfn, order_pages);
+	for (i = start_pfn; i < end_pfn; i += order_pages) {
+		unsigned long j;
+		int migratetype, retry;
+		struct page *page_i, *page_j;
+
+		/*
+		 * We expect page_i, in the sub-range of a zone being
+		 * added (@start_pfn to @end_pfn), to more likely be
+		 * valid compared to page_j randomly selected in the
+		 * span @zone_start_pfn to @spanned_pages.
+		 */
+		page_i = shuffle_valid_page(i, order);
+		if (!page_i)
+			continue;
+
+		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
+			/*
+			 * Pick a random order aligned page from the
+			 * start of the zone. Use the *whole* zone here
+			 * so that if it is freed in tiny pieces that we
+			 * randomize in the whole zone, not just within
+			 * those fragments.
+			 *
+			 * Since page_j comes from a potentially sparse
+			 * address range we want to try a bit harder to
+			 * find a shuffle point for page_i.
+			 */
+			j = z->zone_start_pfn +
+				ALIGN_DOWN(get_random_long() % z->spanned_pages,
+						order_pages);
+			page_j = shuffle_valid_page(j, order);
+			if (page_j && page_j != page_i)
+				break;
+		}
+		if (retry >= SHUFFLE_RETRY) {
+			pr_debug("%s: failed to swap %#lx\n", __func__, i);
+			continue;
+		}
+
+		/*
+		 * Each migratetype corresponds to its own list, make
+		 * sure the types match otherwise we're moving pages to
+		 * lists where they do not belong.
+		 */
+		migratetype = get_pageblock_migratetype(page_i);
+		if (get_pageblock_migratetype(page_j) != migratetype) {
+			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
+			continue;
+		}
+
+		list_swap(&page_i->lru, &page_j->lru);
+
+		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
+
+		/* take it easy on the zone lock */
+		if ((i % (100 * order_pages)) == 0) {
+			spin_unlock_irqrestore(&z->lock, flags);
+			cond_resched();
+			spin_lock_irqsave(&z->lock, flags);
+		}
+	}
+	spin_unlock_irqrestore(&z->lock, flags);
+}
+
+void __meminit __shuffle_zone(struct zone *z, unsigned long start_pfn,
+               unsigned long end_pfn)
+{
+       int i;
+
+       /* shuffle all the orders at the specified order and higher */
+       for (i = CONFIG_SHUFFLE_PAGE_ORDER; i < MAX_ORDER; i++)
+               shuffle_zone_order(z, start_pfn, end_pfn, i);
+}
+
+/**
+ * shuffle_free_memory - reduce the predictability of the page allocator
+ * @pgdat: node page data
+ * @start_pfn: Limit the shuffle to the greater of this value or zone start
+ * @end_pfn: Limit the shuffle to the less of this value or zone end
+ *
+ * While shuffle_zone() attempts to avoid holes with pfn_valid() and
+ * pfn_present() they can not report sub-section sized holes. @start_pfn
+ * and @end_pfn limit the shuffle to the exact memory pages being freed.
+ */
+void __meminit __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	struct zone *z;
+
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		shuffle_zone(z, start_pfn, end_pfn);
+}


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v5 4/5] mm: Move buddy list manipulations into helpers
  2018-12-15  1:48 [PATCH v5 0/5] mm: Randomize free memory Dan Williams
                   ` (2 preceding siblings ...)
  2018-12-15  1:48 ` [PATCH v5 3/5] mm: Shuffle initial free memory to improve memory-side-cache utilization Dan Williams
@ 2018-12-15  1:48 ` Dan Williams
  2018-12-15  1:48 ` [PATCH v5 5/5] mm: Maintain randomization of page free lists Dan Williams
  2018-12-17 10:10 ` [PATCH v5 0/5] mm: Randomize free memory Rafael J. Wysocki
  5 siblings, 0 replies; 15+ messages in thread
From: Dan Williams @ 2018-12-15  1:48 UTC (permalink / raw)
  To: akpm; +Cc: Michal Hocko, Dave Hansen, peterz, linux-mm, x86, linux-kernel

In preparation for runtime randomization of the zone lists, take all
(well, most of) the list_*() functions in the buddy allocator and put
them in helper functions. Provide a common control point for injecting
additional behavior when freeing pages.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h       |    3 --
 include/linux/mm_types.h |    3 ++
 include/linux/mmzone.h   |   51 ++++++++++++++++++++++++++++++++++
 mm/compaction.c          |    4 +--
 mm/page_alloc.c          |   70 ++++++++++++++++++----------------------------
 5 files changed, 84 insertions(+), 47 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f9647779e82b..43e5a449caaf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -473,9 +473,6 @@ static inline void vma_set_anonymous(struct vm_area_struct *vma)
 struct mmu_gather;
 struct inode;
 
-#define page_private(page)		((page)->private)
-#define set_page_private(page, v)	((page)->private = (v))
-
 #if !defined(__HAVE_ARCH_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline int pmd_devmap(pmd_t pmd)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..72f37ea6dedb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -209,6 +209,9 @@ struct page {
 #define PAGE_FRAG_CACHE_MAX_SIZE	__ALIGN_MASK(32768, ~PAGE_MASK)
 #define PAGE_FRAG_CACHE_MAX_ORDER	get_order(PAGE_FRAG_CACHE_MAX_SIZE)
 
+#define page_private(page)		((page)->private)
+#define set_page_private(page, v)	((page)->private = (v))
+
 struct page_frag_cache {
 	void * va;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index eafa66d66232..35cc33af87f2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -18,6 +18,8 @@
 #include <linux/pageblock-flags.h>
 #include <linux/page-flags-layout.h>
 #include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/page-flags.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -98,6 +100,55 @@ struct free_area {
 	unsigned long		nr_free;
 };
 
+/* Used for pages not on another list */
+static inline void add_to_free_area(struct page *page, struct free_area *area,
+			     int migratetype)
+{
+	list_add(&page->lru, &area->free_list[migratetype]);
+	area->nr_free++;
+}
+
+/* Used for pages not on another list */
+static inline void add_to_free_area_tail(struct page *page, struct free_area *area,
+				  int migratetype)
+{
+	list_add_tail(&page->lru, &area->free_list[migratetype]);
+	area->nr_free++;
+}
+
+/* Used for pages which are on another list */
+static inline void move_to_free_area(struct page *page, struct free_area *area,
+			     int migratetype)
+{
+	list_move(&page->lru, &area->free_list[migratetype]);
+}
+
+static inline struct page *get_page_from_free_area(struct free_area *area,
+					    int migratetype)
+{
+	return list_first_entry_or_null(&area->free_list[migratetype],
+					struct page, lru);
+}
+
+static inline void rmv_page_order(struct page *page)
+{
+	__ClearPageBuddy(page);
+	set_page_private(page, 0);
+}
+
+static inline void del_page_from_free_area(struct page *page,
+		struct free_area *area, int migratetype)
+{
+	list_del(&page->lru);
+	rmv_page_order(page);
+	area->nr_free--;
+}
+
+static inline bool free_area_empty(struct free_area *area, int migratetype)
+{
+	return list_empty(&area->free_list[migratetype]);
+}
+
 struct pglist_data;
 
 /*
diff --git a/mm/compaction.c b/mm/compaction.c
index 7c607479de4a..44adbfa073b3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1359,13 +1359,13 @@ static enum compact_result __compact_finished(struct zone *zone,
 		bool can_steal;
 
 		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&area->free_list[migratetype]))
+		if (!free_area_empty(area, migratetype))
 			return COMPACT_SUCCESS;
 
 #ifdef CONFIG_CMA
 		/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
 		if (migratetype == MIGRATE_MOVABLE &&
-			!list_empty(&area->free_list[MIGRATE_CMA]))
+			!free_area_empty(area, MIGRATE_CMA))
 			return COMPACT_SUCCESS;
 #endif
 		/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a51031cf16fe..d2f7b050bc13 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -704,12 +704,6 @@ static inline void set_page_order(struct page *page, unsigned int order)
 	__SetPageBuddy(page);
 }
 
-static inline void rmv_page_order(struct page *page)
-{
-	__ClearPageBuddy(page);
-	set_page_private(page, 0);
-}
-
 /*
  * This function checks whether a page is free && is the buddy
  * we can coalesce a page and its buddy if
@@ -810,13 +804,11 @@ static inline void __free_one_page(struct page *page,
 		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
 		 * merge with it and move up one order.
 		 */
-		if (page_is_guard(buddy)) {
+		if (page_is_guard(buddy))
 			clear_page_guard(zone, buddy, order, migratetype);
-		} else {
-			list_del(&buddy->lru);
-			zone->free_area[order].nr_free--;
-			rmv_page_order(buddy);
-		}
+		else
+			del_page_from_free_area(buddy, &zone->free_area[order],
+					migratetype);
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -866,15 +858,13 @@ static inline void __free_one_page(struct page *page,
 		higher_buddy = higher_page + (buddy_pfn - combined_pfn);
 		if (pfn_valid_within(buddy_pfn) &&
 		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype]);
-			goto out;
+			add_to_free_area_tail(page, &zone->free_area[order],
+					      migratetype);
+			return;
 		}
 	}
 
-	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
-out:
-	zone->free_area[order].nr_free++;
+	add_to_free_area(page, &zone->free_area[order], migratetype);
 }
 
 /*
@@ -1819,7 +1809,7 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
-		list_add(&page[size].lru, &area->free_list[migratetype]);
+		add_to_free_area(&page[size], area, migratetype);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -1961,13 +1951,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
 		area = &(zone->free_area[current_order]);
-		page = list_first_entry_or_null(&area->free_list[migratetype],
-							struct page, lru);
+		page = get_page_from_free_area(area, migratetype);
 		if (!page)
 			continue;
-		list_del(&page->lru);
-		rmv_page_order(page);
-		area->nr_free--;
+		del_page_from_free_area(page, area, migratetype);
 		expand(zone, page, order, current_order, area, migratetype);
 		set_pcppage_migratetype(page, migratetype);
 		return page;
@@ -2053,8 +2040,7 @@ static int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
-		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype]);
+		move_to_free_area(page, &zone->free_area[order], migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -2206,7 +2192,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 single_page:
 	area = &zone->free_area[current_order];
-	list_move(&page->lru, &area->free_list[start_type]);
+	move_to_free_area(page, area, start_type);
 }
 
 /*
@@ -2230,7 +2216,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 		if (fallback_mt == MIGRATE_TYPES)
 			break;
 
-		if (list_empty(&area->free_list[fallback_mt]))
+		if (free_area_empty(area, fallback_mt))
 			continue;
 
 		if (can_steal_fallback(order, migratetype))
@@ -2317,9 +2303,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 		for (order = 0; order < MAX_ORDER; order++) {
 			struct free_area *area = &(zone->free_area[order]);
 
-			page = list_first_entry_or_null(
-					&area->free_list[MIGRATE_HIGHATOMIC],
-					struct page, lru);
+			page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
 			if (!page)
 				continue;
 
@@ -2432,8 +2416,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 	VM_BUG_ON(current_order == MAX_ORDER);
 
 do_steal:
-	page = list_first_entry(&area->free_list[fallback_mt],
-							struct page, lru);
+	page = get_page_from_free_area(area, fallback_mt);
 
 	steal_suitable_fallback(zone, page, start_migratetype, can_steal);
 
@@ -2860,6 +2843,7 @@ EXPORT_SYMBOL_GPL(split_page);
 
 int __isolate_free_page(struct page *page, unsigned int order)
 {
+	struct free_area *area = &page_zone(page)->free_area[order];
 	unsigned long watermark;
 	struct zone *zone;
 	int mt;
@@ -2884,9 +2868,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
-	list_del(&page->lru);
-	zone->free_area[order].nr_free--;
-	rmv_page_order(page);
+
+	del_page_from_free_area(page, area, mt);
 
 	/*
 	 * Set the pageblock if the isolated page is at least half of a
@@ -3180,13 +3163,13 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			continue;
 
 		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
-			if (!list_empty(&area->free_list[mt]))
+			if (!free_area_empty(area, mt))
 				return true;
 		}
 
 #ifdef CONFIG_CMA
 		if ((alloc_flags & ALLOC_CMA) &&
-		    !list_empty(&area->free_list[MIGRATE_CMA])) {
+		    !free_area_empty(area, MIGRATE_CMA)) {
 			return true;
 		}
 #endif
@@ -5019,7 +5002,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 
 			types[order] = 0;
 			for (type = 0; type < MIGRATE_TYPES; type++) {
-				if (!list_empty(&area->free_list[type]))
+				if (!free_area_empty(area, type))
 					types[order] |= 1 << type;
 			}
 		}
@@ -8144,6 +8127,9 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	spin_lock_irqsave(&zone->lock, flags);
 	pfn = start_pfn;
 	while (pfn < end_pfn) {
+		struct free_area *area;
+		int mt;
+
 		if (!pfn_valid(pfn)) {
 			pfn++;
 			continue;
@@ -8162,13 +8148,13 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		BUG_ON(page_count(page));
 		BUG_ON(!PageBuddy(page));
 		order = page_order(page);
+		area = &zone->free_area[order];
 #ifdef CONFIG_DEBUG_VM
 		pr_info("remove from free list %lx %d %lx\n",
 			pfn, 1 << order, end_pfn);
 #endif
-		list_del(&page->lru);
-		rmv_page_order(page);
-		zone->free_area[order].nr_free--;
+		mt = get_pageblock_migratetype(page);
+		del_page_from_free_area(page, area, mt);
 		for (i = 0; i < (1 << order); i++)
 			SetPageReserved((page+i));
 		pfn += (1 << order);


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v5 5/5] mm: Maintain randomization of page free lists
  2018-12-15  1:48 [PATCH v5 0/5] mm: Randomize free memory Dan Williams
                   ` (3 preceding siblings ...)
  2018-12-15  1:48 ` [PATCH v5 4/5] mm: Move buddy list manipulations into helpers Dan Williams
@ 2018-12-15  1:48 ` Dan Williams
  2018-12-17 10:10 ` [PATCH v5 0/5] mm: Randomize free memory Rafael J. Wysocki
  5 siblings, 0 replies; 15+ messages in thread
From: Dan Williams @ 2018-12-15  1:48 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Kees Cook, Dave Hansen, peterz, linux-mm, x86,
	linux-kernel

When freeing a page with an order >= shuffle_page_order randomly select
the front or back of the list for insertion.

While the mm tries to defragment physical pages into huge pages this can
tend to make the page allocator more predictable over time. Inject the
front-back randomness to preserve the initial randomness established by
shuffle_free_memory() when the kernel was booted.

The overhead of this manipulation is constrained by only being applied
for MAX_ORDER sized pages by default.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h     |   12 ++++++++++++
 include/linux/mmzone.h |   10 ++++++++++
 mm/page_alloc.c        |   11 +++++++++--
 mm/shuffle.c           |   16 ++++++++++++++++
 4 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 43e5a449caaf..8299267c028a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2088,6 +2088,13 @@ static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
 		return;
 	__shuffle_zone(z, start_pfn, end_pfn);
 }
+
+static inline bool is_shuffle_order(int order)
+{
+	if (!static_branch_unlikely(&page_alloc_shuffle_key))
+                return false;
+	return order >= CONFIG_SHUFFLE_PAGE_ORDER;
+}
 #else
 static inline void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
 		unsigned long end_pfn)
@@ -2102,6 +2109,11 @@ static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
 static inline void page_alloc_shuffle_enable(void)
 {
 }
+
+static inline bool is_shuffle_order(int order)
+{
+	return false;
+}
 #endif
 
 /* Free the reserved page into the buddy system, so it gets managed. */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 35cc33af87f2..338929647eea 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -98,6 +98,8 @@ extern int page_group_by_mobility_disabled;
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
+	u64			rand;
+	u8			rand_bits;
 };
 
 /* Used for pages not on another list */
@@ -116,6 +118,14 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar
 	area->nr_free++;
 }
 
+#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
+/* Used to preserve page allocation order entropy */
+void add_to_free_area_random(struct page *page, struct free_area *area,
+		int migratetype);
+#else
+#define add_to_free_area_random add_to_free_area
+#endif
+
 /* Used for pages which are on another list */
 static inline void move_to_free_area(struct page *page, struct free_area *area,
 			     int migratetype)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2f7b050bc13..62a40ad07593 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -42,6 +42,7 @@
 #include <linux/mempolicy.h>
 #include <linux/memremap.h>
 #include <linux/stop_machine.h>
+#include <linux/random.h>
 #include <linux/sort.h>
 #include <linux/pfn.h>
 #include <linux/backing-dev.h>
@@ -850,7 +851,8 @@ static inline void __free_one_page(struct page *page,
 	 * so it's less likely to be used soon and more likely to be merged
 	 * as a higher order page
 	 */
-	if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) {
+	if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)
+			&& !is_shuffle_order(order)) {
 		struct page *higher_page, *higher_buddy;
 		combined_pfn = buddy_pfn & pfn;
 		higher_page = page + (combined_pfn - pfn);
@@ -864,7 +866,12 @@ static inline void __free_one_page(struct page *page,
 		}
 	}
 
-	add_to_free_area(page, &zone->free_area[order], migratetype);
+	if (is_shuffle_order(order))
+		add_to_free_area_random(page, &zone->free_area[order],
+				migratetype);
+	else
+		add_to_free_area(page, &zone->free_area[order], migratetype);
+
 }
 
 /*
diff --git a/mm/shuffle.c b/mm/shuffle.c
index 621fde268d01..5850a0761d10 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -204,3 +204,19 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
 	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
 		shuffle_zone(z, start_pfn, end_pfn);
 }
+
+void add_to_free_area_random(struct page *page, struct free_area *area,
+		int migratetype)
+{
+	if (area->rand_bits == 0) {
+		area->rand_bits = 64;
+		area->rand = get_random_u64();
+	}
+
+	if (area->rand & 1)
+		add_to_free_area(page, area, migratetype);
+	else
+		add_to_free_area_tail(page, area, migratetype);
+	area->rand_bits--;
+	area->rand >>= 1;
+}


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v5 2/5] acpi/numa: Set the memory-side-cache size in memblocks
  2018-12-15  1:48 ` [PATCH v5 2/5] acpi/numa: Set the memory-side-cache size in memblocks Dan Williams
@ 2018-12-16 12:34   ` Mike Rapoport
  0 siblings, 0 replies; 15+ messages in thread
From: Mike Rapoport @ 2018-12-16 12:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, x86, Rafael J. Wysocki, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Keith Busch, linux-mm, linux-kernel

On Fri, Dec 14, 2018 at 05:48:40PM -0800, Dan Williams wrote:
> From: Keith Busch <keith.busch@intel.com>
> 
> Add memblock based enumeration of memory-side-cache of System RAM.
> Detect the capability in early init through HMAT tables, and set the
> size in the address range memblocks if a direct mapped side cache is
> present.
> 
> Cc: <x86@kernel.org>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/x86/Kconfig         |    1 +
>  drivers/acpi/numa.c      |   32 ++++++++++++++++++++++++++++++++
>  include/linux/memblock.h |   36 ++++++++++++++++++++++++++++++++++++
>  mm/Kconfig               |    3 +++
>  mm/memblock.c            |   20 ++++++++++++++++++++
>  5 files changed, 92 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 8689e794a43c..3f9c413d8eb5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -171,6 +171,7 @@ config X86
>  	select HAVE_KVM
>  	select HAVE_LIVEPATCH			if X86_64
>  	select HAVE_MEMBLOCK_NODE_MAP
> +	select HAVE_MEMBLOCK_CACHE_INFO		if ACPI_NUMA
>  	select HAVE_MIXED_BREAKPOINTS_REGS
>  	select HAVE_MOD_ARCH_SPECIFIC
>  	select HAVE_NMI
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index f5e09c39ff22..ec7e849f1c19 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -40,6 +40,12 @@ static int pxm_to_node_map[MAX_PXM_DOMAINS]
>  static int node_to_pxm_map[MAX_NUMNODES]
>  			= { [0 ... MAX_NUMNODES - 1] = PXM_INVAL };
>  
> +struct mem_cacheinfo {
> +	phys_addr_t size;
> +	bool direct_mapped;
> +};
> +static struct mem_cacheinfo side_cached_pxms[MAX_PXM_DOMAINS] __initdata;
> +
>  unsigned char acpi_srat_revision __initdata;
>  int acpi_numa __initdata;
>  
> @@ -262,6 +268,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
>  	u64 start, end;
>  	u32 hotpluggable;
>  	int node, pxm;
> +	u64 cache_size;
> +	bool direct;
>  
>  	if (srat_disabled())
>  		goto out_err;
> @@ -308,6 +316,13 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
>  		pr_warn("SRAT: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
>  			(unsigned long long)start, (unsigned long long)end - 1);
>  
> +	cache_size = side_cached_pxms[pxm].size;
> +	direct = side_cached_pxms[pxm].direct_mapped;
> +	if (cache_size &&
> +	    memblock_set_sidecache(start, ma->length, cache_size, direct))
> +		pr_warn("SRAT: Failed to mark side cached range [mem %#010Lx-%#010Lx] in memblock\n",
> +			(unsigned long long)start, (unsigned long long)end - 1);
> +
>  	max_possible_pfn = max(max_possible_pfn, PFN_UP(end - 1));
>  
>  	return 0;
> @@ -411,6 +426,18 @@ acpi_parse_memory_affinity(union acpi_subtable_headers * header,
>  	return 0;
>  }
>  
> +static int __init
> +acpi_parse_cache(union acpi_subtable_headers *header, const unsigned long end)
> +{
> +	struct acpi_hmat_cache *c = (void *)header;
> +	u32 attrs = (c->cache_attributes & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8;
> +
> +	if (attrs == ACPI_HMAT_CA_DIRECT_MAPPED)
> +		side_cached_pxms[c->memory_PD].direct_mapped = true;
> +	side_cached_pxms[c->memory_PD].size += c->cache_size;
> +	return 0;
> +}
> +
>  static int __init acpi_parse_srat(struct acpi_table_header *table)
>  {
>  	struct acpi_table_srat *srat = (struct acpi_table_srat *)table;
> @@ -460,6 +487,11 @@ int __init acpi_numa_init(void)
>  					sizeof(struct acpi_table_srat),
>  					srat_proc, ARRAY_SIZE(srat_proc), 0);
>  
> +		acpi_table_parse_entries(ACPI_SIG_HMAT,
> +					 sizeof(struct acpi_table_hmat),
> +					 ACPI_HMAT_TYPE_CACHE,
> +					 acpi_parse_cache, 0);
> +
>  		cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
>  					    acpi_parse_memory_affinity, 0);
>  	}
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index aee299a6aa76..169ed3dd456d 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -60,6 +60,10 @@ struct memblock_region {
>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  	int nid;
>  #endif
> +#ifdef CONFIG_HAVE_MEMBLOCK_CACHE_INFO
> +	phys_addr_t cache_size;
> +	bool direct_mapped;
> +#endif

Please add descriptions of the new fields to the 'struct memblock_region'
kernel-doc.

>  };
>  
>  /**
> @@ -317,6 +321,38 @@ static inline int memblock_get_region_node(const struct memblock_region *r)
>  }
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
> +#ifdef CONFIG_HAVE_MEMBLOCK_CACHE_INFO
> +int memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
> +			   phys_addr_t cache_size, bool direct_mapped);
> +
> +static inline bool memblock_sidecache_direct_mapped(struct memblock_region *m)
> +{
> +	return m->direct_mapped;
> +}
> +
> +static inline phys_addr_t memblock_sidecache_size(struct memblock_region *m)
> +{
> +	return m->cache_size;
> +}
> +#else
> +static inline int memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
> +					 phys_addr_t cache_size,
> +					 bool direct_mapped)
> +{
> +	return 0;
> +}
> +
> +static inline phys_addr_t memblock_sidecache_size(struct memblock_region *m)
> +{
> +	return 0;
> +}
> +
> +static inline bool memblock_sidecache_direct_mapped(struct memblock_region *m)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_HAVE_MEMBLOCK_CACHE_INFO */
> +
>  /* Flags for memblock allocation APIs */
>  #define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
>  #define MEMBLOCK_ALLOC_ACCESSIBLE	0
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d85e39da47ae..c7944299a89e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -142,6 +142,9 @@ config ARCH_DISCARD_MEMBLOCK
>  config MEMORY_ISOLATION
>  	bool
>  
> +config HAVE_MEMBLOCK_CACHE_INFO
> +	bool
> +
>  #
>  # Only be set on architectures that have completely implemented memory hotplug
>  # feature. If you are not sure, don't touch it.
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 9a2d5ae81ae1..185bfd4e87bb 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -822,6 +822,26 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
>  	return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
>  }
>  
> +#ifdef CONFIG_HAVE_MEMBLOCK_CACHE_INFO

Kernel-doc here would be appreciated.

> +int __init_memblock memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
> +			   phys_addr_t cache_size, bool direct_mapped)
> +{
> +	struct memblock_type *type = &memblock.memory;
> +	int i, ret, start_rgn, end_rgn;
> +
> +	ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
> +	if (ret)
> +		return ret;
> +
> +	for (i = start_rgn; i < end_rgn; i++) {
> +		type->regions[i].cache_size = cache_size;
> +		type->regions[i].direct_mapped = direct_mapped;
> +	}
> +
> +	return 0;
> +}
> +#endif
> +
>  /**
>   * memblock_setclr_flag - set or clear flag for a memory region
>   * @base: base address of the region
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v5 3/5] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2018-12-15  1:48 ` [PATCH v5 3/5] mm: Shuffle initial free memory to improve memory-side-cache utilization Dan Williams
@ 2018-12-16 12:43   ` Mike Rapoport
  2018-12-17 19:56     ` Dan Williams
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Rapoport @ 2018-12-16 12:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Michal Hocko, Kees Cook, Dave Hansen, peterz, linux-mm,
	x86, linux-kernel

On Fri, Dec 14, 2018 at 05:48:46PM -0800, Dan Williams wrote:
> Randomization of the page allocator improves the average utilization of
> a direct-mapped memory-side-cache. Memory side caching is a platform
> capability that Linux has been previously exposed to in HPC
> (high-performance computing) environments on specialty platforms. In
> that instance it was a smaller pool of high-bandwidth-memory relative to
> higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> be found on general purpose server platforms where DRAM is a cache in
> front of higher latency persistent memory [1].
> 
> Robert offered an explanation of the state of the art of Linux
> interactions with memory-side-caches [2], and I copy it here:
> 
>     It's been a problem in the HPC space:
>     http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
> 
>     A kernel module called zonesort is available to try to help:
>     https://software.intel.com/en-us/articles/xeon-phi-software
> 
>     and this abandoned patch series proposed that for the kernel:
>     https://lkml.org/lkml/2017/8/23/195
> 
>     Dan's patch series doesn't attempt to ensure buffers won't conflict, but
>     also reduces the chance that the buffers will. This will make performance
>     more consistent, albeit slower than "optimal" (which is near impossible
>     to attain in a general-purpose kernel).  That's better than forcing
>     users to deploy remedies like:
>         "To eliminate this gradual degradation, we have added a Stream
>          measurement to the Node Health Check that follows each job;
>          nodes are rebooted whenever their measured memory bandwidth
>          falls below 300 GB/s."
> 
> A replacement for zonesort was merged upstream in commit cc9aec03e58f
> "x86/numa_emulation: Introduce uniform split capability". With this
> numa_emulation capability, memory can be split into cache sized
> ("near-memory" sized) numa nodes. A bind operation to such a node, and
> disabling workloads on other nodes, enables full cache performance.
> However, once the workload exceeds the cache size then cache conflicts
> are unavoidable. While HPC environments might be able to tolerate
> time-scheduling of cache sized workloads, for general purpose server
> platforms, the oversubscribed cache case will be the common case.
> 
> The worst case scenario is that a server system owner benchmarks a
> workload at boot with an un-contended cache only to see that performance
> degrade over time, even below the average cache performance due to
> excessive conflicts. Randomization clips the peaks and fills in the
> valleys of cache utilization to yield steady average performance.
> 
> Here are some performance impact details of the patches:
> 
> 1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
> 3X speedup in a contrived case that tries to force cache conflicts. The
> contrived cased used the numa_emulation capability to force an instance
> of the benchmark to be run in two of the near-memory sized numa nodes.
> If both instances were placed on the same emulated they would fit and
> cause zero conflicts.  While on separate emulated nodes without
> randomization they underutilized the cache and conflicted unnecessarily
> due to the in-order allocation per node.
> 
> 2/ A well known Java server application benchmark was run with a heap
> size that exceeded cache size by 3X. The cache conflict rate was 8% for
> the first run and degraded to 21% after page allocator aging. With
> randomization enabled the rate levelled out at 11%.
> 
> 3/ A MongoDB workload did not observe measurable difference in
> cache-conflict rates, but the overall throughput dropped by 7% with
> randomization in one case.
> 
> 4/ Mel Gorman ran his suite of performance workloads with randomization
> enabled on platforms without a memory-side-cache and saw a mix of some
> improvements and some losses [3].
> 
> While there is potentially significant improvement for applications that
> depend on low latency access across a wide working-set, the performance
> may be negligible to negative for other workloads. For this reason the
> shuffle capability defaults to off unless a direct-mapped
> memory-side-cache is detected. Even then, the page_alloc.shuffle=0
> parameter can be specified to disable the randomization on those
> systems.
> 
> Outside of memory-side-cache utilization concerns there is potentially
> security benefit from randomization. Some data exfiltration and
> return-oriented-programming attacks rely on the ability to infer the
> location of sensitive data objects. The kernel page allocator,
> especially early in system boot, has predictable first-in-first out
> behavior for physical pages. Pages are freed in physical address order
> when first onlined.
> 
> Quoting Kees:
>     "While we already have a base-address randomization
>      (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
>      memory layouts would certainly be using the predictability of
>      allocation ordering (i.e. for attacks where the base address isn't
>      important: only the relative positions between allocated memory).
>      This is common in lots of heap-style attacks. They try to gain
>      control over ordering by spraying allocations, etc.
> 
>      I'd really like to see this because it gives us something similar
>      to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
> 
> While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
> caches it leaves vast bulk of memory to be predictably in order
> allocated.  However, it should be noted, the concrete security benefits
> are hard to quantify, and no known CVE is mitigated by this
> randomization.
> 
> Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> when they are initially populated with free memory at boot and at
> hotplug time.
> 
> The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
> pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e.
> 10, 4MB this trades off randomization granularity for time spent
> shuffling.  MAX_ORDER-1 was chosen to be minimally invasive to the page
> allocator while still showing memory-side cache behavior improvements,
> and the expectation that the security implications of finer granularity
> randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM.
> 
> The performance impact of the shuffling appears to be in the noise
> compared to other memory initialization work. Also the bulk of the work
> is done in the background as a part of deferred_init_memmap().
> 
> This initial randomization can be undone over time so a follow-on patch
> is introduced to inject entropy on page free decisions. It is reasonable
> to ask if the page free entropy is sufficient, but it is not enough due
> to the in-order initial freeing of pages. At the start of that process
> putting page1 in front or behind page0 still keeps them close together,
> page2 is still near page1 and has a high chance of being adjacent. As
> more pages are added ordering diversity improves, but there is still
> high page locality for the low address pages and this leads to no
> significant impact to the cache conflict rate.
> 
> [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> [2]: https://lkml.org/lkml/2018/9/22/54
> [3]: https://lkml.org/lkml/2018/10/12/309
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/list.h   |   17 ++++
>  include/linux/mm.h     |   38 +++++++++
>  include/linux/mmzone.h |    4 +
>  init/Kconfig           |   36 ++++++++
>  mm/Makefile            |    7 +-
>  mm/memblock.c          |   21 ++++-
>  mm/memory_hotplug.c    |    2 
>  mm/page_alloc.c        |    2 
>  mm/shuffle.c           |  206 ++++++++++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 329 insertions(+), 4 deletions(-)
>  create mode 100644 mm/shuffle.c
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index edb7628e46ed..3dfb8953f241 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
>  	INIT_LIST_HEAD(old);
>  }
> 
> +/**
> + * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
> + * @entry1: the location to place entry2
> + * @entry2: the location to place entry1
> + */
> +static inline void list_swap(struct list_head *entry1,
> +			     struct list_head *entry2)
> +{
> +	struct list_head *pos = entry2->prev;
> +
> +	list_del(entry2);
> +	list_replace(entry1, entry2);
> +	if (pos == entry1)
> +		pos = entry2;
> +	list_add(entry1, pos);
> +}
> +
>  /**
>   * list_del_init - deletes entry from list and reinitialize it.
>   * @entry: the element to delete from the list.
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5411de93a363..f9647779e82b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2069,6 +2069,44 @@ extern void mem_init_print_info(const char *str);
> 
>  extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
> 
> +#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
> +DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +extern void page_alloc_shuffle_enable(void);
> +extern void __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
> +		unsigned long end_pfn);
> +static inline void shuffle_free_memory(pg_data_t *pgdat,
> +		unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	if (!static_branch_unlikely(&page_alloc_shuffle_key))
> +		return;
> +	__shuffle_free_memory(pgdat, start_pfn, end_pfn);
> +}
> +
> +extern void __shuffle_zone(struct zone *z, unsigned long start_pfn,
> +		unsigned long end_pfn);
> +static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
> +		unsigned long end_pfn)
> +{
> +	if (!static_branch_unlikely(&page_alloc_shuffle_key))
> +		return;
> +	__shuffle_zone(z, start_pfn, end_pfn);
> +}
> +#else
> +static inline void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
> +		unsigned long end_pfn)
> +{
> +}
> +
> +static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
> +		unsigned long end_pfn)
> +{
> +}
> +
> +static inline void page_alloc_shuffle_enable(void)
> +{
> +}
> +#endif
> +
>  /* Free the reserved page into the buddy system, so it gets managed. */
>  static inline void __free_reserved_page(struct page *page)
>  {
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 847705a6d0ec..eafa66d66232 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1266,6 +1266,10 @@ void sparse_init(void);
>  #else
>  #define sparse_init()	do {} while (0)
>  #define sparse_index_init(_sec, _nid)  do {} while (0)
> +static inline int pfn_present(unsigned long pfn)
> +{
> +	return 1;
> +}
>  #endif /* CONFIG_SPARSEMEM */
> 
>  /*
> diff --git a/init/Kconfig b/init/Kconfig
> index cf5b5a0dcbc2..fa6812d995ec 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1720,6 +1720,42 @@ config SLAB_FREELIST_HARDENED
>  	  sacrifies to harden the kernel slab allocator against common
>  	  freelist exploit methods.
> 
> +config SHUFFLE_PAGE_ALLOCATOR
> +	bool "Page allocator randomization"
> +	depends on HAVE_MEMBLOCK_CACHE_INFO
> +	default SLAB_FREELIST_RANDOM
> +	help
> +	  Randomization of the page allocator improves the average
> +	  utilization of a direct-mapped memory-side-cache. See section
> +	  5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
> +	  6.2a specification for an example of how a platform advertises
> +	  the presence of a memory-side-cache. There are also incidental
> +	  security benefits as it reduces the predictability of page
> +	  allocations to compliment SLAB_FREELIST_RANDOM, but the
> +	  default granularity of shuffling on 4MB (MAX_ORDER) pages is
> +	  selected based on cache utilization benefits.
> +
> +	  While the randomization improves cache utilization it may
> +	  negatively impact workloads on platforms without a cache. For
> +	  this reason, by default, the randomization is enabled only
> +	  after runtime detection of a direct-mapped memory-side-cache.
> +	  Otherwise, the randomization may be force enabled with the
> +	  'page_alloc.shuffle' kernel command line parameter.
> +
> +	  Say Y if unsure.
> +
> +config SHUFFLE_PAGE_ORDER
> +	depends on SHUFFLE_PAGE_ALLOCATOR
> +	int "Page allocator shuffle order"
> +	range 0 10
> +	default 10
> +	help
> +	  Specify the granularity at which shuffling (randomization) is
> +	  performed. By default this is set to MAX_ORDER-1 to minimize
> +	  runtime impact of randomization and with the expectation that
> +	  SLAB_FREELIST_RANDOM mitigates heap attacks on smaller
> +	  object granularities.
> +
>  config SLUB_CPU_PARTIAL
>  	default y
>  	depends on SLUB && SMP
> diff --git a/mm/Makefile b/mm/Makefile
> index d210cc9d6f80..ac5e5ba78874 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,7 +33,7 @@ mmu-$(CONFIG_MMU)	+= process_vm_access.o
>  endif
> 
>  obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
> -			   maccess.o page_alloc.o page-writeback.o \
> +			   maccess.o page-writeback.o \
>  			   readahead.o swap.o truncate.o vmscan.o shmem.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
> @@ -41,6 +41,11 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   interval_tree.o list_lru.o workingset.o \
>  			   debug.o $(mmu-y)
> 
> +# Give 'page_alloc' its own module-parameter namespace
> +page-alloc-y := page_alloc.o
> +page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> +
> +obj-y += page-alloc.o
>  obj-y += init-mm.o
>  obj-y += memblock.o
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 185bfd4e87bb..fd617928ccc1 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -834,8 +834,16 @@ int __init_memblock memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
>  		return ret;
> 
>  	for (i = start_rgn; i < end_rgn; i++) {
> -		type->regions[i].cache_size = cache_size;
> -		type->regions[i].direct_mapped = direct_mapped;
> +		struct memblock_region *r = &type->regions[i];
> +
> +		r->cache_size = cache_size;
> +		r->direct_mapped = direct_mapped;

I think this change can be merged into the previous patch

> +		/*
> +		 * Enable randomization for amortizing direct-mapped
> +		 * memory-side-cache conflicts.
> +		 */
> +		if (r->size > r->cache_size && r->direct_mapped)
> +			page_alloc_shuffle_enable();

It seems that this is the only use for ->direct_mapped in the memblock
code. Wouldn't cache_size != 0 suffice? I.e., in the code that sets the
memblock region attributes, the cache_size can be set to 0 for the non
direct mapped caches, isn't it?

>  	}
> 
>  	return 0;
> @@ -1957,9 +1965,16 @@ static unsigned long __init free_low_memory_core_early(void)
>  	 *  low ram will be on Node1
>  	 */
>  	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
> -				NULL)
> +				NULL) {
> +		pg_data_t *pgdat;
> +
>  		count += __free_memory_core(start, end);
> 
> +		for_each_online_pgdat(pgdat)
> +			shuffle_free_memory(pgdat, PHYS_PFN(start),
> +					PHYS_PFN(end));
> +	}
> +
>  	return count;
>  }
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2b2b3ccbbfb5..fb17c54d47b0 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -895,6 +895,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	zone->zone_pgdat->node_present_pages += onlined_pages;
>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
> 
> +	shuffle_zone(zone, pfn, zone_end_pfn(zone));
> +
>  	if (onlined_pages) {
>  		node_states_set_node(nid, &arg);
>  		if (need_zonelists_rebuild)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2ec9cc407216..a51031cf16fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1595,6 +1595,8 @@ static int __init deferred_init_memmap(void *data)
>  	}
>  	pgdat_resize_unlock(pgdat, &flags);
> 
> +	shuffle_zone(zone, first_init_pfn, zone_end_pfn(zone));
> +
>  	/* Sanity check that the next zone really is unpopulated */
>  	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
> 
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> new file mode 100644
> index 000000000000..621fde268d01
> --- /dev/null
> +++ b/mm/shuffle.c
> @@ -0,0 +1,206 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> +
> +#include <linux/mm.h>
> +#include <linux/init.h>
> +#include <linux/mmzone.h>
> +#include <linux/random.h>
> +#include <linux/moduleparam.h>
> +#include "internal.h"
> +
> +DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +static unsigned long shuffle_state;
> +enum {
> +	SHUFFLE_ENABLE,
> +	SHUFFLE_FORCE_DISABLE,
> +};
> +
> +void page_alloc_shuffle_enable(void)
> +{
> +	if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
> +		/*
> +		 * If platform cache auto-detect runs before module
> +		 * parameter parsing, disable the shuffle implementation
> +		 */
> +		if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
> +			static_branch_disable(&page_alloc_shuffle_key);
> +	} else if (!test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
> +		static_branch_enable(&page_alloc_shuffle_key);
> +}
> +
> +static bool shuffle_param;
> +static int force_shuffle(const char *val, const struct kernel_param *kp)
> +{
> +	int rc = param_set_bool(val, kp);
> +
> +	if (rc < 0)
> +		return rc;
> +	if (shuffle_param)
> +		page_alloc_shuffle_enable();
> +	else
> +		set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
> +	return 0;
> +}
> +module_param_call(shuffle, force_shuffle, param_get_bool, &shuffle_param, 0400);
> +
> +/*
> + * For two pages to be swapped in the shuffle, they must be free (on a
> + * 'free_area' lru), have the same order, and have the same migratetype.
> + */
> +static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
> +{
> +	struct page *page;
> +
> +	/*
> +	 * Given we're dealing with randomly selected pfns in a zone we
> +	 * need to ask questions like...
> +	 */
> +
> +	/* ...is the pfn even in the memmap? */
> +	if (!pfn_valid_within(pfn))
> +		return NULL;
> +
> +	/* ...is the pfn in a present section or a hole? */
> +	if (!pfn_present(pfn))
> +		return NULL;
> +
> +	/* ...is the page free and currently on a free_area list? */
> +	page = pfn_to_page(pfn);
> +	if (!PageBuddy(page))
> +		return NULL;
> +
> +	/*
> +	 * ...is the page on the same list as the page we will
> +	 * shuffle it with?
> +	 */
> +	if (page_order(page) != order)
> +		return NULL;
> +
> +	return page;
> +}
> +
> +/*
> + * Fisher-Yates shuffle the freelist which prescribes iterating through
> + * an array, pfns in this case, and randomly swapping each entry with
> + * another in the span, end_pfn - start_pfn.
> + *
> + * To keep the implementation simple it does not attempt to correct for
> + * sources of bias in the distribution, like modulo bias or
> + * pseudo-random number generator bias. I.e. the expectation is that
> + * this shuffling raises the bar for attacks that exploit the
> + * predictability of page allocations, but need not be a perfect
> + * shuffle.
> + *
> + * Note that we don't use @z->zone_start_pfn and zone_end_pfn(@z)
> + * directly since the caller may be aware of holes in the zone and can
> + * improve the accuracy of the random pfn selection.
> + */
> +#define SHUFFLE_RETRY 10
> +static void __meminit shuffle_zone_order(struct zone *z, unsigned long start_pfn,
> +		unsigned long end_pfn, const int order)
> +{
> +	unsigned long i, flags;
> +	const int order_pages = 1 << order;
> +
> +	if (start_pfn < z->zone_start_pfn)
> +		start_pfn = z->zone_start_pfn;
> +	if (end_pfn > zone_end_pfn(z))
> +		end_pfn = zone_end_pfn(z);
> +
> +	/* probably means that start/end were outside the zone */
> +	if (end_pfn <= start_pfn)
> +		return;
> +	spin_lock_irqsave(&z->lock, flags);
> +	start_pfn = ALIGN(start_pfn, order_pages);
> +	for (i = start_pfn; i < end_pfn; i += order_pages) {
> +		unsigned long j;
> +		int migratetype, retry;
> +		struct page *page_i, *page_j;
> +
> +		/*
> +		 * We expect page_i, in the sub-range of a zone being
> +		 * added (@start_pfn to @end_pfn), to more likely be
> +		 * valid compared to page_j randomly selected in the
> +		 * span @zone_start_pfn to @spanned_pages.
> +		 */
> +		page_i = shuffle_valid_page(i, order);
> +		if (!page_i)
> +			continue;
> +
> +		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
> +			/*
> +			 * Pick a random order aligned page from the
> +			 * start of the zone. Use the *whole* zone here
> +			 * so that if it is freed in tiny pieces that we
> +			 * randomize in the whole zone, not just within
> +			 * those fragments.
> +			 *
> +			 * Since page_j comes from a potentially sparse
> +			 * address range we want to try a bit harder to
> +			 * find a shuffle point for page_i.
> +			 */
> +			j = z->zone_start_pfn +
> +				ALIGN_DOWN(get_random_long() % z->spanned_pages,
> +						order_pages);
> +			page_j = shuffle_valid_page(j, order);
> +			if (page_j && page_j != page_i)
> +				break;
> +		}
> +		if (retry >= SHUFFLE_RETRY) {
> +			pr_debug("%s: failed to swap %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		/*
> +		 * Each migratetype corresponds to its own list, make
> +		 * sure the types match otherwise we're moving pages to
> +		 * lists where they do not belong.
> +		 */
> +		migratetype = get_pageblock_migratetype(page_i);
> +		if (get_pageblock_migratetype(page_j) != migratetype) {
> +			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		list_swap(&page_i->lru, &page_j->lru);
> +
> +		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
> +
> +		/* take it easy on the zone lock */
> +		if ((i % (100 * order_pages)) == 0) {
> +			spin_unlock_irqrestore(&z->lock, flags);
> +			cond_resched();
> +			spin_lock_irqsave(&z->lock, flags);
> +		}
> +	}
> +	spin_unlock_irqrestore(&z->lock, flags);
> +}
> +
> +void __meminit __shuffle_zone(struct zone *z, unsigned long start_pfn,
> +               unsigned long end_pfn)
> +{
> +       int i;
> +
> +       /* shuffle all the orders at the specified order and higher */
> +       for (i = CONFIG_SHUFFLE_PAGE_ORDER; i < MAX_ORDER; i++)
> +               shuffle_zone_order(z, start_pfn, end_pfn, i);
> +}
> +
> +/**
> + * shuffle_free_memory - reduce the predictability of the page allocator
> + * @pgdat: node page data
> + * @start_pfn: Limit the shuffle to the greater of this value or zone start
> + * @end_pfn: Limit the shuffle to the less of this value or zone end
> + *
> + * While shuffle_zone() attempts to avoid holes with pfn_valid() and
> + * pfn_present() they can not report sub-section sized holes. @start_pfn
> + * and @end_pfn limit the shuffle to the exact memory pages being freed.
> + */
> +void __meminit __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
> +		unsigned long end_pfn)
> +{
> +	struct zone *z;
> +
> +	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> +		shuffle_zone(z, start_pfn, end_pfn);
> +}
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v5 0/5] mm: Randomize free memory
  2018-12-15  1:48 [PATCH v5 0/5] mm: Randomize free memory Dan Williams
                   ` (4 preceding siblings ...)
  2018-12-15  1:48 ` [PATCH v5 5/5] mm: Maintain randomization of page free lists Dan Williams
@ 2018-12-17 10:10 ` Rafael J. Wysocki
  2018-12-17 16:32   ` Dan Williams
  5 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2018-12-17 10:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Rafael J. Wysocki, Keith Busch, Mike Rapoport, Kees Cook,
	x86, Michal Hocko, Dave Hansen, Peter Zijlstra, Andy Lutomirski,
	linux-mm, linux-kernel

On Saturday, December 15, 2018 2:48:30 AM CET Dan Williams wrote:
> Changes since v4: [1]
> * Default the randomization to off and enable it dynamically based on
>   the detection of a memory side cache advertised by platform firmware.
>   In the case of x86 this enumeration comes from the ACPI HMAT. (Michal
>   and Mel)
> * Improve the changelog of the patch that introduces the shuffling to
>   clarify the motivation and better explain the tradeoffs. (Michal and
>   Mel)
> * Include the required HMAT enabling in the series.
> 
> [1]: https://lkml.kernel.org/r/153922180166.838512.8260339805733812034.stgit@dwillia2-desk3.amr.corp.intel.com
> 
> ---
> 
> Quote patch 3:
> 
> Randomization of the page allocator improves the average utilization of
> a direct-mapped memory-side-cache. Memory side caching is a platform
> capability that Linux has been previously exposed to in HPC
> (high-performance computing) environments on specialty platforms. In
> that instance it was a smaller pool of high-bandwidth-memory relative to
> higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> be found on general purpose server platforms where DRAM is a cache in
> front of higher latency persistent memory [2].
> 
> Robert offered an explanation of the state of the art of Linux
> interactions with memory-side-caches [3], and I copy it here:
> 
>     It's been a problem in the HPC space:
>     http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
> 
>     A kernel module called zonesort is available to try to help:
>     https://software.intel.com/en-us/articles/xeon-phi-software
> 
>     and this abandoned patch series proposed that for the kernel:
>     https://lkml.org/lkml/2017/8/23/195
> 
>     Dan's patch series doesn't attempt to ensure buffers won't conflict, but
>     also reduces the chance that the buffers will. This will make performance
>     more consistent, albeit slower than "optimal" (which is near impossible
>     to attain in a general-purpose kernel).  That's better than forcing
>     users to deploy remedies like:
>         "To eliminate this gradual degradation, we have added a Stream
>          measurement to the Node Health Check that follows each job;
>          nodes are rebooted whenever their measured memory bandwidth
>          falls below 300 GB/s."
> 
> A replacement for zonesort was merged upstream in commit cc9aec03e58f
> "x86/numa_emulation: Introduce uniform split capability". With this
> numa_emulation capability, memory can be split into cache sized
> ("near-memory" sized) numa nodes. A bind operation to such a node, and
> disabling workloads on other nodes, enables full cache performance.
> However, once the workload exceeds the cache size then cache conflicts
> are unavoidable. While HPC environments might be able to tolerate
> time-scheduling of cache sized workloads, for general purpose server
> platforms, the oversubscribed cache case will be the common case.
> 
> The worst case scenario is that a server system owner benchmarks a
> workload at boot with an un-contended cache only to see that performance
> degrade over time, even below the average cache performance due to
> excessive conflicts. Randomization clips the peaks and fills in the
> valleys of cache utilization to yield steady average performance.
> 
> See patch 3 for more details.
> 
> [2]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> [3]: https://lkml.org/lkml/2018/9/22/54

Has this hibernation been tested with this series applied?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v5 0/5] mm: Randomize free memory
  2018-12-17 10:10 ` [PATCH v5 0/5] mm: Randomize free memory Rafael J. Wysocki
@ 2018-12-17 16:32   ` Dan Williams
  2018-12-18 10:45     ` Rafael J. Wysocki
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Williams @ 2018-12-17 16:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Rafael J. Wysocki, Keith Busch, Mike Rapoport,
	Kees Cook, X86 ML, Michal Hocko, Dave Hansen, Peter Zijlstra,
	Andy Lutomirski, Linux MM, Linux Kernel Mailing List

On Mon, Dec 17, 2018 at 2:12 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Saturday, December 15, 2018 2:48:30 AM CET Dan Williams wrote:
> > Changes since v4: [1]
> > * Default the randomization to off and enable it dynamically based on
> >   the detection of a memory side cache advertised by platform firmware.
> >   In the case of x86 this enumeration comes from the ACPI HMAT. (Michal
> >   and Mel)
> > * Improve the changelog of the patch that introduces the shuffling to
> >   clarify the motivation and better explain the tradeoffs. (Michal and
> >   Mel)
> > * Include the required HMAT enabling in the series.
> >
> > [1]: https://lkml.kernel.org/r/153922180166.838512.8260339805733812034.stgit@dwillia2-desk3.amr.corp.intel.com
> >
> > ---
> >
> > Quote patch 3:
> >
> > Randomization of the page allocator improves the average utilization of
> > a direct-mapped memory-side-cache. Memory side caching is a platform
> > capability that Linux has been previously exposed to in HPC
> > (high-performance computing) environments on specialty platforms. In
> > that instance it was a smaller pool of high-bandwidth-memory relative to
> > higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> > be found on general purpose server platforms where DRAM is a cache in
> > front of higher latency persistent memory [2].
> >
> > Robert offered an explanation of the state of the art of Linux
> > interactions with memory-side-caches [3], and I copy it here:
> >
> >     It's been a problem in the HPC space:
> >     http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
> >
> >     A kernel module called zonesort is available to try to help:
> >     https://software.intel.com/en-us/articles/xeon-phi-software
> >
> >     and this abandoned patch series proposed that for the kernel:
> >     https://lkml.org/lkml/2017/8/23/195
> >
> >     Dan's patch series doesn't attempt to ensure buffers won't conflict, but
> >     also reduces the chance that the buffers will. This will make performance
> >     more consistent, albeit slower than "optimal" (which is near impossible
> >     to attain in a general-purpose kernel).  That's better than forcing
> >     users to deploy remedies like:
> >         "To eliminate this gradual degradation, we have added a Stream
> >          measurement to the Node Health Check that follows each job;
> >          nodes are rebooted whenever their measured memory bandwidth
> >          falls below 300 GB/s."
> >
> > A replacement for zonesort was merged upstream in commit cc9aec03e58f
> > "x86/numa_emulation: Introduce uniform split capability". With this
> > numa_emulation capability, memory can be split into cache sized
> > ("near-memory" sized) numa nodes. A bind operation to such a node, and
> > disabling workloads on other nodes, enables full cache performance.
> > However, once the workload exceeds the cache size then cache conflicts
> > are unavoidable. While HPC environments might be able to tolerate
> > time-scheduling of cache sized workloads, for general purpose server
> > platforms, the oversubscribed cache case will be the common case.
> >
> > The worst case scenario is that a server system owner benchmarks a
> > workload at boot with an un-contended cache only to see that performance
> > degrade over time, even below the average cache performance due to
> > excessive conflicts. Randomization clips the peaks and fills in the
> > valleys of cache utilization to yield steady average performance.
> >
> > See patch 3 for more details.
> >
> > [2]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> > [3]: https://lkml.org/lkml/2018/9/22/54
>
> Has this hibernation been tested with this series applied?

It has not. Is QEMU sufficient? What's your concern?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v5 3/5] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2018-12-16 12:43   ` Mike Rapoport
@ 2018-12-17 19:56     ` Dan Williams
  2018-12-18  9:11       ` Mike Rapoport
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Williams @ 2018-12-17 19:56 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Michal Hocko, Kees Cook, Dave Hansen,
	Peter Zijlstra, Linux MM, X86 ML, Linux Kernel Mailing List

On Sun, Dec 16, 2018 at 4:43 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Fri, Dec 14, 2018 at 05:48:46PM -0800, Dan Williams wrote:
> > Randomization of the page allocator improves the average utilization of
> > a direct-mapped memory-side-cache. Memory side caching is a platform
> > capability that Linux has been previously exposed to in HPC
> > (high-performance computing) environments on specialty platforms. In
> > that instance it was a smaller pool of high-bandwidth-memory relative to
> > higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> > be found on general purpose server platforms where DRAM is a cache in
> > front of higher latency persistent memory [1].
[..]
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 185bfd4e87bb..fd617928ccc1 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -834,8 +834,16 @@ int __init_memblock memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
> >               return ret;
> >
> >       for (i = start_rgn; i < end_rgn; i++) {
> > -             type->regions[i].cache_size = cache_size;
> > -             type->regions[i].direct_mapped = direct_mapped;
> > +             struct memblock_region *r = &type->regions[i];
> > +
> > +             r->cache_size = cache_size;
> > +             r->direct_mapped = direct_mapped;
>
> I think this change can be merged into the previous patch

Ok, will do.

> > +             /*
> > +              * Enable randomization for amortizing direct-mapped
> > +              * memory-side-cache conflicts.
> > +              */
> > +             if (r->size > r->cache_size && r->direct_mapped)
> > +                     page_alloc_shuffle_enable();
>
> It seems that this is the only use for ->direct_mapped in the memblock
> code. Wouldn't cache_size != 0 suffice? I.e., in the code that sets the
> memblock region attributes, the cache_size can be set to 0 for the non
> direct mapped caches, isn't it?
>

The HMAT specification allows for other cache-topologies, so it's not
sufficient to just look for non-zero size when a platform implements a
set-associative cache. The expectation is that a set-associative cache
would not need the kernel to perform memory randomization to improve
the cache utilization.

The check for memory size > cache-size is a sanity check for a
platform BIOS or system configuration that mis-reports or mis-sizes
the cache.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v5 3/5] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2018-12-17 19:56     ` Dan Williams
@ 2018-12-18  9:11       ` Mike Rapoport
  2018-12-18 19:07         ` Dan Williams
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Rapoport @ 2018-12-18  9:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Michal Hocko, Kees Cook, Dave Hansen,
	Peter Zijlstra, Linux MM, X86 ML, Linux Kernel Mailing List

On Mon, Dec 17, 2018 at 11:56:36AM -0800, Dan Williams wrote:
> On Sun, Dec 16, 2018 at 4:43 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Fri, Dec 14, 2018 at 05:48:46PM -0800, Dan Williams wrote:
> > > Randomization of the page allocator improves the average utilization of
> > > a direct-mapped memory-side-cache. Memory side caching is a platform
> > > capability that Linux has been previously exposed to in HPC
> > > (high-performance computing) environments on specialty platforms. In
> > > that instance it was a smaller pool of high-bandwidth-memory relative to
> > > higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> > > be found on general purpose server platforms where DRAM is a cache in
> > > front of higher latency persistent memory [1].
> [..]
> > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > index 185bfd4e87bb..fd617928ccc1 100644
> > > --- a/mm/memblock.c
> > > +++ b/mm/memblock.c
> > > @@ -834,8 +834,16 @@ int __init_memblock memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
> > >               return ret;
> > >
> > >       for (i = start_rgn; i < end_rgn; i++) {
> > > -             type->regions[i].cache_size = cache_size;
> > > -             type->regions[i].direct_mapped = direct_mapped;
> > > +             struct memblock_region *r = &type->regions[i];
> > > +
> > > +             r->cache_size = cache_size;
> > > +             r->direct_mapped = direct_mapped;
> >
> > I think this change can be merged into the previous patch
> 
> Ok, will do.
> 
> > > +             /*
> > > +              * Enable randomization for amortizing direct-mapped
> > > +              * memory-side-cache conflicts.
> > > +              */
> > > +             if (r->size > r->cache_size && r->direct_mapped)
> > > +                     page_alloc_shuffle_enable();
> >
> > It seems that this is the only use for ->direct_mapped in the memblock
> > code. Wouldn't cache_size != 0 suffice? I.e., in the code that sets the
> > memblock region attributes, the cache_size can be set to 0 for the non
> > direct mapped caches, isn't it?
> >
> 
> The HMAT specification allows for other cache-topologies, so it's not
> sufficient to just look for non-zero size when a platform implements a
> set-associative cache. The expectation is that a set-associative cache
> would not need the kernel to perform memory randomization to improve
> the cache utilization.
> 
> The check for memory size > cache-size is a sanity check for a
> platform BIOS or system configuration that mis-reports or mis-sizes
> the cache.

Apparently I didn't explain my point well.

The acpi_numa_memory_affinity_init() already knows whether the cache is
direct mapped or a set-associative. It can just skip calling
memblock_set_sidecache() for the set-associative case.

Another thing I've noticed only now, is that memory randomization is
enabled if there is at least one memory region with a direct mapped side
cache attached and once the randomization is on the cache size and the
mapping mode do not matter. So, I think it's not necessary to store them in
the memory region at all.

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v5 0/5] mm: Randomize free memory
  2018-12-17 16:32   ` Dan Williams
@ 2018-12-18 10:45     ` Rafael J. Wysocki
  2018-12-19 20:25       ` Dan Williams
  0 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2018-12-18 10:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Rafael J. Wysocki, Keith Busch, Mike Rapoport,
	Kees Cook, X86 ML, Michal Hocko, Dave Hansen, Peter Zijlstra,
	Andy Lutomirski, Linux MM, Linux Kernel Mailing List

On Monday, December 17, 2018 5:32:10 PM CET Dan Williams wrote:
> On Mon, Dec 17, 2018 at 2:12 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > On Saturday, December 15, 2018 2:48:30 AM CET Dan Williams wrote:
> > > Changes since v4: [1]
> > > * Default the randomization to off and enable it dynamically based on
> > >   the detection of a memory side cache advertised by platform firmware.
> > >   In the case of x86 this enumeration comes from the ACPI HMAT. (Michal
> > >   and Mel)
> > > * Improve the changelog of the patch that introduces the shuffling to
> > >   clarify the motivation and better explain the tradeoffs. (Michal and
> > >   Mel)
> > > * Include the required HMAT enabling in the series.
> > >
> > > [1]: https://lkml.kernel.org/r/153922180166.838512.8260339805733812034.stgit@dwillia2-desk3.amr.corp.intel.com
> > >
> > > ---
> > >
> > > Quote patch 3:
> > >
> > > Randomization of the page allocator improves the average utilization of
> > > a direct-mapped memory-side-cache. Memory side caching is a platform
> > > capability that Linux has been previously exposed to in HPC
> > > (high-performance computing) environments on specialty platforms. In
> > > that instance it was a smaller pool of high-bandwidth-memory relative to
> > > higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> > > be found on general purpose server platforms where DRAM is a cache in
> > > front of higher latency persistent memory [2].
> > >
> > > Robert offered an explanation of the state of the art of Linux
> > > interactions with memory-side-caches [3], and I copy it here:
> > >
> > >     It's been a problem in the HPC space:
> > >     http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
> > >
> > >     A kernel module called zonesort is available to try to help:
> > >     https://software.intel.com/en-us/articles/xeon-phi-software
> > >
> > >     and this abandoned patch series proposed that for the kernel:
> > >     https://lkml.org/lkml/2017/8/23/195
> > >
> > >     Dan's patch series doesn't attempt to ensure buffers won't conflict, but
> > >     also reduces the chance that the buffers will. This will make performance
> > >     more consistent, albeit slower than "optimal" (which is near impossible
> > >     to attain in a general-purpose kernel).  That's better than forcing
> > >     users to deploy remedies like:
> > >         "To eliminate this gradual degradation, we have added a Stream
> > >          measurement to the Node Health Check that follows each job;
> > >          nodes are rebooted whenever their measured memory bandwidth
> > >          falls below 300 GB/s."
> > >
> > > A replacement for zonesort was merged upstream in commit cc9aec03e58f
> > > "x86/numa_emulation: Introduce uniform split capability". With this
> > > numa_emulation capability, memory can be split into cache sized
> > > ("near-memory" sized) numa nodes. A bind operation to such a node, and
> > > disabling workloads on other nodes, enables full cache performance.
> > > However, once the workload exceeds the cache size then cache conflicts
> > > are unavoidable. While HPC environments might be able to tolerate
> > > time-scheduling of cache sized workloads, for general purpose server
> > > platforms, the oversubscribed cache case will be the common case.
> > >
> > > The worst case scenario is that a server system owner benchmarks a
> > > workload at boot with an un-contended cache only to see that performance
> > > degrade over time, even below the average cache performance due to
> > > excessive conflicts. Randomization clips the peaks and fills in the
> > > valleys of cache utilization to yield steady average performance.
> > >
> > > See patch 3 for more details.
> > >
> > > [2]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> > > [3]: https://lkml.org/lkml/2018/9/22/54
> >
> > Has this hibernation been tested with this series applied?
> 
> It has not. Is QEMU sufficient? What's your concern?

Well, hibernation does quite a bit of memory management and that involves
free memory too.  I'm not expecting any particular issues, but I may be
overlooking something and I would like to know that it doesn't break before
the changes go in.

QEMU should be sufficient, but let me talk to the power lab folks if they can
test that for you.

Is there a git branch with these changes available somewhere?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v5 3/5] mm: Shuffle initial free memory to improve memory-side-cache utilization
  2018-12-18  9:11       ` Mike Rapoport
@ 2018-12-18 19:07         ` Dan Williams
  0 siblings, 0 replies; 15+ messages in thread
From: Dan Williams @ 2018-12-18 19:07 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Michal Hocko, Kees Cook, Dave Hansen,
	Peter Zijlstra, Linux MM, X86 ML, Linux Kernel Mailing List

On Tue, Dec 18, 2018 at 1:11 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Mon, Dec 17, 2018 at 11:56:36AM -0800, Dan Williams wrote:
> > On Sun, Dec 16, 2018 at 4:43 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > >
> > > On Fri, Dec 14, 2018 at 05:48:46PM -0800, Dan Williams wrote:
> > > > Randomization of the page allocator improves the average utilization of
> > > > a direct-mapped memory-side-cache. Memory side caching is a platform
> > > > capability that Linux has been previously exposed to in HPC
> > > > (high-performance computing) environments on specialty platforms. In
> > > > that instance it was a smaller pool of high-bandwidth-memory relative to
> > > > higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> > > > be found on general purpose server platforms where DRAM is a cache in
> > > > front of higher latency persistent memory [1].
> > [..]
> > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > index 185bfd4e87bb..fd617928ccc1 100644
> > > > --- a/mm/memblock.c
> > > > +++ b/mm/memblock.c
> > > > @@ -834,8 +834,16 @@ int __init_memblock memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
> > > >               return ret;
> > > >
> > > >       for (i = start_rgn; i < end_rgn; i++) {
> > > > -             type->regions[i].cache_size = cache_size;
> > > > -             type->regions[i].direct_mapped = direct_mapped;
> > > > +             struct memblock_region *r = &type->regions[i];
> > > > +
> > > > +             r->cache_size = cache_size;
> > > > +             r->direct_mapped = direct_mapped;
> > >
> > > I think this change can be merged into the previous patch
> >
> > Ok, will do.
> >
> > > > +             /*
> > > > +              * Enable randomization for amortizing direct-mapped
> > > > +              * memory-side-cache conflicts.
> > > > +              */
> > > > +             if (r->size > r->cache_size && r->direct_mapped)
> > > > +                     page_alloc_shuffle_enable();
> > >
> > > It seems that this is the only use for ->direct_mapped in the memblock
> > > code. Wouldn't cache_size != 0 suffice? I.e., in the code that sets the
> > > memblock region attributes, the cache_size can be set to 0 for the non
> > > direct mapped caches, isn't it?
> > >
> >
> > The HMAT specification allows for other cache-topologies, so it's not
> > sufficient to just look for non-zero size when a platform implements a
> > set-associative cache. The expectation is that a set-associative cache
> > would not need the kernel to perform memory randomization to improve
> > the cache utilization.
> >
> > The check for memory size > cache-size is a sanity check for a
> > platform BIOS or system configuration that mis-reports or mis-sizes
> > the cache.
>
> Apparently I didn't explain my point well.
>
> The acpi_numa_memory_affinity_init() already knows whether the cache is
> direct mapped or a set-associative. It can just skip calling
> memblock_set_sidecache() for the set-associative case.
>
> Another thing I've noticed only now, is that memory randomization is
> enabled if there is at least one memory region with a direct mapped side
> cache attached and once the randomization is on the cache size and the
> mapping mode do not matter. So, I think it's not necessary to store them in
> the memory region at all.

Fair enough. I was anticipating the case when non-ACPI systems gain
this capability, but you're right no need to design that now. The size
sanity check has some small value, but given there is an override and
broken platform firmware would need to be fixed I don't think we lose
much by getting rid of it. Will re-flow without memblock integration.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v5 0/5] mm: Randomize free memory
  2018-12-18 10:45     ` Rafael J. Wysocki
@ 2018-12-19 20:25       ` Dan Williams
  0 siblings, 0 replies; 15+ messages in thread
From: Dan Williams @ 2018-12-19 20:25 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Rafael J. Wysocki, Keith Busch, Mike Rapoport,
	Kees Cook, X86 ML, Michal Hocko, Dave Hansen, Peter Zijlstra,
	Andy Lutomirski, Linux MM, Linux Kernel Mailing List

On Tue, Dec 18, 2018 at 2:46 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Monday, December 17, 2018 5:32:10 PM CET Dan Williams wrote:
> > On Mon, Dec 17, 2018 at 2:12 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > On Saturday, December 15, 2018 2:48:30 AM CET Dan Williams wrote:
> > > > Changes since v4: [1]
> > > > * Default the randomization to off and enable it dynamically based on
> > > >   the detection of a memory side cache advertised by platform firmware.
> > > >   In the case of x86 this enumeration comes from the ACPI HMAT. (Michal
> > > >   and Mel)
> > > > * Improve the changelog of the patch that introduces the shuffling to
> > > >   clarify the motivation and better explain the tradeoffs. (Michal and
> > > >   Mel)
> > > > * Include the required HMAT enabling in the series.
> > > >
> > > > [1]: https://lkml.kernel.org/r/153922180166.838512.8260339805733812034.stgit@dwillia2-desk3.amr.corp.intel.com
> > > >
> > > > ---
> > > >
> > > > Quote patch 3:
> > > >
> > > > Randomization of the page allocator improves the average utilization of
> > > > a direct-mapped memory-side-cache. Memory side caching is a platform
> > > > capability that Linux has been previously exposed to in HPC
> > > > (high-performance computing) environments on specialty platforms. In
> > > > that instance it was a smaller pool of high-bandwidth-memory relative to
> > > > higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> > > > be found on general purpose server platforms where DRAM is a cache in
> > > > front of higher latency persistent memory [2].
> > > >
> > > > Robert offered an explanation of the state of the art of Linux
> > > > interactions with memory-side-caches [3], and I copy it here:
> > > >
> > > >     It's been a problem in the HPC space:
> > > >     http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
> > > >
> > > >     A kernel module called zonesort is available to try to help:
> > > >     https://software.intel.com/en-us/articles/xeon-phi-software
> > > >
> > > >     and this abandoned patch series proposed that for the kernel:
> > > >     https://lkml.org/lkml/2017/8/23/195
> > > >
> > > >     Dan's patch series doesn't attempt to ensure buffers won't conflict, but
> > > >     also reduces the chance that the buffers will. This will make performance
> > > >     more consistent, albeit slower than "optimal" (which is near impossible
> > > >     to attain in a general-purpose kernel).  That's better than forcing
> > > >     users to deploy remedies like:
> > > >         "To eliminate this gradual degradation, we have added a Stream
> > > >          measurement to the Node Health Check that follows each job;
> > > >          nodes are rebooted whenever their measured memory bandwidth
> > > >          falls below 300 GB/s."
> > > >
> > > > A replacement for zonesort was merged upstream in commit cc9aec03e58f
> > > > "x86/numa_emulation: Introduce uniform split capability". With this
> > > > numa_emulation capability, memory can be split into cache sized
> > > > ("near-memory" sized) numa nodes. A bind operation to such a node, and
> > > > disabling workloads on other nodes, enables full cache performance.
> > > > However, once the workload exceeds the cache size then cache conflicts
> > > > are unavoidable. While HPC environments might be able to tolerate
> > > > time-scheduling of cache sized workloads, for general purpose server
> > > > platforms, the oversubscribed cache case will be the common case.
> > > >
> > > > The worst case scenario is that a server system owner benchmarks a
> > > > workload at boot with an un-contended cache only to see that performance
> > > > degrade over time, even below the average cache performance due to
> > > > excessive conflicts. Randomization clips the peaks and fills in the
> > > > valleys of cache utilization to yield steady average performance.
> > > >
> > > > See patch 3 for more details.
> > > >
> > > > [2]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> > > > [3]: https://lkml.org/lkml/2018/9/22/54
> > >
> > > Has this hibernation been tested with this series applied?
> >
> > It has not. Is QEMU sufficient? What's your concern?
>
> Well, hibernation does quite a bit of memory management and that involves
> free memory too.  I'm not expecting any particular issues, but I may be
> overlooking something and I would like to know that it doesn't break before
> the changes go in.
>
> QEMU should be sufficient, but let me talk to the power lab folks if they can
> test that for you.

Yeah, the quick QEMU test did not immediately fall over, but a
checkout by power lab folks would be much appreciated.

> Is there a git branch with these changes available somewhere?

I have posted the upcoming v7 version of the patches here:

    https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending

Note, that branch constantly rebases like tip/master.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-12-19 20:26 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-15  1:48 [PATCH v5 0/5] mm: Randomize free memory Dan Williams
2018-12-15  1:48 ` [PATCH v5 1/5] acpi: Create subtable parsing infrastructure Dan Williams
2018-12-15  1:48 ` [PATCH v5 2/5] acpi/numa: Set the memory-side-cache size in memblocks Dan Williams
2018-12-16 12:34   ` Mike Rapoport
2018-12-15  1:48 ` [PATCH v5 3/5] mm: Shuffle initial free memory to improve memory-side-cache utilization Dan Williams
2018-12-16 12:43   ` Mike Rapoport
2018-12-17 19:56     ` Dan Williams
2018-12-18  9:11       ` Mike Rapoport
2018-12-18 19:07         ` Dan Williams
2018-12-15  1:48 ` [PATCH v5 4/5] mm: Move buddy list manipulations into helpers Dan Williams
2018-12-15  1:48 ` [PATCH v5 5/5] mm: Maintain randomization of page free lists Dan Williams
2018-12-17 10:10 ` [PATCH v5 0/5] mm: Randomize free memory Rafael J. Wysocki
2018-12-17 16:32   ` Dan Williams
2018-12-18 10:45     ` Rafael J. Wysocki
2018-12-19 20:25       ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).