* [RFC 0/2] Larger Order Protection V1
@ 2018-02-16 16:01 Christoph Lameter
  2018-02-16 16:01   ` Christoph Lameter
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Christoph Lameter @ 2018-02-16 16:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz

For years we have discussed ways to allocate large contiguous memory segments more reliably
and to avoid fragmentation. This is an ad hoc scheme based on reserving higher order pages
in the page allocator. It is fully transparent and integrated into the page allocator.

This approach goes back to the meeting on contiguous memory at the Plumbers conference
in 2017 and the effort by Guy and Mike Kravetz to establish an API to map contiguous
memory segments into user space. Reservations allow contiguous memory allocations
to keep working even after the system has run for a considerable time.

Contiguous memory is also important for general system performance. F.e. slab
allocators can be made to use large frames in order to optimize performance.
See patch 1.

Other use cases are jumbo frames or device driver specific allocations.

For more on this see Mike Kravetz's patches, in particular his MMAP CONTIG
flag support at

https://lkml.org/lkml/2017/10/3/992



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC 1/2] Protect larger order pages from breaking up
  2018-02-16 16:01 [RFC 0/2] Larger Order Protection V1 Christoph Lameter
@ 2018-02-16 16:01   ` Christoph Lameter
  2018-02-16 16:01   ` Christoph Lameter
  2018-02-16 18:27 ` [RFC 0/2] Larger Order Protection V1 Christopher Lameter
  2 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2018-02-16 16:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matthew Wilcox, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Thomas Schoebel-Theuer,
	andi-Vw/NltI1exuRpAAqCnN02g, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz

[-- Attachment #1: limit_order --]
[-- Type: text/plain, Size: 10140 bytes --]

Over time as the kernel is churning through memory it will break
up larger pages and as time progresses larger contiguous allocations
will no longer be possible. This is an approach to preserve these
large pages and prevent them from being broken up.

This is useful for example for the use of jumbo pages and can
satisfy various needs of subsystems and device drivers that require
large contiguous allocations to operate properly.

The idea is to reserve a pool of pages of the required order
so that the kernel is not allowed to use the pages for allocations
of a different order. This is a pool that is fully integrated
into the page allocator and therefore transparently usable.

Control over this feature is by writing to /proc/zoneinfo.

F.e. to ensure that 2000 16K pages stay available for jumbo
frames do

	echo "2=2000" >/proc/zoneinfo

or through the order=<page spec> on the kernel command line.
F.e.

	order=2=2000,4N2=500

These pages will be subject to reclaim etc as usual but will not
be broken up.
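
As an illustration only (assuming this patch is applied), the runtime
interface can be combined with the new zoneinfo output added below to
check that a reservation took effect:

	echo "2=2000" >/proc/zoneinfo
	grep Preserve /proc/zoneinfo

The grep then shows the "Preserve ... pages of order 2" line that
zoneinfo_show_print() prints for every zone with a non-zero min.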

One can then also f.e. operate the slub allocator with
64k pages. Specify "slub_max_order=4 slub_min_order=4" on
the kernel command line and all slab allocator allocations
will occur in 64K page sizes.

Note that this will reduce the memory available to the application
in some cases. Reclaim may occur more often. If more than
the reserved number of higher order pages are being used then
allocations will still fail as normal.

In order to make this work just right one needs to be able to
know the workload well enough to reserve the right amount
of pages. This is comparable to other reservation schemes.
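
As a rough sizing example (assuming a 4K base page size), the
"2=2000" reservation above covers up to

	2000 * 2^2 * 4KB = 2000 * 16KB = ~31MB

on the selected node (node 0 by default), which lower order
allocations can no longer break into once the order 2 freelist
has drained down to that minimum.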

Well that f.e. brings up huge pages. You can of course
also use this to reserve those and can then be sure that
you can dynamically resize your huge page pools even after
a long system uptime.

The idea for this patch came from Thomas Schoebel-Theuer, whom I met
at LCA and who described the approach to me, promising
a patch that would do this. Sadly he has vanished somehow.
However, he has been using this approach to support a
production environment for numerous years.

So I redid his patch and this is the first draft of it.


Idea-by: Thomas Schoebel-Theuer <tst-0Nly+W1lFbFDiq0p6IFu4YQuADTiUCJX@public.gmane.org>

First performance tests in a virtual environment show
a hackbench improvement of 6% just from increasing
the page size used by the page allocator to order 3.

Signed-off-by: Christopher Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>

Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -96,6 +96,11 @@ extern int page_group_by_mobility_disabl
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
+	/* We stop breaking up pages of this order if less than
+	 * min are available. At that point the pages can only
+	 * be used for allocations of that particular order.
+	 */
+	unsigned long		min;
 };
 
 struct pglist_data;
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1844,7 +1844,12 @@ struct page *__rmqueue_smallest(struct z
 		area = &(zone->free_area[current_order]);
 		page = list_first_entry_or_null(&area->free_list[migratetype],
 							struct page, lru);
-		if (!page)
+		/*
+		 * Continue if no page is found or if our freelist contains
+		 * less than the minimum pages of that order. In that case
+		 * we better look for a different order.
+		 */
+		if (!page || area->nr_free < area->min)
 			continue;
 		list_del(&page->lru);
 		rmv_page_order(page);
@@ -5190,6 +5195,57 @@ static void build_zonelists(pg_data_t *p
 
 #endif	/* CONFIG_NUMA */
 
+int set_page_order_min(int node, int order, unsigned min)
+{
+	int i, o;
+	long min_pages = 0;			/* Pages already reserved */
+	long managed_pages = 0;			/* Pages managed on the node */
+	struct zone *last;
+	unsigned remaining;
+
+	/*
+	 * Determine already reserved memory for orders
+	 * plus the total of the pages on the node
+	 */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z = &NODE_DATA(node)->node_zones[i];
+		if (managed_zone(z)) {
+			for (o = 0; o < MAX_ORDER; o++) {
+				if (o != order)
+					min_pages += z->free_area[o].min << o;
+
+			}
+			managed_pages += z->managed_pages;
+		}
+	}
+
+	if (min_pages + (min << order) > managed_pages / 2)
+		return -ENOMEM;
+
+	/* Set the min values for all zones on the node */
+	remaining = min;
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z = &NODE_DATA(node)->node_zones[i];
+		if (managed_zone(z)) {
+			u64 tmp;
+
+			tmp = (u64)z->managed_pages * (min << order);
+			do_div(tmp, managed_pages);
+			tmp >>= order;
+			z->free_area[order].min = tmp;
+
+			last = z;
+			remaining -= tmp;
+		}
+	}
+
+	/* Deal with rounding errors */
+	if (remaining)
+		last->free_area[order].min += remaining;
+
+	return 0;
+}
+
 /*
  * Boot pageset table. One per cpu which is going to be used for all
  * zones and all nodes. The parameters will be set in such a way
@@ -5424,6 +5480,7 @@ static void __meminit zone_init_free_lis
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
+		zone->free_area[order].min = 0;
 	}
 }
 
@@ -6998,6 +7055,7 @@ static void __setup_per_zone_wmarks(void
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
+	int order;
 	unsigned long flags;
 
 	/* Calculate total number of !ZONE_HIGHMEM pages */
@@ -7012,6 +7070,10 @@ static void __setup_per_zone_wmarks(void
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->managed_pages;
 		do_div(tmp, lowmem_pages);
+
+		for (order = 0; order < MAX_ORDER; order++)
+			tmp += zone->free_area[order].min << order;
+
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -27,6 +27,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/ctype.h>
 
 #include "internal.h"
 
@@ -1614,6 +1615,11 @@ static void zoneinfo_show_print(struct s
 				zone_numa_state_snapshot(zone, i));
 #endif
 
+	for (i = 0; i < MAX_ORDER; i++)
+		if (zone->free_area[i].min)
+			seq_printf(m, "\nPreserve %lu pages of order %d from breaking up.",
+				zone->free_area[i].min, i);
+
 	seq_printf(m, "\n  pagesets");
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
@@ -1641,6 +1647,122 @@ static void zoneinfo_show_print(struct s
 	seq_putc(m, '\n');
 }
 
+static int __order_protect(char *p)
+{
+	char c;
+
+	do {
+		int order = 0;
+		int pages = 0;
+		int node = 0;
+		int rc;
+
+		/* Syntax <order>[N<node>]=number */
+		if (!isdigit(*p))
+			return -EFAULT;
+
+		while (true) {
+			c = *p++;
+
+			if (!isdigit(c))
+				break;
+
+			order = order * 10 + c - '0';
+		}
+
+		/* Check for optional node specification */
+		if (c == 'N') {
+			if (!isdigit(*p))
+				return -EFAULT;
+
+			while (true) {
+				c = *p++;
+				if (!isdigit(c))
+					break;
+				node = node * 10 + c - '0';
+			}
+		}
+
+		if (c != '=')
+			return -EINVAL;
+
+		if (!isdigit(*p))
+			return -EINVAL;
+
+		while (true) {
+			c = *p++;
+			if (!isdigit(c))
+				break;
+			pages = pages * 10 + c - '0';
+		}
+
+		if (order == 0 || order >= MAX_ORDER)
+		       return -EINVAL;
+
+		if (!node_online(node))
+			return -ENOSYS;
+
+		rc = set_page_order_min(node, order, pages);
+		if (rc)
+			return rc;
+
+	} while (c == ',');
+
+	if (c)
+		return -EINVAL;
+
+	setup_per_zone_wmarks();
+
+	return 0;
+}
+
+/*
+ * Writing to /proc/zoneinfo allows to setup the large page breakup
+ * protection.
+ *
+ * Syntax:
+ * 	<order>[N<node>]=<number>{,<order>[N<node>]=<number>}
+ *
+ * F.e. Protecting 500 pages of order 2 (16K on intel) and 300 of
+ * order 4 (64K) on node 1
+ *
+ * 	echo "2=500,4N1=300" >/proc/zoneinfo
+ *
+ */
+static ssize_t zoneinfo_write(struct file *file, const char __user *buffer,
+			size_t count, loff_t *ppos)
+{
+	char zinfo[200];
+	int rc;
+
+	if (count > sizeof(zinfo))
+		return -EINVAL;
+
+	if (copy_from_user(zinfo, buffer, count))
+		return -EFAULT;
+
+	zinfo[count - 1] = 0;
+
+	rc = __order_protect(zinfo);
+
+	if (rc)
+		return rc;
+
+	return count;
+}
+
+static int order_protect(char *s)
+{
+	int rc;
+
+	rc = __order_protect(s);
+	if (rc)
+		printk("Invalid order=%s rc=%d\n",s, rc);
+
+	return 1;
+}
+__setup("order=", order_protect);
+
 /*
  * Output information about zones in @pgdat.  All zones are printed regardless
  * of whether they are populated or not: lowmem_reserve_ratio operates on the
@@ -1672,6 +1794,7 @@ static const struct file_operations zone
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= seq_release,
+	.write		= zoneinfo_write,
 };
 
 enum writeback_stat_item {
@@ -2016,7 +2139,7 @@ void __init init_mm_internals(void)
 	proc_create("buddyinfo", 0444, NULL, &buddyinfo_file_operations);
 	proc_create("pagetypeinfo", 0444, NULL, &pagetypeinfo_file_operations);
 	proc_create("vmstat", 0444, NULL, &vmstat_file_operations);
-	proc_create("zoneinfo", 0444, NULL, &zoneinfo_file_operations);
+	proc_create("zoneinfo", 0644, NULL, &zoneinfo_file_operations);
 #endif
 }
 
Index: linux/include/linux/gfp.h
===================================================================
--- linux.orig/include/linux/gfp.h
+++ linux/include/linux/gfp.h
@@ -543,6 +543,7 @@ void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
 void page_alloc_init_late(void);
+int set_page_order_min(int node, int order, unsigned min);
 
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC 1/2] Protect larger order pages from breaking up
@ 2018-02-16 16:01   ` Christoph Lameter
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2018-02-16 16:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz

[-- Attachment #1: limit_order --]
[-- Type: text/plain, Size: 10086 bytes --]

Over time as the kernel is churning through memory it will break
up larger pages and as time progresses larger contiguous allocations
will no longer be possible. This is an approach to preserve these
large pages and prevent them from being broken up.

This is useful for example for the use of jumbo pages and can
satisfy various needs of subsystems and device drivers that require
large contiguous allocations to operate properly.

The idea is to reserve a pool of pages of the required order
so that the kernel is not allowed to use the pages for allocations
of a different order. This is a pool that is fully integrated
into the page allocator and therefore transparently usable.

Control over this feature is by writing to /proc/zoneinfo.

F.e. to ensure that 2000 16K pages stay available for jumbo
frames do

	echo "2=2000" >/proc/zoneinfo

or through the order=<page spec> on the kernel command line.
F.e.

	order=2=2000,4N2=500

These pages will be subject to reclaim etc as usual but will not
be broken up.

One can then also f.e. operate the slub allocator with
64k pages. Specify "slub_max_order=4 slub_min_order=4" on
the kernel command line and all slab allocator allocations
will occur in 64K page sizes.

Note that this will reduce the memory available to the application
in some cases. Reclaim may occur more often. If more than
the reserved number of higher order pages are being used then
allocations will still fail as normal.

In order to make this work just right one needs to be able to
know the workload well enough to reserve the right amount
of pages. This is comparable to other reservation schemes.

Well that f.e. brings up huge pages. You can of course
also use this to reserve those and can then be sure that
you can dynamically resize your huge page pools even after
a long system uptime.

The idea for this patch came from Thomas Schoebel-Theuer, whom I met
at LCA and who described the approach to me, promising
a patch that would do this. Sadly he has vanished somehow.
However, he has been using this approach to support a
production environment for numerous years.

So I redid his patch and this is the first draft of it.


Idea-by: Thomas Schoebel-Theuer <tst@schoebel-theuer.de>

First performance tests in a virtual environment show
a hackbench improvement of 6% just from increasing
the page size used by the page allocator to order 3.

Signed-off-by: Christopher Lameter <cl@linux.com>

Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -96,6 +96,11 @@ extern int page_group_by_mobility_disabl
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
+	/* We stop breaking up pages of this order if less than
+	 * min are available. At that point the pages can only
+	 * be used for allocations of that particular order.
+	 */
+	unsigned long		min;
 };
 
 struct pglist_data;
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1844,7 +1844,12 @@ struct page *__rmqueue_smallest(struct z
 		area = &(zone->free_area[current_order]);
 		page = list_first_entry_or_null(&area->free_list[migratetype],
 							struct page, lru);
-		if (!page)
+		/*
+		 * Continue if no page is found or if our freelist contains
+		 * less than the minimum pages of that order. In that case
+		 * we better look for a different order.
+		 */
+		if (!page || area->nr_free < area->min)
 			continue;
 		list_del(&page->lru);
 		rmv_page_order(page);
@@ -5190,6 +5195,57 @@ static void build_zonelists(pg_data_t *p
 
 #endif	/* CONFIG_NUMA */
 
+int set_page_order_min(int node, int order, unsigned min)
+{
+	int i, o;
+	long min_pages = 0;			/* Pages already reserved */
+	long managed_pages = 0;			/* Pages managed on the node */
+	struct zone *last;
+	unsigned remaining;
+
+	/*
+	 * Determine already reserved memory for orders
+	 * plus the total of the pages on the node
+	 */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z = &NODE_DATA(node)->node_zones[i];
+		if (managed_zone(z)) {
+			for (o = 0; o < MAX_ORDER; o++) {
+				if (o != order)
+					min_pages += z->free_area[o].min << o;
+
+			}
+			managed_pages += z->managed_pages;
+		}
+	}
+
+	if (min_pages + (min << order) > managed_pages / 2)
+		return -ENOMEM;
+
+	/* Set the min values for all zones on the node */
+	remaining = min;
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z = &NODE_DATA(node)->node_zones[i];
+		if (managed_zone(z)) {
+			u64 tmp;
+
+			tmp = (u64)z->managed_pages * (min << order);
+			do_div(tmp, managed_pages);
+			tmp >>= order;
+			z->free_area[order].min = tmp;
+
+			last = z;
+			remaining -= tmp;
+		}
+	}
+
+	/* Deal with rounding errors */
+	if (remaining)
+		last->free_area[order].min += remaining;
+
+	return 0;
+}
+
 /*
  * Boot pageset table. One per cpu which is going to be used for all
  * zones and all nodes. The parameters will be set in such a way
@@ -5424,6 +5480,7 @@ static void __meminit zone_init_free_lis
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
+		zone->free_area[order].min = 0;
 	}
 }
 
@@ -6998,6 +7055,7 @@ static void __setup_per_zone_wmarks(void
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
+	int order;
 	unsigned long flags;
 
 	/* Calculate total number of !ZONE_HIGHMEM pages */
@@ -7012,6 +7070,10 @@ static void __setup_per_zone_wmarks(void
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->managed_pages;
 		do_div(tmp, lowmem_pages);
+
+		for (order = 0; order < MAX_ORDER; order++)
+			tmp += zone->free_area[order].min << order;
+
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -27,6 +27,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/ctype.h>
 
 #include "internal.h"
 
@@ -1614,6 +1615,11 @@ static void zoneinfo_show_print(struct s
 				zone_numa_state_snapshot(zone, i));
 #endif
 
+	for (i = 0; i < MAX_ORDER; i++)
+		if (zone->free_area[i].min)
+			seq_printf(m, "\nPreserve %lu pages of order %d from breaking up.",
+				zone->free_area[i].min, i);
+
 	seq_printf(m, "\n  pagesets");
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
@@ -1641,6 +1647,122 @@ static void zoneinfo_show_print(struct s
 	seq_putc(m, '\n');
 }
 
+static int __order_protect(char *p)
+{
+	char c;
+
+	do {
+		int order = 0;
+		int pages = 0;
+		int node = 0;
+		int rc;
+
+		/* Syntax <order>[N<node>]=number */
+		if (!isdigit(*p))
+			return -EFAULT;
+
+		while (true) {
+			c = *p++;
+
+			if (!isdigit(c))
+				break;
+
+			order = order * 10 + c - '0';
+		}
+
+		/* Check for optional node specification */
+		if (c == 'N') {
+			if (!isdigit(*p))
+				return -EFAULT;
+
+			while (true) {
+				c = *p++;
+				if (!isdigit(c))
+					break;
+				node = node * 10 + c - '0';
+			}
+		}
+
+		if (c != '=')
+			return -EINVAL;
+
+		if (!isdigit(*p))
+			return -EINVAL;
+
+		while (true) {
+			c = *p++;
+			if (!isdigit(c))
+				break;
+			pages = pages * 10 + c - '0';
+		}
+
+		if (order == 0 || order >= MAX_ORDER)
+		       return -EINVAL;
+
+		if (!node_online(node))
+			return -ENOSYS;
+
+		rc = set_page_order_min(node, order, pages);
+		if (rc)
+			return rc;
+
+	} while (c == ',');
+
+	if (c)
+		return -EINVAL;
+
+	setup_per_zone_wmarks();
+
+	return 0;
+}
+
+/*
+ * Writing to /proc/zoneinfo allows to setup the large page breakup
+ * protection.
+ *
+ * Syntax:
+ * 	<order>[N<node>]=<number>{,<order>[N<node>]=<number>}
+ *
+ * F.e. Protecting 500 pages of order 2 (16K on intel) and 300 of
+ * order 4 (64K) on node 1
+ *
+ * 	echo "2=500,4N1=300" >/proc/zoneinfo
+ *
+ */
+static ssize_t zoneinfo_write(struct file *file, const char __user *buffer,
+			size_t count, loff_t *ppos)
+{
+	char zinfo[200];
+	int rc;
+
+	if (count > sizeof(zinfo))
+		return -EINVAL;
+
+	if (copy_from_user(zinfo, buffer, count))
+		return -EFAULT;
+
+	zinfo[count - 1] = 0;
+
+	rc = __order_protect(zinfo);
+
+	if (rc)
+		return rc;
+
+	return count;
+}
+
+static int order_protect(char *s)
+{
+	int rc;
+
+	rc = __order_protect(s);
+	if (rc)
+		printk("Invalid order=%s rc=%d\n",s, rc);
+
+	return 1;
+}
+__setup("order=", order_protect);
+
 /*
  * Output information about zones in @pgdat.  All zones are printed regardless
  * of whether they are populated or not: lowmem_reserve_ratio operates on the
@@ -1672,6 +1794,7 @@ static const struct file_operations zone
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= seq_release,
+	.write		= zoneinfo_write,
 };
 
 enum writeback_stat_item {
@@ -2016,7 +2139,7 @@ void __init init_mm_internals(void)
 	proc_create("buddyinfo", 0444, NULL, &buddyinfo_file_operations);
 	proc_create("pagetypeinfo", 0444, NULL, &pagetypeinfo_file_operations);
 	proc_create("vmstat", 0444, NULL, &vmstat_file_operations);
-	proc_create("zoneinfo", 0444, NULL, &zoneinfo_file_operations);
+	proc_create("zoneinfo", 0644, NULL, &zoneinfo_file_operations);
 #endif
 }
 
Index: linux/include/linux/gfp.h
===================================================================
--- linux.orig/include/linux/gfp.h
+++ linux/include/linux/gfp.h
@@ -543,6 +543,7 @@ void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
 void page_alloc_init_late(void);
+int set_page_order_min(int node, int order, unsigned min);
 
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC 2/2] Page order diagnostics
  2018-02-16 16:01 [RFC 0/2] Larger Order Protection V1 Christoph Lameter
@ 2018-02-16 16:01   ` Christoph Lameter
  2018-02-16 16:01   ` Christoph Lameter
  2018-02-16 18:27 ` [RFC 0/2] Larger Order Protection V1 Christopher Lameter
  2 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2018-02-16 16:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matthew Wilcox, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Thomas Schoebel-Theuer,
	andi-Vw/NltI1exuRpAAqCnN02g, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz

[-- Attachment #1: order_stats --]
[-- Type: text/plain, Size: 5595 bytes --]

It is beneficial to know about the contiguous memory segments
available on a system and the number of allocations failing
for each page order.

This patch adds detailed per-order statistics to /proc/meminfo
so the current memory use can be determined.

It also adds counters to /proc/vmstat that show allocation
failures for each page order.
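
For illustration (made-up numbers, assuming CONFIG_ORDER_STATS=y),
the new output looks roughly like:

	$ grep "^Order" /proc/meminfo
	Order 0 Pages:     84211
	Order 1 Pages:     10342
	Order 2 Pages:      2048
	...
	$ grep _failure /proc/vmstat
	order0_failure 0
	order1_failure 0
	order2_failure 3
	...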

Signed-off-by: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>

Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -185,6 +185,10 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+#ifdef CONFIG_ORDER_STATS
+	NR_ORDER,
+	NR_ORDER_MAX = NR_ORDER + MAX_ORDER - 1,
+#endif
 	NR_VM_NODE_STAT_ITEMS
 };
 
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -828,6 +828,10 @@ static inline void __free_one_page(struc
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
+#ifdef CONFIG_ORDER_STATS
+	dec_node_page_state(page, NR_ORDER + order);
+#endif
+
 continue_merging:
 	while (order < max_order - 1) {
 		buddy_pfn = __find_buddy_pfn(pfn, order);
@@ -1285,6 +1289,9 @@ static void __init __free_pages_boot_cor
 	page_zone(page)->managed_pages += nr_pages;
 	set_page_refcounted(page);
 	__free_pages(page, order);
+#ifdef CONFIG_ORDER_STATS
+	inc_node_page_state(page, NR_ORDER + order);
+#endif
 }
 
 #if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \
@@ -1855,6 +1862,9 @@ struct page *__rmqueue_smallest(struct z
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
+#ifdef CONFIG_ORDER_STATS
+		inc_node_page_state(page, NR_ORDER + order);
+#endif
 		set_pcppage_migratetype(page, migratetype);
 		return page;
 	}
@@ -4169,6 +4179,11 @@ nopage:
 fail:
 	warn_alloc(gfp_mask, ac->nodemask,
 			"page allocation failure: order:%u", order);
+
+#ifdef CONFIG_ORDER_STATS
+	count_vm_event(ORDER0_ALLOC_FAIL + order);
+#endif
+
 got_pg:
 	return page;
 }
Index: linux/fs/proc/meminfo.c
===================================================================
--- linux.orig/fs/proc/meminfo.c
+++ linux/fs/proc/meminfo.c
@@ -51,6 +51,7 @@ static int meminfo_proc_show(struct seq_
 	long available;
 	unsigned long pages[NR_LRU_LISTS];
 	int lru;
+	int order;
 
 	si_meminfo(&i);
 	si_swapinfo(&i);
@@ -155,6 +156,11 @@ static int meminfo_proc_show(struct seq_
 		    global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
 
+#ifdef CONFIG_ORDER_STATS
+	for (order= 0; order < MAX_ORDER; order++)
+		seq_printf(m, "Order%2d Pages:     %5lu\n",
+			order, global_node_page_state(NR_ORDER + order));
+#endif
 	hugetlb_report_meminfo(m);
 
 	arch_report_meminfo(m);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig
+++ linux/mm/Kconfig
@@ -752,6 +752,15 @@ config PERCPU_STATS
 	  information includes global and per chunk statistics, which can
 	  be used to help understand percpu memory usage.
 
+config ORDER_STATS
+	bool "Statistics for different sized allocations"
+	default n
+	help
+	  Create statistics about the contiguous memory segments allocated
+	  through the page allocator. This creates statistics about the
+	  memory segments in use in /proc/meminfo and the node meminfo files
+	  as well as allocation failure statistics in /proc/vmstat.
+
 config GUP_BENCHMARK
 	bool "Enable infrastructure for get_user_pages_fast() benchmarking"
 	default n
Index: linux/include/linux/vm_event_item.h
===================================================================
--- linux.orig/include/linux/vm_event_item.h
+++ linux/include/linux/vm_event_item.h
@@ -111,6 +111,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		SWAP_RA,
 		SWAP_RA_HIT,
 #endif
+#ifdef CONFIG_ORDER_STATS
+		ORDER0_ALLOC_FAIL,
+		ORDER_MAX_FAIL = ORDER0_ALLOC_FAIL + MAX_ORDER -1,
+#endif
 		NR_VM_EVENT_ITEMS
 };
 
Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1289,6 +1289,52 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#ifdef CONFIG_ORDER_STATS
+	"order0_failure",
+	"order1_failure",
+	"order2_failure",
+	"order3_failure",
+	"order4_failure",
+	"order5_failure",
+	"order6_failure",
+	"order7_failure",
+	"order8_failure",
+	"order9_failure",
+	"order10_failure",
+#ifdef CONFIG_FORCE_MAX_ZONEORDER
+#if MAX_ORDER > 11
+	"order11_failure"
+#endif
+#if MAX_ORDER > 12
+	"order12_failure"
+#endif
+#if MAX_ORDER > 13
+	"order13_failure"
+#endif
+#if MAX_ORDER > 14
+	"order14_failure"
+#endif
+#if MAX_ORDER > 15
+	"order15_failure"
+#endif
+#if MAX_ORDER > 16
+	"order16_failure"
+#endif
+#if MAX_ORDER > 17
+	"order17_failure"
+#endif
+#if MAX_ORDER > 18
+	"order18_failure"
+#endif
+#if MAX_ORDER > 19
+	"order19_failure"
+#endif
+#if MAX_ORDER > 20
+#error Please add more lines...
+#endif
+
+#endif /* CONFIG_FORCE_MAX_ZONEORDER */
+#endif /* CONFIG_ORDER_STATS */
 #endif /* CONFIG_VM_EVENTS_COUNTERS */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC 2/2] Page order diagnostics
@ 2018-02-16 16:01   ` Christoph Lameter
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2018-02-16 16:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz

[-- Attachment #1: order_stats --]
[-- Type: text/plain, Size: 5572 bytes --]

It is beneficial to know about the contiguous memory segments
available on a system and the number of allocations failing
for each page order.

This patch adds detailed per-order statistics to /proc/meminfo
so the current memory use can be determined.

It also adds counters to /proc/vmstat that show allocation
failures for each page order.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -185,6 +185,10 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+#ifdef CONFIG_ORDER_STATS
+	NR_ORDER,
+	NR_ORDER_MAX = NR_ORDER + MAX_ORDER - 1,
+#endif
 	NR_VM_NODE_STAT_ITEMS
 };
 
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -828,6 +828,10 @@ static inline void __free_one_page(struc
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
+#ifdef CONFIG_ORDER_STATS
+	dec_node_page_state(page, NR_ORDER + order);
+#endif
+
 continue_merging:
 	while (order < max_order - 1) {
 		buddy_pfn = __find_buddy_pfn(pfn, order);
@@ -1285,6 +1289,9 @@ static void __init __free_pages_boot_cor
 	page_zone(page)->managed_pages += nr_pages;
 	set_page_refcounted(page);
 	__free_pages(page, order);
+#ifdef CONFIG_ORDER_STATS
+	inc_node_page_state(page, NR_ORDER + order);
+#endif
 }
 
 #if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \
@@ -1855,6 +1862,9 @@ struct page *__rmqueue_smallest(struct z
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
+#ifdef CONFIG_ORDER_STATS
+		inc_node_page_state(page, NR_ORDER + order);
+#endif
 		set_pcppage_migratetype(page, migratetype);
 		return page;
 	}
@@ -4169,6 +4179,11 @@ nopage:
 fail:
 	warn_alloc(gfp_mask, ac->nodemask,
 			"page allocation failure: order:%u", order);
+
+#ifdef CONFIG_ORDER_STATS
+	count_vm_event(ORDER0_ALLOC_FAIL + order);
+#endif
+
 got_pg:
 	return page;
 }
Index: linux/fs/proc/meminfo.c
===================================================================
--- linux.orig/fs/proc/meminfo.c
+++ linux/fs/proc/meminfo.c
@@ -51,6 +51,7 @@ static int meminfo_proc_show(struct seq_
 	long available;
 	unsigned long pages[NR_LRU_LISTS];
 	int lru;
+	int order;
 
 	si_meminfo(&i);
 	si_swapinfo(&i);
@@ -155,6 +156,11 @@ static int meminfo_proc_show(struct seq_
 		    global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
 
+#ifdef CONFIG_ORDER_STATS
+	for (order= 0; order < MAX_ORDER; order++)
+		seq_printf(m, "Order%2d Pages:     %5lu\n",
+			order, global_node_page_state(NR_ORDER + order));
+#endif
 	hugetlb_report_meminfo(m);
 
 	arch_report_meminfo(m);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig
+++ linux/mm/Kconfig
@@ -752,6 +752,15 @@ config PERCPU_STATS
 	  information includes global and per chunk statistics, which can
 	  be used to help understand percpu memory usage.
 
+config ORDER_STATS
+	bool "Statistics for different sized allocations"
+	default n
+	help
+	  Create statistics about the contiguous memory segments allocated
+	  through the page allocator. This creates statistics about the
+	  memory segments in use in /proc/meminfo and the node meminfo files
+	  as well as allocation failure statistics in /proc/vmstat.
+
 config GUP_BENCHMARK
 	bool "Enable infrastructure for get_user_pages_fast() benchmarking"
 	default n
Index: linux/include/linux/vm_event_item.h
===================================================================
--- linux.orig/include/linux/vm_event_item.h
+++ linux/include/linux/vm_event_item.h
@@ -111,6 +111,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		SWAP_RA,
 		SWAP_RA_HIT,
 #endif
+#ifdef CONFIG_ORDER_STATS
+		ORDER0_ALLOC_FAIL,
+		ORDER_MAX_FAIL = ORDER0_ALLOC_FAIL + MAX_ORDER -1,
+#endif
 		NR_VM_EVENT_ITEMS
 };
 
Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1289,6 +1289,52 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#ifdef CONFIG_ORDER_STATS
+	"order0_failure",
+	"order1_failure",
+	"order2_failure",
+	"order3_failure",
+	"order4_failure",
+	"order5_failure",
+	"order6_failure",
+	"order7_failure",
+	"order8_failure",
+	"order9_failure",
+	"order10_failure",
+#ifdef CONFIG_FORCE_MAX_ZONEORDER
+#if MAX_ORDER > 11
+	"order11_failure"
+#endif
+#if MAX_ORDER > 12
+	"order12_failure"
+#endif
+#if MAX_ORDER > 13
+	"order13_failure"
+#endif
+#if MAX_ORDER > 14
+	"order14_failure"
+#endif
+#if MAX_ORDER > 15
+	"order15_failure"
+#endif
+#if MAX_ORDER > 16
+	"order16_failure"
+#endif
+#if MAX_ORDER > 17
+	"order17_failure"
+#endif
+#if MAX_ORDER > 18
+	"order18_failure"
+#endif
+#if MAX_ORDER > 19
+	"order19_failure"
+#endif
+#if MAX_ORDER > 20
+#error Please add more lines...
+#endif
+
+#endif /* CONFIG_FORCE_MAX_ZONEORDER */
+#endif /* CONFIG_ORDER_STATS */
 #endif /* CONFIG_VM_EVENTS_COUNTERS */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-16 16:01   ` Christoph Lameter
@ 2018-02-16 17:03       ` Andi Kleen
  -1 siblings, 0 replies; 38+ messages in thread
From: Andi Kleen @ 2018-02-16 17:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Matthew Wilcox, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Thomas Schoebel-Theuer,
	andi-Vw/NltI1exuRpAAqCnN02g, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz

> First performance tests in a virtual environment show
> a hackbench improvement of 6% just from increasing
> the page size used by the page allocator to order 3.

So why is hackbench improving? Is that just for kernel stacks?

-Andi

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
@ 2018-02-16 17:03       ` Andi Kleen
  0 siblings, 0 replies; 38+ messages in thread
From: Andi Kleen @ 2018-02-16 17:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz

> First performance tests in a virtual environment show
> a hackbench improvement of 6% just from increasing
> the page size used by the page allocator to order 3.

So why is hackbench improving? Is that just for kernel stacks?

-Andi


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-16 16:01   ` Christoph Lameter
  (?)
  (?)
@ 2018-02-16 18:02   ` Randy Dunlap
       [not found]     ` <b76028c6-c755-8178-2dfc-81c7db1f8bed-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  -1 siblings, 1 reply; 38+ messages in thread
From: Randy Dunlap @ 2018-02-16 18:02 UTC (permalink / raw)
  To: Christoph Lameter, Mel Gorman
  Cc: Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz

On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> Control over this feature is by writing to /proc/zoneinfo.
> 
> F.e. to ensure that 2000 16K pages stay available for jumbo
> frames do
> 
> 	echo "2=2000" >/proc/zoneinfo
> 
> or through the order=<page spec> on the kernel command line.
> F.e.
> 
> 	order=2=2000,4N2=500


Please document the kernel command line option in
Documentation/admin-guide/kernel-parameters.txt.

I suppose that /proc/zoneinfo should be added somewhere in Documentation/vm/
but I'm not sure where that would be.
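
Something along these lines perhaps, just as a sketch (exact wording
and placement up to you):

	order=		[KNL]
			Format: <order>[N<node>]=<count>{,<order>[N<node>]=<count>}
			Reserve a minimum number of free pages of the given
			order on the given node (default node 0) so that they
			are not broken up into lower order allocations.
			Example: order=2=2000,4N2=500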

thanks,
-- 
~Randy


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-16 17:03       ` Andi Kleen
@ 2018-02-16 18:25           ` Christopher Lameter
  -1 siblings, 0 replies; 38+ messages in thread
From: Christopher Lameter @ 2018-02-16 18:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mel Gorman, Matthew Wilcox, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Thomas Schoebel-Theuer,
	Rik van Riel, Michal Hocko, Guy Shattah, Anshuman Khandual,
	Michal Nazarewicz, Vlastimil Babka, David Nellans, Laura Abbott,
	Pavel Machek, Dave Hansen, Mike Kravetz

On Fri, 16 Feb 2018, Andi Kleen wrote:

> > First performance tests in a virtual environment show
> > a hackbench improvement of 6% just from increasing
> > the page size used by the page allocator to order 3.
>
> So why is hackbench improving? Is that just for kernel stacks?

Less stack overhead. The larger the page size, the less metadata needs to be
handled. The freelists get larger and the chance of hitting the per-cpu
freelist increases.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
@ 2018-02-16 18:25           ` Christopher Lameter
  0 siblings, 0 replies; 38+ messages in thread
From: Christopher Lameter @ 2018-02-16 18:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mel Gorman, Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, Rik van Riel, Michal Hocko, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen,
	Mike Kravetz

On Fri, 16 Feb 2018, Andi Kleen wrote:

> > First performance tests in a virtual environment show
> > a hackbench improvement of 6% just from increasing
> > the page size used by the page allocator to order 3.
>
> So why is hackbench improving? Is that just for kernel stacks?

Less stack overhead. The larger the page size, the less metadata needs to be
handled. The freelists get larger and the chance of hitting the per-cpu
freelist increases.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 0/2] Larger Order Protection V1
  2018-02-16 16:01 [RFC 0/2] Larger Order Protection V1 Christoph Lameter
  2018-02-16 16:01   ` Christoph Lameter
  2018-02-16 16:01   ` Christoph Lameter
@ 2018-02-16 18:27 ` Christopher Lameter
  2 siblings, 0 replies; 38+ messages in thread
From: Christopher Lameter @ 2018-02-16 18:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz


Why are the patches not making it to linux-mm? They are on other mailing lists.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-16 16:01   ` Christoph Lameter
                     ` (2 preceding siblings ...)
  (?)
@ 2018-02-16 18:59   ` Mike Kravetz
       [not found]     ` <5108eb20-2b20-bd48-903e-bce312e96974-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  -1 siblings, 1 reply; 38+ messages in thread
From: Mike Kravetz @ 2018-02-16 18:59 UTC (permalink / raw)
  To: Christoph Lameter, Mel Gorman
  Cc: Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen

On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> Over time as the kernel is churning through memory it will break
> up larger pages and as time progresses larger contiguous allocations
> will no longer be possible. This is an approach to preserve these
> large pages and prevent them from being broken up.
> 
> This is useful for example for the use of jumbo pages and can
> satisfy various needs of subsystems and device drivers that require
> large contiguous allocations to operate properly.
> 
> The idea is to reserve a pool of pages of the required order
> so that the kernel is not allowed to use the pages for allocations
> of a different order. This is a pool that is fully integrated
> into the page allocator and therefore transparently usable.
> 
> Control over this feature is by writing to /proc/zoneinfo.
> 
> F.e. to ensure that 2000 16K pages stay available for jumbo
> frames do
> 
> 	echo "2=2000" >/proc/zoneinfo
> 
> or through the order=<page spec> on the kernel command line.
> F.e.
> 
> 	order=2=2000,4N2=500
> 
> These pages will be subject to reclaim etc as usual but will not
> be broken up.
> 
> One can then also f.e. operate the slub allocator with
> 64k pages. Specify "slub_max_order=4 slub_min_order=4" on
> the kernel command line and all slab allocator allocations
> will occur in 64K page sizes.
> 
> Note that this will reduce the memory available to the application
> in some cases. Reclaim may occur more often. If more than
> the reserved number of higher order pages are being used then
> allocations will still fail as normal.
> 
> In order to make this work just right one needs to be able to
> know the workload well enough to reserve the right amount
> of pages. This is comparable to other reservation schemes.

Yes.

I like the idea that this only comes into play as the result of explicit
user/sysadmin action.  It does remind me of hugetlbfs reservations.  So,
we hope that only people who really know their workload and know what
they are doing would use this feature.

> Well that f.e. brings up huge pages. You can of course
> also use this to reserve those and can then be sure that
> you can dynamically resize your huge page pools even after
> a long system uptime.

Yes, and no.  Doesn't that assume nobody else is doing allocations
of that size?  For example, I could imagine THP using huge page sized
reservations.  Then when it comes time to resize your hugetlbfs pool
there may not be enough.  Although, we may quickly split THP pages
in this case.  I am not sure.

IIRC, Guy Shattah's use case was for allocations greater than MAX_ORDER.
This would not directly address that.  A huge contiguous area (2GB) is
the 'sweet spot' for best performance in his case.  However, I think he
could still benefit from using a set of larger (such as 2MB) size
allocations which this scheme could help with.

-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-16 16:01   ` Christoph Lameter
@ 2018-02-16 19:01       ` Dave Hansen
  -1 siblings, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2018-02-16 19:01 UTC (permalink / raw)
  To: Christoph Lameter, Mel Gorman
  Cc: Matthew Wilcox, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Thomas Schoebel-Theuer,
	andi-Vw/NltI1exuRpAAqCnN02g, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Mike Kravetz

On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> In order to make this work just right one needs to be able to
> know the workload well enough to reserve the right amount
> of pages. This is comparable to other reservation schemes.

Yes, but it's a reservation scheme that doesn't show up in MemFree, for
instance.  Even hugetlbfs-reserved memory subtracts from that.

This has the potential to be really confusing to apps.  If this memory
is now not available to normal apps, they might plow into the invisible
memory limits and get into nasty reclaim scenarios.

Shouldn't this subtract the memory for MemFree and friends?
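
Just to sketch what I mean (untested, and relying on the free_area->min
field that patch 1 introduces), the reservation could be summed up and
subtracted from the reported free pages:

	/*
	 * Upper bound on the pages covered by the per-order
	 * reservations, summed over all populated zones.
	 */
	static unsigned long order_reserved_pages(void)
	{
		struct zone *zone;
		unsigned long reserved = 0;
		int order;

		for_each_populated_zone(zone)
			for (order = 0; order < MAX_ORDER; order++)
				reserved += zone->free_area[order].min << order;

		return reserved;
	}

MemFree (and probably MemAvailable) would then report the raw free page
count minus this number, clamped at zero, instead of plain NR_FREE_PAGES.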

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
@ 2018-02-16 19:01       ` Dave Hansen
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2018-02-16 19:01 UTC (permalink / raw)
  To: Christoph Lameter, Mel Gorman
  Cc: Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Mike Kravetz

On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> In order to make this work just right one needs to be able to
> know the workload well enough to reserve the right amount
> of pages. This is comparable to other reservation schemes.

Yes, but it's a reservation scheme that doesn't show up in MemFree, for
instance.  Even hugetlbfs-reserved memory subtracts from that.

This has the potential to be really confusing to apps.  If this memory
is now not available to normal apps, they might plow into the invisible
memory limits and get into nasty reclaim scenarios.

Shouldn't this subtract the memory for MemFree and friends?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-16 18:59   ` Mike Kravetz
@ 2018-02-16 20:13         ` Christopher Lameter
  0 siblings, 0 replies; 38+ messages in thread
From: Christopher Lameter @ 2018-02-16 20:13 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Mel Gorman, Matthew Wilcox, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Thomas Schoebel-Theuer,
	andi-Vw/NltI1exuRpAAqCnN02g, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen

On Fri, 16 Feb 2018, Mike Kravetz wrote:

> > Well that f.e. brings up huge pages. You can of course
> > also use this to reserve those and can then be sure that
> > you can dynamically resize your huge page pools even after
> > a long system uptime.
>
> Yes, and no.  Doesn't that assume nobody else is doing allocations
> of that size?  For example, I could imagine THP using huge page sized
> reservations.  Then when it comes time to resize your hugetlbfs pool
> there may not be enough.  Although, we may quickly split THP pages
> in this case.  I am not sure.

Yup it has a pool for everyone. Question is how to divide the loot ;-)

> IIRC, Guy Shattah's use case was for allocations greater than MAX_ORDER.
> This would not directly address that.  A huge contiguous area (2GB) is
> the 'sweet spot' for best performance in his case.  However, I think he
> could still benefit from using a set of larger (such as 2MB) size
> allocations which this scheme could help with.

MAX_ORDER can be increased to allow for larger allocations. IA64 has f.e.
a much larger MAX_ORDER size. So does powerpc. And then the reservation
scheme will work.
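
To put rough numbers on it (assuming 4K base pages): the default
MAX_ORDER of 11 caps buddy allocations at 2^10 pages = 4MB, so a 2GB
contiguous area would need order 19 blocks, i.e. CONFIG_FORCE_MAX_ZONEORDER
set to at least 20 on an architecture that allows raising it.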



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
@ 2018-02-16 20:13         ` Christopher Lameter
  0 siblings, 0 replies; 38+ messages in thread
From: Christopher Lameter @ 2018-02-16 20:13 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Mel Gorman, Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen

On Fri, 16 Feb 2018, Mike Kravetz wrote:

> > Well that f.e. brings up huge pages. You can of course
> > also use this to reserve those and can then be sure that
> > you can dynamically resize your huge page pools even after
> > a long system uptime.
>
> Yes, and no.  Doesn't that assume nobody else is doing allocations
> of that size?  For example, I could imagine THP using huge page sized
> reservations.  Then when it comes time to resize your hugetlbfs pool
> there may not be enough.  Although, we may quickly split THP pages
> in this case.  I am not sure.

Yup it has a pool for everyone. Question is how to divide the loot ;-)

> IIRC, Guy Shattah's use case was for allocations greater than MAX_ORDER.
> This would not directly address that.  A huge contiguous area (2GB) is
> the 'sweet spot' for best performance in his case.  However, I think he
> could still benefit from using a set of larger (such as 2MB) size
> allocations which this scheme could help with.

MAX_ORDER can be increased to allow for larger allocations. IA64 has f.e.
a much larger MAX_ORDER size. So does powerpc. And then the reservation
scheme will work.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-16 19:01       ` Dave Hansen
  (?)
@ 2018-02-16 20:15       ` Christopher Lameter
  2018-02-16 21:08           ` Dave Hansen
  -1 siblings, 1 reply; 38+ messages in thread
From: Christopher Lameter @ 2018-02-16 20:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Mike Kravetz

On Fri, 16 Feb 2018, Dave Hansen wrote:

> On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> > In order to make this work just right one needs to be able to
> > know the workload well enough to reserve the right amount
> > of pages. This is comparable to other reservation schemes.
>
> Yes, but it's a reservation scheme that doesn't show up in MemFree, for
> instance.  Even hugetlbfs-reserved memory subtracts from that.

Ok. There is the question of whether we can get all these reservation schemes
under one hood instead of having page order specific ones in subsystems
like hugetlb.

> This has the potential to be really confusing to apps.  If this memory
> is now not available to normal apps, they might plow into the invisible
> memory limits and get into nasty reclaim scenarios.

> Shouldn't this subtract the memory for MemFree and friends?

Ok certainly we could do that. But on the other hand the memory is
available if those subsystems ask for the right order. It's not clear to me
what the right way of handling this is. Right now it adds the reserved
pages to the watermarks. But then under some circumstances the memory is
available. What is the best solution here?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
@ 2018-02-16 21:08           ` Dave Hansen
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2018-02-16 21:08 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Mel Gorman, Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Mike Kravetz

On 02/16/2018 12:15 PM, Christopher Lameter wrote:
>> This has the potential to be really confusing to apps.  If this memory
>> is now not available to normal apps, they might plow into the invisible
>> memory limits and get into nasty reclaim scenarios.
>> Shouldn't this subtract the memory for MemFree and friends?
> Ok certainly we could do that. But on the other hand the memory is
> available if those subsystems ask for the right order. Its not clear to me
> what the right way of handling this is. Right now it adds the reserved
> pages to the watermarks. But then under some circumstances the memory is
> available. What is the best solution here?

There's definitely no perfect solution.

But, in general, I think we should cater to the dumbest users.  Folks
doing higher-order allocations are not that.  I say we make the picture
the most clear for the traditional 4k users.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-16 21:08           ` Dave Hansen
  (?)
@ 2018-02-16 21:43           ` Matthew Wilcox
       [not found]             ` <20180216214353.GA32655-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
  -1 siblings, 1 reply; 38+ messages in thread
From: Matthew Wilcox @ 2018-02-16 21:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christopher Lameter, Mel Gorman, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Mike Kravetz

On Fri, Feb 16, 2018 at 01:08:11PM -0800, Dave Hansen wrote:
> On 02/16/2018 12:15 PM, Christopher Lameter wrote:
> >> This has the potential to be really confusing to apps.  If this memory
> >> is now not available to normal apps, they might plow into the invisible
> >> memory limits and get into nasty reclaim scenarios.
> >> Shouldn't this subtract the memory for MemFree and friends?
> > Ok certainly we could do that. But on the other hand the memory is
> > available if those subsystems ask for the right order. Its not clear to me
> > what the right way of handling this is. Right now it adds the reserved
> > pages to the watermarks. But then under some circumstances the memory is
> > available. What is the best solution here?
> 
> There's definitely no perfect solution.
> 
> But, in general, I think we should cater to the dumbest users.  Folks
> doing higher-order allocations are not that.  I say we make the picture
> the most clear for the traditional 4k users.

Your way might be confusing -- if there's a system which is under varying
amounts of jumboframe load and all the 16k pages get gobbled up by the
ethernet driver, MemFree won't change at all, for example.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
@ 2018-02-16 21:47                 ` Dave Hansen
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2018-02-16 21:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christopher Lameter, Mel Gorman, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Mike Kravetz

On 02/16/2018 01:43 PM, Matthew Wilcox wrote:
>> There's definitely no perfect solution.
>>
>> But, in general, I think we should cater to the dumbest users.  Folks
>> doing higher-order allocations are not that.  I say we make the picture
>> the most clear for the traditional 4k users.
> Your way might be confusing -- if there's a system which is under varying
> amounts of jumboframe load and all the 16k pages get gobbled up by the
> ethernet driver, MemFree won't change at all, for example.

IOW, you agree that "there's definitely no perfect solution." :)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
@ 2018-02-17 16:07         ` Mike Rapoport
  0 siblings, 0 replies; 38+ messages in thread
From: Mike Rapoport @ 2018-02-17 16:07 UTC (permalink / raw)
  To: Randy Dunlap, Christoph Lameter, Mel Gorman
  Cc: Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz



On February 16, 2018 7:02:53 PM GMT+01:00, Randy Dunlap <rdunlap@infradead.org> wrote:
>On 02/16/2018 08:01 AM, Christoph Lameter wrote:
>> Control over this feature is by writing to /proc/zoneinfo.
>> 
>> F.e. to ensure that 2000 16K pages stay available for jumbo
>> frames do
>> 
>> 	echo "2=2000" >/proc/zoneinfo
>> 
>> or through the order=<page spec> on the kernel command line.
>> F.e.
>> 
>> 	order=2=2000,4N2=500
>
>
>Please document the kernel command line option in
>Documentation/admin-guide/kernel-parameters.txt.
>
>I suppose that /proc/zoneinfo should be added somewhere in
>Documentation/vm/
>but I'm not sure where that would be.

It's in Documentation/sysctl/vm.txt and in 'man proc' [1]

[1] https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man5/proc.5

>thanks,

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 2/2] Page order diagnostics
@ 2018-02-17 21:17       ` Pavel Machek
  0 siblings, 0 replies; 38+ messages in thread
From: Pavel Machek @ 2018-02-17 21:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Dave Hansen,
	Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 1178 bytes --]

Hi!

> @@ -1289,6 +1289,52 @@ const char * const vmstat_text[] = {
>  	"swap_ra",
>  	"swap_ra_hit",
>  #endif
> +#ifdef CONFIG_ORDER_STATS
> +	"order0_failure",
> +	"order1_failure",
> +	"order2_failure",
> +	"order3_failure",
> +	"order4_failure",
> +	"order5_failure",
> +	"order6_failure",
> +	"order7_failure",
> +	"order8_failure",
> +	"order9_failure",
> +	"order10_failure",
> +#ifdef CONFIG_FORCE_MAX_ZONEORDER
> +#if MAX_ORDER > 11
> +	"order11_failure"
> +#endif
> +#if MAX_ORDER > 12
> +	"order12_failure"
> +#endif
> +#if MAX_ORDER > 13
> +	"order13_failure"
> +#endif
> +#if MAX_ORDER > 14
> +	"order14_failure"
> +#endif
> +#if MAX_ORDER > 15
> +	"order15_failure"
> +#endif
> +#if MAX_ORDER > 16
> +	"order16_failure"
> +#endif
> +#if MAX_ORDER > 17
> +	"order17_failure"
> +#endif
> +#if MAX_ORDER > 18
> +	"order18_failure"
> +#endif
> +#if MAX_ORDER > 19
> +	"order19_failure"
> +#endif

I don't think this does what you want it to do. Commas are missing.
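
For the record, the reason this matters: adjacent string literals in C are
concatenated, so without the commas "order11_failure" through
"order19_failure" collapse into a single vmstat_text entry and every later
counter name shifts. A corrected excerpt would look like this (sketch; the
closing #endif placement is assumed):

#ifdef CONFIG_FORCE_MAX_ZONEORDER
#if MAX_ORDER > 11
	"order11_failure",
#endif
#if MAX_ORDER > 12
	"order12_failure",
#endif
	/* ... same pattern, with a trailing comma, up to ... */
#if MAX_ORDER > 19
	"order19_failure",
#endif
#endif /* CONFIG_FORCE_MAX_ZONEORDER */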
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC 1/2] Protect larger order pages from breaking up
@ 2018-02-18  9:00           ` Guy Shattah
  0 siblings, 0 replies; 38+ messages in thread
From: Guy Shattah @ 2018-02-18  9:00 UTC (permalink / raw)
  To: Christopher Lameter, Mike Kravetz
  Cc: Mel Gorman, Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen

> 
> Yup it has a pool for everyone. Question is how to divide the loot ;-)
> 
> > IIRC, Guy Shattah's use case was for allocations greater than MAX_ORDER.
> > This would not directly address that.  A huge contiguous area (2GB) is
> > the sweet spot' for best performance in his case.  However, I think he
> > could still benefit from using a set of larger (such as 2MB) size
> > allocations which this scheme could help with.
> 
> MAX_ORDER can be increased to allow for larger allocations. IA64 has f.e.
> a much larger MAX_ORDER size. So does powerpc. And then the reservation
> scheme will work.
> 

MAX_ORDER can be increased only if the kernel is recompiled.
It won't work for code running in the general case / for the typical user.
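
For context (quoting the mmzone.h definition from memory, so treat the
exact text as approximate), MAX_ORDER is a compile-time constant, which is
why it cannot be raised on a deployed kernel:

/* include/linux/mmzone.h (approximate) */
#ifndef CONFIG_FORCE_MAX_ZONEORDER
#define MAX_ORDER 11
#else
#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
#endif
#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))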


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-16 16:01   ` Christoph Lameter
                     ` (3 preceding siblings ...)
  (?)
@ 2018-02-19 10:19   ` Mel Gorman
  2018-02-19 14:42     ` Michal Hocko
                       ` (2 more replies)
  -1 siblings, 3 replies; 38+ messages in thread
From: Mel Gorman @ 2018-02-19 10:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz

My skynet.ie/csn.ul.ie address has been defunct for quite some time.
Mail sent to it is not guaranteed to get to me.

On Fri, Feb 16, 2018 at 10:01:11AM -0600, Christoph Lameter wrote:
> Over time as the kernel is churning through memory it will break
> up larger pages and as time progresses larger contiguous allocations
> will no longer be possible. This is an approach to preserve these
> large pages and prevent them from being broken up.
> 
> <SNIP>
> Idea-by: Thomas Schoebel-Theuer <tst@schoebel-theuer.de>
> 
> First performance tests in a virtual enviroment show
> a hackbench improvement by 6% just by increasing
> the page size used by the page allocator to order 3.
> 

The phrasing here is confusing. hackbench is not very intensive in terms of
memory, it's more fork intensive where I find it extremely unlikely that
it would hit problems with fragmentation unless memory was deliberately
fragmented first. Furthermore, the phrasing implies that the minimum order
used by the page allocator is order 3 which is not what the patch appears
to do.

> Signed-off-by: Christopher Lameter <cl@linux.com>
> 
> Index: linux/include/linux/mmzone.h
> ===================================================================
> --- linux.orig/include/linux/mmzone.h
> +++ linux/include/linux/mmzone.h
> @@ -96,6 +96,11 @@ extern int page_group_by_mobility_disabl
>  struct free_area {
>  	struct list_head	free_list[MIGRATE_TYPES];
>  	unsigned long		nr_free;
> +	/* We stop breaking up pages of this order if less than
> +	 * min are available. At that point the pages can only
> +	 * be used for allocations of that particular order.
> +	 */
> +	unsigned long		min;
>  };
>  
>  struct pglist_data;
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c
> +++ linux/mm/page_alloc.c
> @@ -1844,7 +1844,12 @@ struct page *__rmqueue_smallest(struct z
>  		area = &(zone->free_area[current_order]);
>  		page = list_first_entry_or_null(&area->free_list[migratetype],
>  							struct page, lru);
> -		if (!page)
> +		/*
> +		 * Continue if no page is found or if our freelist contains
> +		 * less than the minimum pages of that order. In that case
> +		 * we better look for a different order.
> +		 */
> +		if (!page || area->nr_free < area->min)
>  			continue;
>  		list_del(&page->lru);
>  		rmv_page_order(page);

This is surprising to say the least. Assuming reservations are at order-3,
this would refuse to split order-3 even if there was sufficient reserved
pages at higher orders for a reserve. This will cause splits of higher
orders unnecessarily which could cause other fragmentation-related issues
in the future.

This is similar to a memory pool except it's not. There is no concept of a
user of high-order reserves accounting for it. Hence, a user of high-order
pages could allocate the reserve multiple times for long-term purposes
while starving other allocation requests. This could easily happen for slub
with min_order set to the same order as the reserve causing potential OOM
issues. If a pool is to be created, it should be a real pool even if it's
transparently accessed through the page allocator. It should allocate the
requested number of pages and either decide to refill if possible or pass
requests through to the page allocator when the pool is depleted. Also,
as it stands, an OOM due to the reserve would be confusing as there is no
hint the failure may have been due to the reserve.

Access to the pool is unprotected so you might create a reserve for jumbo
frames only to have them consumed by something else entirely. It's not
clear if that is even fixable as GFP flags are too coarse.

It is not covered in the changelog why MIGRATE_HIGHATOMIC was not
sufficient for jumbo frames which are generally expected to be allocated
from atomic context. If there is a problem there then maybe
MIGRATE_HIGHATOMIC should be made more strict instead of a hack like
this. It'll be very difficult, if not impossible, for this to be tuned
properly.

Finally, while I accept that fragmentation over time is a problem for
unmovable allocations (fragmentation protection was originally designed
for THP/hugetlbfs), this is papering over the problem. If greater
protections are needed then the right approach is to be more strict about
fallbacks. Specifically, unmovable allocations should migrate all movable
pages out of migrate_unmovable pageblocks before falling back and that
can be controlled by policy due to the overhead of migration. For atomic
allocations, allow fallback but use kcompact or a workqueue to migrate
movable pages out of migrate_unmovable pageblocks to limit fallbacks in
the future.

I'm not a fan of this patch.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-19 10:19   ` Mel Gorman
@ 2018-02-19 14:42     ` Michal Hocko
  2018-02-19 15:09     ` Christopher Lameter
  2018-02-22 21:19     ` Thomas Schoebel-Theuer
  2 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2018-02-19 14:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen,
	Mike Kravetz

On Mon 19-02-18 10:19:35, Mel Gorman wrote:
[...]
> Access to the pool is unprotected so you might create a reserve for jumbo
> frames only to have them consumed by something else entirely. It's not
> clear if that is even fixable as GFP flags are too coarse.
> 
> It is not covered in the changelog why MIGRATE_HIGHATOMIC was not
> sufficient for jumbo frames which are generally expected to be allocated
> from atomic context. If there is a problem there then maybe
> MIGRATE_HIGHATOMIC should be made more strict instead of a hack like
> this. It'll be very difficult, if not impossible, for this to be tuned
> properly.
> 
> Finally, while I accept that fragmentation over time is a problem for
> unmovable allocations (fragmentation protection was originally designed
> for THP/hugetlbfs), this is papering over the problem. If greater
> protections are needed then the right approach is to be more strict about
> fallbacks. Specifically, unmovable allocations should migrate all movable
> pages out of migrate_unmovable pageblocks before falling back and that
> can be controlled by policy due to the overhead of migration. For atomic
> allocations, allow fallback but use kcompact or a workqueue to migrate
> movable pages out of migrate_unmovable pageblocks to limit fallbacks in
> the future.

Completely agreed!

> I'm not a fan of this patch.

Yes, I think the approach is just wrong. It will just hit all sorts of
weird corner cases and won't work reliably for those who care.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 2/2] Page order diagnostics
  2018-02-17 21:17       ` Pavel Machek
  (?)
@ 2018-02-19 14:54       ` Christopher Lameter
  -1 siblings, 0 replies; 38+ messages in thread
From: Christopher Lameter @ 2018-02-19 14:54 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Mel Gorman, Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Dave Hansen,
	Mike Kravetz

On Sat, 17 Feb 2018, Pavel Machek wrote:

> I don't think this does what you want it to do. Commas are missing.

Right, never tested on anything but x86.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-19 10:19   ` Mel Gorman
  2018-02-19 14:42     ` Michal Hocko
@ 2018-02-19 15:09     ` Christopher Lameter
  2018-02-22 21:19     ` Thomas Schoebel-Theuer
  2 siblings, 0 replies; 38+ messages in thread
From: Christopher Lameter @ 2018-02-19 15:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matthew Wilcox, linux-mm, linux-rdma, akpm,
	Thomas Schoebel-Theuer, andi, Rik van Riel, Michal Hocko,
	Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz

On Mon, 19 Feb 2018, Mel Gorman wrote:

> The phrasing here is confusing. hackbench is not very intensive in terms of
> memory, it's more fork intensive where I find it extremely unlikely that
> it would hit problems with fragmentation unless memory was deliberately
> fragmented first. Furthermore, the phrasing implies that the minimum order
> used by the page allocator is order 3 which is not what the patch appears
> to do.

It was used to illustrate the performance gain.

> > -		if (!page)
> > +		/*
> > +		 * Continue if no page is found or if our freelist contains
> > +		 * less than the minimum pages of that order. In that case
> > +		 * we better look for a different order.
> > +		 */
> > +		if (!page || area->nr_free < area->min)
> >  			continue;
> >  		list_del(&page->lru);
> >  		rmv_page_order(page);
>
> This is surprising to say the least. Assuming reservations are at order-3,
> this would refuse to split order-3 even if there was sufficient reserved
> pages at higher orders for a reserve. This will cause splits of higher
> orders unnecessarily which could cause other fragmentation-related issues
> in the future.

Well that is intended. We want to preserve a number of pages at a certain
order. If there are higher order pages available then those can be split
and the allocation will succeed while preserving the minimum number of
pages at the reserved order.

> This is similar to a memory pool except it's not. There is no concept of a
> user of high-order reserves accounting for it. Hence, a user of high-order
> pages could allocate the reserve multiple times for long-term purposes
> while starving other allocation requests. This could easily happen for slub
> with min_order set to the same order as the reserve causing potential OOM
> issues. If a pool is to be created, it should be a real pool even if it's
> transparently accessed through the page allocator. It should allocate the
> requested number of pages and either decide to refill is possible or pass
> requests through to the page allocator when the pool is depleted. Also,
> as it stands, an OOM due to the reserve would be confusing as there is no
> hint the failure may have been due to the reserve.

Ok, we can add the ->min values to the OOM report.
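
A minimal sketch of what that could look like, assuming it is called from
the usual show_free_areas()/OOM reporting path (the helper name is made
up):

static void show_order_reserves(struct zone *zone)
{
	unsigned int order;

	for (order = 0; order < MAX_ORDER; order++) {
		struct free_area *area = &zone->free_area[order];

		if (area->min)
			pr_info("%s: order %u: %lu free, %lu reserved min\n",
				zone->name, order,
				area->nr_free, area->min);
	}
}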

This is a crude approach, I agree, and it does require knowledge of the load
and user patterns. However, what other approach is there to allow the
system to sustain higher order allocations if those are needed? This is an
issue for which no satisfactory solution is present. So a measure like
this would allow a limited use in some situations.

> Access to the pool is unprotected so you might create a reserve for jumbo
> frames only to have them consumed by something else entirely. It's not
> clear if that is even fixable as GFP flags are too coarse.

If it's consumed by something else then the parameters or the jumbo frame
setting may be adjusted. This feature is off by default so it's only used
for tuning purposes.

> It is not covered in the changelog why MIGRATE_HIGHATOMIC was not
> sufficient for jumbo frames which are generally expected to be allocated
> from atomic context. If there is a problem there then maybe
> MIGRATE_HIGHATOMIC should be made more strict instead of a hack like
> this. It'll be very difficult, if not impossible, for this to be tuned
> properly.

This approach has been in use for a decade or so as mentioned in the patch
description. So please be careful with impossibility claims. This enables
handling of larger contiguous blocks of memory that are required in some
circumstances and it has been doing that successfully (although with some
tuning effort).

> Finally, while I accept that fragmentation over time is a problem for
> unmovable allocations (fragmentation protection was originally designed
> for THP/hugetlbfs), this is papering over the problem. If greater
> protections are needed then the right approach is to be more strict about
> fallbacks. Specifically, unmovable allocations should migrate all movable
> pages out of migrate_unmovable pageblocks before falling back and that
> can be controlled by policy due to the overhead of migration. For atomic
> allocations, allow fallback but use kcompact or a workqueue to migrate
> movable pages out of migrate_unmovable pageblocks to limit fallbacks in
> the future.

This is also papering over more issues. While these measures may delay
fragmentation a bit more, they will not result in a pool of large
pages being available for the system throughout its lifetime.

> I'm not a fan of this patch.

I am also not a fan of this patch but this is enabling something that we
wanted for a long time. Consistent ability in a limited way to allocate
large page orders.

Since we have failed to address this in any other way, this may be the best ad
hoc method to get there. What we have done to address fragmentation so far
are all these preventative measures that get more ineffective as time
progresses while memory sizes increase. Either we do this or we need to
actually do one of the other known measures to address fragmentation like
making inode/dentries movable.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-19 10:19   ` Mel Gorman
  2018-02-19 14:42     ` Michal Hocko
  2018-02-19 15:09     ` Christopher Lameter
@ 2018-02-22 21:19     ` Thomas Schoebel-Theuer
  2018-02-22 21:53       ` Zi Yan
  2018-02-23  9:59       ` Mel Gorman
  2 siblings, 2 replies; 38+ messages in thread
From: Thomas Schoebel-Theuer @ 2018-02-22 21:19 UTC (permalink / raw)
  To: Mel Gorman, Christoph Lameter
  Cc: Matthew Wilcox, linux-mm, linux-rdma, akpm, andi, Rik van Riel,
	Michal Hocko, Guy Shattah, Anshuman Khandual, Michal Nazarewicz,
	Vlastimil Babka, David Nellans, Laura Abbott, Pavel Machek,
	Dave Hansen, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 5015 bytes --]

On 02/19/18 11:19, Mel Gorman wrote:
>
>> Index: linux/mm/page_alloc.c
>> ===================================================================
>> --- linux.orig/mm/page_alloc.c
>> +++ linux/mm/page_alloc.c
>> @@ -1844,7 +1844,12 @@ struct page *__rmqueue_smallest(struct z
>>   		area = &(zone->free_area[current_order]);
>>   		page = list_first_entry_or_null(&area->free_list[migratetype],
>>   							struct page, lru);
>> -		if (!page)
>> +		/*
>> +		 * Continue if no page is found or if our freelist contains
>> +		 * less than the minimum pages of that order. In that case
>> +		 * we better look for a different order.
>> +		 */
>> +		if (!page || area->nr_free < area->min)
>>   			continue;
>>   		list_del(&page->lru);
>>   		rmv_page_order(page);
> This is surprising to say the least. Assuming reservations are at order-3,
> this would refuse to split order-3 even if there was sufficient reserved
> pages at higher orders for a reserve.

Hi Mel,

I agree with you that the above code does not really do what it should.

At least, the condition needs to be changed to:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c9688b6a0a..193dfd85a6b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1837,7 +1837,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		area = &(zone->free_area[current_order]);
 		page = list_first_entry_or_null(&area->free_list[migratetype],
 							struct page, lru);
-		if (!page)
+		/*
+		 * Continue if no page is found or if we are about to
+		 * split a truly higher order than requested.
+		 * There is no limit for just _using_ exactly the right
+		 * order. The limit is only for _splitting_ some
+		 * higher order.
+		 */
+		if (!page ||
+		    (area->nr_free < area->min && current_order > order))
 			continue;
 		list_del(&page->lru);
 		rmv_page_order(page);


The "&& current_order > order" part is _crucial_. If left out, it will 
work even counter-productive. I know this from development of my 
original patch some years ago.

Please have a look at the attached patchset for kernel 3.16 which is in 
_production_ at 1&1 Internet SE on about 20,000 servers for several 
years now, starting from kernel 3.2.x to 3.16.x (or maybe the very first 
version was for 2.6.32, I don't remember exactly).

It has collected several million hours of operation in total, and it is 
known to work miracles for some of our workloads.

Porting to later kernels should be relatively easy. Also notice that the 
switch labels in patch #2 may need some minor tweaking, e.g. also 
including ZONE_DMA32 or similar, and also might need some 
architecture-specific tweaking. All of the tweaking depends on the 
actual workload. I am using it only on datacenter servers (webhosting) 
and on x86_64.

Please notice that the user interface of my patchset is extremely simple 
and can be easily understood by junior sysadmins:

After running your box for several days or weeks or even months (or 
possibly, after you just got an OOM), just do
# cat /proc/sys/vm/perorder_statistics > /etc/defaults/my_perorder_reserve

Then add a trivial startup script, e.g. to systemd or to sysv init etc, 
which just does the following early during the next reboot:
# cat /etc/defaults/my_perorder_reserve > /proc/sys/vm/perorder_reserve

That's it.

No need for a deep understanding of the theory of the memory 
fragmentation problem.

Also no need for adding anything to the boot commandline. Fragmentation 
will typically occur only after some days or weeks or months of 
operation, at least in all of the practical cases I have personally seen 
at 1&1 datacenters and their workloads.

Please notice that fragmentation can be a very serious problem for 
operations if you are hurt by it. It can seriously harm your business. 
And it is _extremely_ specific to the actual workload, and to the 
hardware / chipset / etc. This is addressed by the above method of 
determining the right values from _actual_ operations (not from 
speculation) and then memoizing them.

The attached patchset tries to be very simple, but in my practical 
experience it is a very effective practical solution.

When requested, I can post the mathematical theory behind the patch, or 
I could give a presentation at some of the next conferences if I would 
be invited (or better give a practical explanation instead). But 
probably nobody on these lists wants to deal with any theories.

Just _play_ with the patchset practically, and then you will notice.

Cheers and greetings,

Yours sincerely, old-school hacker Thomas


P.S. I cannot attend these lists full-time due to my workload at 1&1 
which is unfortunately not designed for upstream hacking, so please stay 
patient with me if an answer takes a few days.



[-- Attachment #2: 0001-mm-fix-fragmentation-by-pre-reserving-higher-order-p.patch --]
[-- Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-22 21:19     ` Thomas Schoebel-Theuer
@ 2018-02-22 21:53       ` Zi Yan
  2018-02-23  2:01         ` Christopher Lameter
  2018-02-23  9:59       ` Mel Gorman
  1 sibling, 1 reply; 38+ messages in thread
From: Zi Yan @ 2018-02-22 21:53 UTC (permalink / raw)
  To: Thomas Schoebel-Theuer
  Cc: Mel Gorman, Christoph Lameter, Matthew Wilcox, linux-mm,
	linux-rdma, akpm, andi, Rik van Riel, Michal Hocko, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen,
	Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 1445 bytes --]

On 22 Feb 2018, at 16:19, Thomas Schoebel-Theuer wrote:

<snip>
>
> No need for a deep understanding of the theory of the memory fragmentation problem.
>
> Also no need for adding anything to the boot commandline. Fragmentation will typically occur only after some days or weeks or months of operation, at least in all of the practical cases I have personally seen at 1&1 datacenters and their workloads.
>
> Please notice that fragmentation can be a very serious problem for operations if you are hurt by it. It can seriously harm your business. And it is _extremely_ specific to the actual workload, and to the hardware / chipset / etc. This is addressed by the above method of determining the right values from _actual_ operations (not from speculation) and then memoizing them.
>
> The attached patchset tries to be very simple, but in my practical experience it is a very effective practical solution.
>
> When requested, I can post the mathematical theory behind the patch, or I could give a presentation at some of the next conferences if I would be invited (or better give a practical explanation instead). But probably nobody on these lists wants to deal with any theories.

Hi Thomas,

I am very interested in the theory behind your patch. Do you mind sharing it? Is there
any required math background before reading it? Are there any related papers/articles I could
also read?

Thanks.

--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-22 21:53       ` Zi Yan
@ 2018-02-23  2:01         ` Christopher Lameter
  2018-02-23  2:16           ` Zi Yan
  0 siblings, 1 reply; 38+ messages in thread
From: Christopher Lameter @ 2018-02-23  2:01 UTC (permalink / raw)
  To: Zi Yan
  Cc: Thomas Schoebel-Theuer, Mel Gorman, Matthew Wilcox, linux-mm,
	linux-rdma, akpm, andi, Rik van Riel, Michal Hocko, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen,
	Mike Kravetz

On Thu, 22 Feb 2018, Zi Yan wrote:

> I am very interested in the theory behind your patch. Do you mind sharing it? Is there
> any required math background before reading it? Is there any related papers/articles I could
> also read?

His patches were attached to the email you responded to. Guess I should
update the patchset with the suggested changes and repost.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-23  2:01         ` Christopher Lameter
@ 2018-02-23  2:16           ` Zi Yan
  2018-02-23  2:45             ` Christopher Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Zi Yan @ 2018-02-23  2:16 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Thomas Schoebel-Theuer, Mel Gorman, Matthew Wilcox, linux-mm,
	linux-rdma, akpm, andi, Rik van Riel, Michal Hocko, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen,
	Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 911 bytes --]

Yes. I saw the attached patches. I am definitely going to apply them and see how they work out.

In his last patch, there are a bunch of magic numbers used to reserve free page blocks
at different orders. I think that is the most interesting part. If Thomas can share how
to determine these numbers with his theory based on workloads, hardware/chipset, that would
be a great guideline for sysadmins to take advantage of the patches.

—
Best Regards,
Yan Zi

On 22 Feb 2018, at 21:01, Christopher Lameter wrote:

> On Thu, 22 Feb 2018, Zi Yan wrote:
>
>> I am very interested in the theory behind your patch. Do you mind sharing it? Is there
>> any required math background before reading it? Is there any related papers/articles I could
>> also read?
>
> His patches were attached to the email you responded to. Guess I should
> update the patchset with the suggested changes and repost.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 557 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-23  2:16           ` Zi Yan
@ 2018-02-23  2:45             ` Christopher Lameter
  0 siblings, 0 replies; 38+ messages in thread
From: Christopher Lameter @ 2018-02-23  2:45 UTC (permalink / raw)
  To: Zi Yan
  Cc: Thomas Schoebel-Theuer, Mel Gorman, Matthew Wilcox, linux-mm,
	linux-rdma, akpm, andi, Rik van Riel, Michal Hocko, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen,
	Mike Kravetz

On Thu, 22 Feb 2018, Zi Yan wrote:

> Yes. I saw the attached patches. I am definitely going to apply them and see how they work out.
>
> In his last patch, there are a bunch of magic numbers used to reserve free page blocks
> at different orders. I think that is the most interesting part. If Thomas can share how
> to determine these numbers with his theory based on workloads, hardware/chipset, that would
> be a great guideline for sysadmins to take advantage of the patches.

These numbers are specific to the loads encountered in his situation and
the patches are specific to the machine configurations in his environment.

I have tried to generalize his idea and produce a patchset that is
reviewable and acceptable. I will update the patchset as needed.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 1/2] Protect larger order pages from breaking up
  2018-02-22 21:19     ` Thomas Schoebel-Theuer
  2018-02-22 21:53       ` Zi Yan
@ 2018-02-23  9:59       ` Mel Gorman
  1 sibling, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2018-02-23  9:59 UTC (permalink / raw)
  To: Thomas Schoebel-Theuer
  Cc: Christoph Lameter, Matthew Wilcox, linux-mm, linux-rdma, akpm,
	andi, Rik van Riel, Michal Hocko, Guy Shattah, Anshuman Khandual,
	Michal Nazarewicz, Vlastimil Babka, David Nellans, Laura Abbott,
	Pavel Machek, Dave Hansen, Mike Kravetz

On Thu, Feb 22, 2018 at 10:19:32PM +0100, Thomas Schoebel-Theuer wrote:
> Please have a look at the attached patchset for kernel 3.16 which is in
> _production_ at 1&1 Internet SE at about 20,000 servers for several years
> now, starting from kernel 3.2.x to 3.16.x (or maybe the very first version
> was for 2.6.32, I don't remember exactly).
> 

3.16 is 4 years old. Crucially, it's missing at least commit
0aaa29a56e4fb ("mm, page_alloc: reserve pageblocks for high-order atomic
allocations on demand") and commit 97a16fc82a7c5 ("mm, page_alloc: only
enforce watermarks for order-0 allocations"), both of which were
introduced in 4.4 (2 years ago) and both which have a significant impact
on the treatment of high-order allocation requests.

> It has collected several millions of operation hours in total, and it is
> known to work miracles for some of our workloads.
> 

Be that as it may, it does not prove that it's necessary for current
kernels, whether the tuning is necessary, or whether it's possible to deal with
this without manual monitoring and tuning of individual hardware
configurations or workloads.

> Porting to later kernels should be relatively easy. Also notice that the
> switch labels at patch #2 could need some minor tweaking, e.g. also
> including ZONE_DMA32 or similar, and also might need some
> architecture-specific tweaking. All of the tweaking is depending on the
> actual workload. I am using it only at datacenter servers (webhosting) and
> at x86_64.
> 
> Please notice that the user interface of my patchset is extremely simple and
> can be easily understood by junior sysadmins:
> 
> After running your box for several days or weeks or even months (or
> possibly, after you just got an OOM), just do
> # cat /proc/sys/vm/perorder_statistics > /etc/defaults/my_perorder_reserve
> 

And my point was that in so far as it is possible, this should be managed
without tuning at all. My concern is that if the patches were merged as-is
without supporting proof showing how and when they are necessary, they would
effectively be dead code.

> Also no need for adding anything to the boot commandline. Fragmentation will
> typically occur only after some days or weeks or months of operation, at
> least in all of the practical cases I have personally seen at 1&1
> datacenters and their workloads.
> 

I accept the logic but it's also been a long time since I received a
high-order-atomic-allocation failure bug in the field. That said, none of
the field situations I deal with use SLUB, on the grounds that the reliance on
high-order allocations for high performance can be problematic in itself
over sufficiently long uptimes.

> When requested, I can post the mathematical theory behind the patch, or I
> could give a presentation at some of the next conferences if I would be
> invited (or better give a practical explanation instead). But probably
> nobody on these lists wants to deal with any theories.
> 

I think I'm ok, I should have a reasonable grounding in the relevant
theory to not require a detailed explanation.

I accept that your day-to-day situation does not allow much upstream
hacking, but the associated data for the patches are in general
insufficient to show that this is a problem with current kernels and that a
tunable is absolutely required. The current HIGHATOMIC protection
and alternative treatment of watermarks may be enough. Alternatively, it
may be necessary to more aggressively protect MIGRATE_UNMOVABLE
pageblocks from being polluted with MOVABLE pages when fallbacks and
memory pressure occur, but that has similarly not been proven.

A changelog for adding new pools should include details on why the
existing mechanisms do not work, why they cannot be handled
automatically (e.g. preemptively moving MOVABLE pages out of UNMOVABLE
blocks before fragmentation degrades further) and an example of an OOM
caused by fragmentation.

I recognise that the burden of proof is high in this case but I'm not
comfortable with adding tuning and maintenance overhead just in case
it's required.

> Just _play_ with the patchset practically, and then you will notice.
> 

Unfortunately, since the last round of patches I wrote dealing with
high-order allocation failures, I have not personally encountered a situation
whereby performance or functionality were limited by high-order allocation
delays or failures. It could be argued that THP allocations were a problem
but for the most part, that has been dealt with by not stalling aggressively
any more as the overhead was too high for the relatively marginal gain (this
is not universally accepted but in those cases, it goes back to preemptively
moving MOVABLE pages out of UNMOVABLE pageblocks). I could play with the
patch but it's highly unlikely I'll detect a difference. While I have a
test-bed, it doesn't have the load of long-uptime, complex applications
depending on jumbo frame allocations critical for high performance.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2018-02-23  9:59 UTC | newest]

Thread overview: 38+ messages
2018-02-16 16:01 [RFC 0/2] Larger Order Protection V1 Christoph Lameter
2018-02-16 16:01 ` [RFC 1/2] Protect larger order pages from breaking up Christoph Lameter
2018-02-16 16:01   ` Christoph Lameter
     [not found]   ` <20180216160121.519788537-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
2018-02-16 17:03     ` Andi Kleen
2018-02-16 17:03       ` Andi Kleen
     [not found]       ` <20180216170354.vpbuugzqsrrfc4js-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
2018-02-16 18:25         ` Christopher Lameter
2018-02-16 18:25           ` Christopher Lameter
2018-02-16 19:01     ` Dave Hansen
2018-02-16 19:01       ` Dave Hansen
2018-02-16 20:15       ` Christopher Lameter
2018-02-16 21:08         ` Dave Hansen
2018-02-16 21:08           ` Dave Hansen
2018-02-16 21:43           ` Matthew Wilcox
     [not found]             ` <20180216214353.GA32655-PfSpb0PWhxZc2C7mugBRk2EX/6BAtgUQ@public.gmane.org>
2018-02-16 21:47               ` Dave Hansen
2018-02-16 21:47                 ` Dave Hansen
2018-02-16 18:02   ` Randy Dunlap
     [not found]     ` <b76028c6-c755-8178-2dfc-81c7db1f8bed-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2018-02-17 16:07       ` Mike Rapoport
2018-02-17 16:07         ` Mike Rapoport
2018-02-16 18:59   ` Mike Kravetz
     [not found]     ` <5108eb20-2b20-bd48-903e-bce312e96974-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2018-02-16 20:13       ` Christopher Lameter
2018-02-16 20:13         ` Christopher Lameter
2018-02-18  9:00         ` Guy Shattah
2018-02-18  9:00           ` Guy Shattah
2018-02-19 10:19   ` Mel Gorman
2018-02-19 14:42     ` Michal Hocko
2018-02-19 15:09     ` Christopher Lameter
2018-02-22 21:19     ` Thomas Schoebel-Theuer
2018-02-22 21:53       ` Zi Yan
2018-02-23  2:01         ` Christopher Lameter
2018-02-23  2:16           ` Zi Yan
2018-02-23  2:45             ` Christopher Lameter
2018-02-23  9:59       ` Mel Gorman
2018-02-16 16:01 ` [RFC 2/2] Page order diagnostics Christoph Lameter
2018-02-16 16:01   ` Christoph Lameter
     [not found]   ` <20180216160121.583566579-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
2018-02-17 21:17     ` Pavel Machek
2018-02-17 21:17       ` Pavel Machek
2018-02-19 14:54       ` Christopher Lameter
2018-02-16 18:27 ` [RFC 0/2] Larger Order Protection V1 Christopher Lameter
