* [PATCH 0/2] Move away from non-failing small allocations
@ 2015-03-11 20:54 ` Michal Hocko
  0 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-11 20:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Dave Chinner, Mel Gorman, Rik van Riel,
	Wu Fengguang, linux-mm, LKML, Linux API

Hi,
as per the discussion at the LSF/MM summit a few days back, there seems
to be general agreement on moving away from the "small allocations do
not fail" concept.

There are two patches in this series. The first one exports a sysctl
knob which controls how hard small allocations (!__GFP_NOFAIL ones, of
course) retry when we get completely out of memory before the
allocation fails. The default is still to retry infinitely because we
cannot simply change 14+ years of behavior right away. It will take
years before all the potential fallouts are discovered and fixed, and
only then can we change the default value.

The second patch is the first step in the transition plan. It changes
the default but it is NOT upstream material. It is aimed at brave
testers who can cope with failures. I have talked to Andrew and he was
willing to keep that patch in the mmotm tree. It would be even better
to have it in linux-next because the testing coverage would be even
bigger. Dave Chinner has also shown interest in integrating it into
his xfstests farm. It would be great if Fengguang could add it to the
0-day testing project too (if pushing the patch into linux-next would
be too controversial).
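
On a kernel with patch 1 applied, the knob is a regular sysctl under
/proc/sys/vm (the name below is taken from the patch itself; treat the
exact path as tentative until it is merged). A tester could drive it
like this:

```shell
# Show the current retry limit (the default, ULONG_MAX, means
# "retry forever", i.e. the old behavior).
cat /proc/sys/vm/retry_allocation_attempts

# Allow roughly ten OOM-involved retries before small allocations
# may start failing.
sysctl -w vm.retry_allocation_attempts=10

# Persist across reboots on a dedicated test box.
echo "vm.retry_allocation_attempts = 10" >> /etc/sysctl.conf
```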


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* [PATCH 1/2] mm: Allow small allocations to fail
@ 2015-03-11 20:54   ` Michal Hocko
  0 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-11 20:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Dave Chinner, Mel Gorman, Rik van Riel,
	Wu Fengguang, linux-mm, LKML, Linux API

Small allocations have basically implied __GFP_NOFAIL behavior for
ages (with some nuance, e.g. an allocation might fail if the current
task is an OOM victim). The idea at the time was that the OOM killer
would make sufficient progress so that the small allocation would
succeed in the end.
This assumption is flawed, though, because the retrying allocation
might be blocking a resource (e.g. a lock) which prevents the OOM
killer victim from making progress, so the system is effectively
deadlocked.

Another aspect is that this behavior makes it extremely hard to
implement any allocation failure policy at the allocation caller. Most
of the allocation paths already check for allocation failure and
handle it properly.

There are some places which BUG_ON failure (mostly early boot code)
and they do not have to be changed. It is better to see a panic than a
silent freeze of the machine, which is what happens now.

Finally, if a non-failing allocation is unavoidable then the
__GFP_NOFAIL flag is there to express this strong requirement. It is
much better to have a simple way to find all those places and come up
with a solution which guarantees forward progress for them.
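
As a caller-side illustration of that split (a hypothetical
kernel-style sketch, not code from this series; foo/bar and the helper
names are made up):

```c
/* Hypothetical illustration only; none of these symbols exist as such. */

/* Typical caller: has a failure policy, so a NULL check is enough. */
static struct foo *foo_alloc(void)
{
	struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return NULL;	/* caller handles the -ENOMEM path */
	foo_init(f);
	return f;
}

/*
 * Rare caller that genuinely cannot fail: say so explicitly with
 * __GFP_NOFAIL, so such sites stay easy to audit for forward progress.
 */
static struct bar *bar_alloc_nofail(void)
{
	return kmalloc(sizeof(struct bar), GFP_KERNEL | __GFP_NOFAIL);
}
```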

As this behavior has been established for many years we cannot change
it immediately. This patch instead exports a new sysctl/proc knob
which tells the allocator how many times to retry. The higher the
number, the longer the allocator will loop and try to trigger the OOM
killer when memory is too low. This implementation counts only those
retries which involved the OOM killer because we do not want to be too
eager to fail the request.

I have tested it on a small machine (100M RAM + 128M swap, 2 CPUs),
hammering it with anon consumers (10x 100M anon mappings per process
and 10 processes running in parallel):
$ grep "Out of memory" anon-oom-1-retry| wc -l
8
$ grep "allocation failure" anon-oom-1-retry| wc -l
39

$ grep "Out of memory" anon-oom-10-retry| wc -l
10
$ grep "allocation failure" anon-oom-10-retry| wc -l
32

$ grep "Out of memory" anon-oom-100-retry| wc -l
10
$ grep "allocation failure" anon-oom-100-retry| wc -l
0

The default value is ULONG_MAX, which basically preserves the current
behavior (endless retries). The idea is that we start with testing
systems first and lower the value to catch potential fallouts (crashes
due to unchecked failures or other misbehavior like FS ro-remounts
etc.). Allocation failures are already reported by warn_alloc_failed
so we should be able to catch the allocation path before an issue is
triggered.
We will try to encourage distributions to change the default in the
second step so that we get a much bigger exposure.

And finally we can change the default in the kernel while still
keeping the knob for conservative configurations. This will be a long
run but let's start.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/sysctl/vm.txt | 24 ++++++++++++++++++++++++
 include/linux/mm.h          |  2 ++
 kernel/sysctl.c             |  8 ++++++++
 mm/page_alloc.c             | 28 ++++++++++++++++++++++------
 4 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index e9c706e4627a..09f352ef8c3c 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -53,6 +53,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - panic_on_oom
 - percpu_pagelist_fraction
+- retry_allocation_attempts
 - stat_interval
 - swappiness
 - user_reserve_kbytes
@@ -707,6 +708,29 @@ sysctl, it will revert to this default behavior.
 
 ==============================================================
 
+retry_allocation_attempts
+
+The page allocator tries hard not to fail small allocation requests.
+Currently it retries indefinitely for small allocation requests (<= 32kB).
+This works mostly fine, but under extreme low memory conditions the
+system might end up in deadlock situations because the looping
+allocation request might block further progress of OOM killer victims.
+
+Even though this hasn't turned out to be a huge problem for many
+years, the long term plan is to move away from this default behavior,
+but as it is long established we cannot change it immediately.
+
+This knob should help in the transition. It tells how many times the
+allocator should retry when the system is OOM before the allocation
+fails. The default value (ULONG_MAX) preserves the old behavior. This
+is a safe default for production systems which cannot afford unexpected
+downtimes. More experimental systems might set it to a small number
+(>=1); the higher the value, the less probable allocation failures are
+when the OOM condition is transient and could be resolved without
+failing the particular allocation.
+
+==============================================================
+
 stat_interval
 
 The time interval between which vm statistics are updated.  The default
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b720b5146a4e..e3b42f46e743 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -75,6 +75,8 @@ extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern unsigned long sysctl_overcommit_kbytes;
 
+extern unsigned long sysctl_nr_alloc_retry;
+
 extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
 				    size_t *, loff_t *);
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 88ea2d6e0031..4525f25e961b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1499,6 +1499,14 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
+	{
+		.procname	= "retry_allocation_attempts",
+		.data		= &sysctl_nr_alloc_retry,
+		.maxlen		= sizeof(sysctl_nr_alloc_retry),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= &one_ul,
+	},
 	{ }
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 58f6cf5bdde2..7ae07a5d08df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -123,6 +123,17 @@ unsigned long dirty_balance_reserve __read_mostly;
 int percpu_pagelist_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
+/*
+ * Number of allocation retries after the system is considered OOM.
+ * We have been retrying indefinitely for low order allocations for
+ * a very long time and this sysctl should help us to move away from
+ * this behavior because it complicates low memory conditions handling.
+ * The current default is preserving the behavior but non-critical
+ * environments are encouraged to lower the value to catch potential
+ * issues which should be fixed.
+ */
+unsigned long sysctl_nr_alloc_retry = ULONG_MAX;
+
 #ifdef CONFIG_PM_SLEEP
 /*
  * The following functions are used by the suspend/hibernate code to temporarily
@@ -2322,7 +2333,8 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 static inline int
 should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 				unsigned long did_some_progress,
-				unsigned long pages_reclaimed)
+				unsigned long pages_reclaimed,
+				unsigned long nr_retries)
 {
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
@@ -2342,11 +2354,12 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 
 	/*
 	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
-	 * means __GFP_NOFAIL, but that may not be true in other
-	 * implementations.
+	 * retries allocations as per global configuration which might
+	 * also be indefinitely.
 	 */
-	if (order <= PAGE_ALLOC_COSTLY_ORDER)
-		return 1;
+	if (order <= PAGE_ALLOC_COSTLY_ORDER &&
+			nr_retries < sysctl_nr_alloc_retry)
+		return 1;
 
 	/*
 	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
@@ -2651,6 +2664,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long nr_retries = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2794,7 +2808,7 @@ retry:
 	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, did_some_progress,
-						pages_reclaimed)) {
+			       pages_reclaimed, nr_retries)) {
 		/*
 		 * If we fail to make progress by freeing individual
 		 * pages, but the allocation wants us to keep going,
@@ -2807,6 +2821,8 @@ retry:
 				goto got_pg;
 			if (!did_some_progress)
 				goto nopage;
+
+			nr_retries++;
 		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-- 
2.1.4



* [PATCH 2/2] mmotm: Enable small allocation to fail
  2015-03-11 20:54 ` Michal Hocko
@ 2015-03-11 20:54   ` Michal Hocko
  -1 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-11 20:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Dave Chinner, Mel Gorman, Rik van Riel,
	Wu Fengguang, linux-mm, LKML, Linux API

Let's break the universe... for those who are willing and brave enough to
run the mmotm (and ideally linux-next) tree. OOM situations will most
probably expose bugs which have been hidden for years, but it is time we
eat our own dog food and finally fix them.

The patch itself is trivial. Simply allow only one allocation retry
after the OOM killer has been triggered.

THIS IS NOT a patch to be merged into Linus' tree. At least not now.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7ae07a5d08df..583f0f27c97e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -132,7 +132,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
  * environments are encouraged to lower the value to catch potential
  * issues which should be fixed.
  */
-unsigned long sysctl_nr_alloc_retry = ULONG_MAX;
+unsigned long sysctl_nr_alloc_retry = 1;
 
 #ifdef CONFIG_PM_SLEEP
 /*
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 63+ messages in thread


* Re: [PATCH 0/2] Move away from non-failing small allocations
@ 2015-03-11 22:36   ` Sasha Levin
  0 siblings, 0 replies; 63+ messages in thread
From: Sasha Levin @ 2015-03-11 22:36 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Johannes Weiner, Dave Chinner, Mel Gorman, Rik van Riel,
	Wu Fengguang, linux-mm, LKML, Linux API

On 03/11/2015 04:54 PM, Michal Hocko wrote:
> The second patch is the first step in the transition plan. It changes
> the default but it is NOT an upstream material. It is aimed for brave
> testers who can cope with failures. I have talked to Andrew and he
> was willing to keep that patch in mmotm tree. It would be even better
> to have this in linux-next because the testing coverage would be even
> bigger. Dave Chinner has also shown an interest to integrate this into
> his xfstest farm. It would be great if Fenguang could add it into the
> zero testing project too (if the pushing the patch into linux-next
> would be too controversial).

Stuff in mmotm automatically ends up in linux-next.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 63+ messages in thread


* Re: [PATCH 1/2] mm: Allow small allocations to fail
  2015-03-11 20:54   ` Michal Hocko
@ 2015-03-12 12:54     ` Tetsuo Handa
  -1 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-03-12 12:54 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: hannes, david, mgorman, riel, fengguang.wu, fernando_b1,
	linux-mm, linux-kernel

(The Cc: line seems to be partially truncated. Please re-add if needed.)

Michal Hocko wrote:
> Finally, if a non-failing allocation is unavoidable then __GFP_NOFAIL
> flag is there to express this strong requirement. It is much better to
> have a simple way to check all those places and come up with a solution
> which will guarantee a forward progress for them.

Keeping gfp flags passed to ongoing allocation inside "struct task_struct"
will allow the OOM killer to skip OOM victims doing __GFP_NOFAIL.
http://marc.info/?l=linux-mm&m=141671829611143&w=2 would give a hint.

> As this behavior is established for many years we cannot change it
> immediately. This patch instead exports a new sysctl/proc knob which
> tells allocator how much to retry. The higher the number the longer will
> the allocator loop and try to trigger OOM killer when the memory is too
> low. This implementation counts only those retries which involved OOM
> killer because we do not want to be too eager to fail the request.

I prefer jiffies-based timeouts to retry counts, for jiffies will allow a
vmcore to tell how long the process was stalled in memory allocation.
http://marc.info/?l=linux-mm&m=141671821111135&w=1 and
http://marc.info/?l=linux-mm&m=141709978209207&w=1 would give a hint.

> The default value is ULONG_MAX which basically preserves the current
> behavior (endless retries). The idea is that we start with testing
> systems first and lower the value to catch potential fallouts (crashes
> due to unchecked failures or other misbehavior like FS ro-remounts
> etc...). Allocation failures are already reported by warn_alloc_failed
> so we should be able to catch the allocation path before an issue is
> triggered.

Few developers are using the fault-injection capability (CONFIG_FAILSLAB
and CONFIG_FAIL_PAGE_ALLOC). Even fewer developers would be performing OOM
stress tests. Printing allocation failure messages only upon an OOM
condition is Whack-A-Mole where moles remain hidden until distribution
kernel users trigger an OOM condition by chance (or by intent).

I tried SystemTap-based mandatory fault-injection hooks at
http://marc.info/?l=linux-kernel&m=141951300713051&w=2 and I reported
random crashes at
http://lists.freedesktop.org/archives/dri-devel/2015-January/075922.html .
How can we find the exact culprit allocation when an issue is triggered
some time after the first failure messages?

I think that your knob helps avoid an infinite loop if a lower value is
given, but I don't think that it helps catch potential fallouts.

> We will try to encourage distributions to change the default in the
> second step so that we get a much bigger exposure.

Can we expect that distribution kernel users are willing to perform OOM
stress tests which kernel developers did not perform?

> And finally we can change the default in the kernel while still keeping
> the knob for conservative configurations. This will be long run but
> let's start.

And finally what patches will you propose for already running systems
using distribution kernels? I can't wait for years (or decades) until
your knob and fixes for fallouts are backported.

^ permalink raw reply	[flat|nested] 63+ messages in thread


* Re: [PATCH 1/2] mm: Allow small allocations to fail
  2015-03-12 12:54     ` Tetsuo Handa
@ 2015-03-12 13:12       ` Michal Hocko
  -1 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-12 13:12 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, david, mgorman, riel, fengguang.wu, fernando_b1,
	linux-mm, linux-kernel

On Thu 12-03-15 21:54:47, Tetsuo Handa wrote:
> (The Cc: line seems to be partially truncated. Please re-add if needed.)
> 
> Michal Hocko wrote:
> > Finally, if a non-failing allocation is unavoidable then __GFP_NOFAIL
> > flag is there to express this strong requirement. It is much better to
> > have a simple way to check all those places and come up with a solution
> > which will guarantee a forward progress for them.
> 
> Keeping gfp flags passed to ongoing allocation inside "struct task_struct"
> will allow the OOM killer to skip OOM victims doing __GFP_NOFAIL.
> http://marc.info/?l=linux-mm&m=141671829611143&w=2 would give a hint.

Not related to the patch IMNHO. This patch is about allocations which
might fail. The comment simply says that those which _cannot_ fail should
be annotated properly.
 
> > As this behavior is established for many years we cannot change it
> > immediately. This patch instead exports a new sysctl/proc knob which
> > tells allocator how much to retry. The higher the number the longer will
> > the allocator loop and try to trigger OOM killer when the memory is too
> > low. This implementation counts only those retries which involved OOM
> > killer because we do not want to be too eager to fail the request.
> 
> I prefer jiffies-based timeouts to retry counts, for jiffies will allow a
> vmcore to tell how long the process was stalled in memory allocation.
> http://marc.info/?l=linux-mm&m=141671821111135&w=1 and
> http://marc.info/?l=linux-mm&m=141709978209207&w=1 would give a hint.
> 
> > The default value is ULONG_MAX which basically preserves the current
> > behavior (endless retries). The idea is that we start with testing
> > systems first and lower the value to catch potential fallouts (crashes
> > due to unchecked failures or other misbehavior like FS ro-remounts
> > etc...). Allocation failures are already reported by warn_alloc_failed
> > so we should be able to catch the allocation path before an issue is
> > triggered.
> 
> Few developers are using the fault-injection capability (CONFIG_FAILSLAB
> and CONFIG_FAIL_PAGE_ALLOC). Even fewer developers would be performing OOM
> stress tests. Printing allocation failure messages only upon an OOM
> condition is Whack-A-Mole where moles remain hidden until distribution
> kernel users trigger an OOM condition by chance (or by intent).
> 
> I tried SystemTap-based mandatory fault-injection hooks at
> http://marc.info/?l=linux-kernel&m=141951300713051&w=2 and I reported
> random crashes at
> http://lists.freedesktop.org/archives/dri-devel/2015-January/075922.html .
> How can we find the exact culprit allocation when an issue is triggered
> some time after the first failure messages?
> 
> I think that your knob helps avoid an infinite loop if a lower value is
> given, but I don't think that it helps catch potential fallouts.

An allocation failure with the full trace will at least help you see
_who_ might be causing troubles later. Some bugs might be subtle and
harder to debug, of course.
 
> > We will try to encourage distributions to change the default in the
> > second step so that we get a much bigger exposure.
> 
> Can we expect that distribution kernel users are willing to perform OOM
> stress tests which kernel developers did not perform?
> 
> > And finally we can change the default in the kernel while still keeping
> > the knob for conservative configurations. This will be long run but
> > let's start.
> 
> And finally what patches will you propose for already running systems
> using distribution kernels? I can't wait for years (or decades) until
> your knob and fixes for fallouts are backported.

I am afraid I do not care about old and distribution kernels in _this_
_proposal_. This is an attempt to move away from a concept which is not
healthy IMHO, and it will take a long time to make it happen. OOM
killer deadlocks are just one of the reasons. So please do not conflate
those two things.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread


* Re: [PATCH 1/2] mm: Allow small allocations to fail
  2015-03-11 20:54   ` Michal Hocko
@ 2015-03-15  5:43     ` Tetsuo Handa
  -1 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-03-15  5:43 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: hannes, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

Michal Hocko wrote:
> As this behavior is established for many years we cannot change it
> immediately. This patch instead exports a new sysctl/proc knob which
> tells allocator how much to retry. The higher the number the longer will
> the allocator loop and try to trigger OOM killer when the memory is too
> low. This implementation counts only those retries which involved OOM
> killer because we do not want to be too eager to fail the request.

I found that this patch conflicts with commit cc87317726f8 ("mm: page_alloc:
revert inadvertent !__GFP_FS retry behavior change") and thus counts retries
regardless of whether the OOM killer was involved, making !__GFP_FS allocations
fail as eagerly as commit 9879de7373fc ("mm: page_alloc: embed OOM killing
naturally into allocation slowpath") did when sysctl_nr_alloc_retry == 1.

----------
XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
warn_alloc_failed: 212565 callbacks suppressed
crond: page allocation failure: order:0, mode:0x2015a
rngd: page allocation failure: order:0, mode:0x2015a
CPU: 3 PID: 1667 Comm: rngd Not tainted 4.0.0-rc3+ #37
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
 0000000000000000 00000000ce4cec53 0000000000000000 ffffffff815f30c4
 000000000002015a ffffffff8111063e ffff88007fffdb00 0000000000000000
 0000000000000040 ffff88007c223db0 0000000000000000 00000000ce4cec53
Call Trace:
 [<ffffffff815f30c4>] ? dump_stack+0x40/0x50
 [<ffffffff8111063e>] ? warn_alloc_failed+0xee/0x150
 [<ffffffff81113b03>] ? __alloc_pages_nodemask+0x623/0xa10
 [<ffffffff81150c57>] ? alloc_pages_current+0x87/0x100
 [<ffffffff8110d30d>] ? filemap_fault+0x1bd/0x400
 [<ffffffff812e3dbc>] ? radix_tree_next_chunk+0x5c/0x240
 [<ffffffff8112f85b>] ? __do_fault+0x4b/0xe0
 [<ffffffff81134465>] ? handle_mm_fault+0xc85/0x1640
 [<ffffffff81051c9a>] ? __do_page_fault+0x16a/0x430
 [<ffffffff81051f90>] ? do_page_fault+0x30/0x70
 [<ffffffff815fb03f>] ? error_exit+0x1f/0x60
 [<ffffffff815fae18>] ? page_fault+0x28/0x30
----------

If you want to count only those retries which involved the OOM killer, you
need to do something like

-			nr_retries++;
+			if (gfp_mask & __GFP_FS)
+				nr_retries++;

in this patch.

^ permalink raw reply	[flat|nested] 63+ messages in thread


* Re: [PATCH 1/2] mm: Allow small allocations to fail
  2015-03-15  5:43     ` Tetsuo Handa
@ 2015-03-15 12:13       ` Michal Hocko
  -1 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-15 12:13 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

On Sun 15-03-15 14:43:37, Tetsuo Handa wrote:
[...]
> If you want to count only those retries which involved the OOM killer, you
> need to do something like
> 
> -			nr_retries++;
> +			if (gfp_mask & __GFP_FS)
> +				nr_retries++;
> 
> in this patch.

No, we shouldn't create another type of hidden NOFAIL allocation like
this. I understand that the wording of the changelog might be confusing,
though.

It says: "This implementation counts only those retries which involved
OOM killer because we do not want to be too eager to fail the request."

Would it be clearer if I changed that to:
"This implementation counts only those retries when the system is
considered OOM because all previous reclaim attempts have resulted
in no progress, as we do not want to be too eager to fail the
request."

We definitely _want_ to fail GFP_NOFS allocations.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread


* Re: [PATCH 1/2] mm: Allow small allocations to fail
  2015-03-15 12:13       ` Michal Hocko
@ 2015-03-15 13:06         ` Tetsuo Handa
  -1 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-03-15 13:06 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, hannes, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Sun 15-03-15 14:43:37, Tetsuo Handa wrote:
> [...]
> > If you want to count only those retries which involved OOM killer, you need
> > to do something like
> > 
> > -			nr_retries++;
> > +			if (gfp_mask & __GFP_FS)
> > +				nr_retries++;
> > 
> > in this patch.
> 
> No, we shouldn't create another type of hidden NOFAIL allocation like
> this. I understand that the wording of the changelog might be confusing,
> though.
> 
> It says: "This implementation counts only those retries which involved
> OOM killer because we do not want to be too eager to fail the request."
> 
> Would it be clearer if I changed that to:
> "This implementation counts only those retries when the system is
> considered OOM because all previous reclaim attempts have resulted
> in no progress because we do not want to be too eager to fail the
> request."
> 
> We definitely _want_ to fail GFP_NOFS allocations.

I see. The updated changelog is much clearer.

> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-15 13:06         ` Tetsuo Handa
@ 2015-03-16  7:46           ` Michal Hocko
  -1 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-16  7:46 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

On Sun 15-03-15 22:06:54, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > this. I understand that the wording of the changelog might be confusing,
> > though.
> > 
> > It says: "This implementation counts only those retries which involved
> > OOM killer because we do not want to be too eager to fail the request."
> > 
> > Would it be clearer if I changed that to:
> > "This implementation counts only those retries when the system is
> > considered OOM because all previous reclaim attempts have resulted
> > in no progress because we do not want to be too eager to fail the
> > request."
> > 
> > We definitely _want_ to fail GFP_NOFS allocations.
> 
> I see. The updated changelog is much clearer.

Patch with the updated changelog (no other changes)
---
From 615eb47a3915d2b7f1b366206793e1c286d91f40 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Tue, 10 Mar 2015 16:20:16 -0400
Subject: [PATCH] mm: Allow small allocations to fail

For ages, small allocations have basically implied __GFP_NOFAIL
behavior (with some nuance - e.g. the allocation might fail if the
current task is an OOM victim). The idea at the time was that the OOM
killer would make sufficient progress so that the small allocation
would succeed in the end.

This assumption is flawed, though, because the retrying allocation might
be blocking a resource (e.g. a lock) which might prevent the OOM killer
victim from making progress, and so the system is effectively deadlocked.

Another aspect is that this behavior makes it extremely hard to
implement any allocation failure policy at the allocation call site.
Most of the allocation paths already check for allocation failure and
handle it properly.

There are some places which BUG_ON failure (mostly early boot code) and
they do not have to be changed. It is better to see a panic rather than
a silent freeze of the machine which is the case now.

Finally, if a non-failing allocation is unavoidable then __GFP_NOFAIL
flag is there to express this strong requirement. It is much better to
have a simple way to check all those places and come up with a solution
which will guarantee a forward progress for them.

As this behavior has been established for many years we cannot change it
immediately. This patch instead exports a new sysctl/proc knob which
tells the allocator how many times to retry. The higher the number, the
longer the allocator will loop and try to trigger the OOM killer when
memory is too low. This implementation counts only those retries when
the system is considered OOM because all previous reclaim attempts have
resulted in no progress, because we do not want to be too eager to fail
the request.

I have tested it on a small machine (100M RAM + 128M swap, 2 CPUs),
hammering it with anon consumers (10x 100M anon mappings per process and
10 processes running in parallel):
$ grep "Out of memory" anon-oom-1-retry| wc -l
8
$ grep "allocation failure" anon-oom-1-retry| wc -l
39

$ grep "Out of memory" anon-oom-10-retry| wc -l
10
$ grep "allocation failure" anon-oom-10-retry| wc -l
32

$ grep "Out of memory" anon-oom-100-retry| wc -l
10
$ grep "allocation failure" anon-oom-100-retry| wc -l
0

The default value is ULONG_MAX which basically preserves the current
behavior (endless retries). The idea is that we start with testing
systems first and lower the value to catch potential fallouts (crashes
due to unchecked failures or other misbehavior like FS ro-remounts
etc...). Allocation failures are already reported by warn_alloc_failed
so we should be able to catch the allocation path before an issue is
triggered.

We will try to encourage distributions to change the default in the
second step so that we get a much bigger exposure.

And finally we can change the default in the kernel while still keeping
the knob for conservative configurations. This will be long run but
let's start.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/sysctl/vm.txt | 24 ++++++++++++++++++++++++
 include/linux/mm.h          |  2 ++
 kernel/sysctl.c             |  8 ++++++++
 mm/page_alloc.c             | 28 ++++++++++++++++++++++------
 4 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index e9c706e4627a..09f352ef8c3c 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -53,6 +53,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - panic_on_oom
 - percpu_pagelist_fraction
+- retry_allocation_attempts
 - stat_interval
 - swappiness
 - user_reserve_kbytes
@@ -707,6 +708,29 @@ sysctl, it will revert to this default behavior.
 
 ==============================================================
 
+retry_allocation_attempts
+
+The page allocator tries hard not to fail small allocation requests.
+Currently it retries indefinitely for small allocation requests (<= 32kB).
+This works mostly fine, but under extreme low memory conditions the
+system might end up in deadlock situations because the looping allocation
+request might block further progress for OOM killer victims.
+
+Even though this hasn't turned out to be a huge problem for many years,
+the long term plan is to move away from this default; but as it is a
+long established behavior we cannot change it immediately.
+
+This knob should help in the transition. It tells how many times the
+allocator should retry when the system is OOM before the allocation fails.
+The default value (ULONG_MAX) preserves the old behavior. This is a safe
+default for production systems which cannot afford any unexpected
+downtimes. More experimental systems might set it to a small number
+(>=1); the higher the value, the less probable allocation failures are
+when the OOM condition is transient and could be resolved without
+failing the particular allocation.
+
+==============================================================
+
 stat_interval
 
 The time interval between which vm statistics are updated.  The default
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b720b5146a4e..e3b42f46e743 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -75,6 +75,8 @@ extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern unsigned long sysctl_overcommit_kbytes;
 
+extern unsigned long sysctl_nr_alloc_retry;
+
 extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
 				    size_t *, loff_t *);
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 88ea2d6e0031..4525f25e961b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1499,6 +1499,14 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
+	{
+		.procname	= "retry_allocation_attempts",
+		.data		= &sysctl_nr_alloc_retry,
+		.maxlen		= sizeof(sysctl_nr_alloc_retry),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= &one_ul,
+	},
 	{ }
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 58f6cf5bdde2..7ae07a5d08df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -123,6 +123,17 @@ unsigned long dirty_balance_reserve __read_mostly;
 int percpu_pagelist_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
+/*
+ * Number of allocation retries after the system is considered OOM.
+ * We have been retrying indefinitely for low order allocations for
+ * a very long time and this sysctl should help us to move away from
+ * this behavior because it complicates low memory conditions handling.
+ * The current default preserves the behavior, but non-critical
+ * environments are encouraged to lower the value to catch potential
+ * issues which should be fixed.
+ */
+unsigned long sysctl_nr_alloc_retry = ULONG_MAX;
+
 #ifdef CONFIG_PM_SLEEP
 /*
  * The following functions are used by the suspend/hibernate code to temporarily
@@ -2322,7 +2333,8 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 static inline int
 should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 				unsigned long did_some_progress,
-				unsigned long pages_reclaimed)
+				unsigned long pages_reclaimed,
+				unsigned long nr_retries)
 {
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
@@ -2342,11 +2354,12 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 
 	/*
 	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
-	 * means __GFP_NOFAIL, but that may not be true in other
-	 * implementations.
+	 * retries allocations as per global configuration which might
+	 * also be indefinitely.
 	 */
-	if (order <= PAGE_ALLOC_COSTLY_ORDER)
-		return 1;
+	if (order <= PAGE_ALLOC_COSTLY_ORDER &&
+			nr_retries < sysctl_nr_alloc_retry)
+			return 1;
 
 	/*
 	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
@@ -2651,6 +2664,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long nr_retries = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2794,7 +2808,7 @@ retry:
 	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, did_some_progress,
-						pages_reclaimed)) {
+			       pages_reclaimed, nr_retries)) {
 		/*
 		 * If we fail to make progress by freeing individual
 		 * pages, but the allocation wants us to keep going,
@@ -2807,6 +2821,8 @@ retry:
 				goto got_pg;
 			if (!did_some_progress)
 				goto nopage;
+
+			nr_retries++;
 		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-- 
2.1.4
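As a usage sketch, on a test machine running a kernel with this patch the knob could be lowered like so (the value 100 is just an example; the sysctl name is the one added by the patch):

```shell
# Allow small allocations to fail after 100 OOM retries
# (testing systems only; the default ULONG_MAX keeps today's behavior).
sysctl -w vm.retry_allocation_attempts=100

# Equivalent via procfs:
echo 100 > /proc/sys/vm/retry_allocation_attempts
```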

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-16  7:46           ` Michal Hocko
@ 2015-03-16 21:11             ` Johannes Weiner
  -1 siblings, 0 replies; 63+ messages in thread
From: Johannes Weiner @ 2015-03-16 21:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, akpm, david, mgorman, riel, fengguang.wu, linux-mm,
	linux-kernel

On Mon, Mar 16, 2015 at 08:46:07AM +0100, Michal Hocko wrote:
> @@ -707,6 +708,29 @@ sysctl, it will revert to this default behavior.
>  
>  ==============================================================
>  
> +retry_allocation_attempts
> +
> +The page allocator tries hard not to fail small allocation requests.
> +Currently it retries indefinitely for small allocation requests (<= 32kB).
> +This works mostly fine, but under extreme low memory conditions the
> +system might end up in deadlock situations because the looping allocation
> +request might block further progress for OOM killer victims.
> +
> +Even though this hasn't turned out to be a huge problem for many years,
> +the long term plan is to move away from this default; but as it is a
> +long established behavior we cannot change it immediately.
> +
> +This knob should help in the transition. It tells how many times the
> +allocator should retry when the system is OOM before the allocation fails.
> +The default value (ULONG_MAX) preserves the old behavior. This is a safe
> +default for production systems which cannot afford any unexpected
> +downtimes. More experimental systems might set it to a small number
> +(>=1); the higher the value, the less probable allocation failures are
> +when the OOM condition is transient and could be resolved without
> +failing the particular allocation.

This is a negotiation between the page allocator and the various
requirements of its in-kernel users.  If *we* can't make an educated
guess with the entire codebase available, how the heck can we expect
userspace to?

And just assuming for a second that they actually do a better job than
us, are they going to send us a report of their workload and machine
specs and the value that worked for them?  Of course not, why would
you think they'd suddenly send anything but regression reports?

And we wouldn't get regression reports without changing the default,
because really, what is the incentive to mess with that knob?  Making
a lockup you probably never encountered less likely to trigger, while
adding failures of unknown quantity or quality into the system?

This is truly insane.  You're taking one magic factor out of a complex
kernel mechanism and dump it on userspace, which has neither reason
nor context to meaningfully change the default.  We'd never leave that
state of transition.  Only when machines do lock up in the wild, at
least we can tell them they should have set this knob to "like, 50?"

If we want to address this problem, we are the ones that have to make
the call.  Pick a value based on a reasonable model, make it the
default, then deal with the fallout and update our assumptions.

Once that is done, whether we want to provide a boolean failsafe to
revert this in the field is another question.

A sysctl certainly doesn't sound appropriate to me because this is not
a tunable that we expect people to set according to their usecase.  We
expect our model to work for *everybody*.  A boot flag would be
marginally better but it still reeks too much of tunable.

Maybe CONFIG_FAILABLE_SMALL_ALLOCS.  Maybe something more euphemistic.
But I honestly can't think of anything that wouldn't scream "horrible
leak of implementation details."  The user just shouldn't ever care.

Given that there are usually several stages of various testing between
when a commit gets merged upstream and when it finally makes it into a
critical production system, maybe we don't need to provide userspace
control over this at all?

So what value do we choose?

Once we kick the OOM killer we should give the victim some time to
exit and then try the allocation again.  Looping just ONCE after that
means we scan all the LRU pages in the system a second time and invoke
the shrinkers another twelve times, with ratios approaching 1.  If the
OOM killer doesn't yield an allocatable page after this, I see very
little point in going on.  After all, we expect all our callers to
handle errors.

So why not just pass an "oomed" bool to should_alloc_retry() and bail
on small allocations at that point?  Put it upstream and deal with the
fallout long before this hits critical infrastructure?  By presumably
fixing up caller error handling and GFP flags?
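[Editor's sketch: a minimal standalone model of the "oomed" flag alternative suggested here. The function and flag names mirror the kernel's, but the flag bit values and the simplified policy are illustrative only, not the actual kernel implementation.]

```c
#include <stdbool.h>

/* Illustrative values; the real GFP flag bits live in gfp.h. */
#define __GFP_NORETRY 0x1000u
#define __GFP_NOFAIL  0x2000u
#define PAGE_ALLOC_COSTLY_ORDER 3

typedef unsigned int gfp_t;

/*
 * Sketch: instead of comparing a retry counter against a sysctl,
 * take a single "oomed" flag and stop retrying small allocations
 * once the OOM killer has already been invoked.
 */
static int should_alloc_retry(gfp_t gfp_mask, unsigned int order, bool oomed)
{
	if (gfp_mask & __GFP_NORETRY)
		return 0;	/* caller asked not to loop */
	if (gfp_mask & __GFP_NOFAIL)
		return 1;	/* explicit non-failing request */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return !oomed;	/* small: retry until we have OOM-killed once */
	return 0;		/* costly orders bail out early, as before */
}
```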

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/2] Move away from non-failing small allocations
@ 2015-03-16 22:38   ` Andrew Morton
  0 siblings, 0 replies; 63+ messages in thread
From: Andrew Morton @ 2015-03-16 22:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Dave Chinner, Mel Gorman, Rik van Riel,
	Wu Fengguang, linux-mm, LKML, Linux API

On Wed, 11 Mar 2015 16:54:52 -0400 Michal Hocko <mhocko@suse.cz> wrote:

> as per discussion at LSF/MM summit few days back it seems there is a
> general agreement on moving away from "small allocations do not fail"
> concept.

Such a change affects basically every part of the kernel and every
kernel developer.  I expect most developers will say "it works well
enough and I'm not getting any bug reports so why should I spend time
on this?".  It would help if we were to explain the justification very
clearly.  https://lwn.net/Articles/636017/ is Jon's writeup of the
conference discussion.

Realistically, I don't think this overall effort will be successful -
we'll add the knob, it won't get enough testing and any attempt to
alter the default will be us deliberately destabilizing the kernel
without knowing how badly :(


I wonder if we can alter the behaviour only for filesystem code, so we
constrain the new behaviour just to that code where we're having
problems.  Most/all fs code goes via vfs methods so there's a reasonably
small set of places where we can call

static inline void enter_fs_code(struct super_block *sb)
{
	if (sb->my_small_allocations_can_fail)
		current->small_allocations_can_fail++;
}

that way (or something similar) we can select the behaviour on a per-fs
basis and the rest of the kernel remains unaffected.  Other subsystems
can opt in as well.
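[Editor's sketch: a standalone model of how the opt-in pair could work end to end. The field and helper names are hypothetical, following the snippet above; `current_task` stands in for the kernel's `current`.]

```c
#include <stdbool.h>

/* Hypothetical structures modelling the proposal above. */
struct super_block {
	bool my_small_allocations_can_fail;
};

struct task {
	int small_allocations_can_fail;	/* nesting counter */
};

static struct task current_task;	/* stands in for "current" */

static void enter_fs_code(struct super_block *sb)
{
	if (sb->my_small_allocations_can_fail)
		current_task.small_allocations_can_fail++;
}

static void leave_fs_code(struct super_block *sb)
{
	if (sb->my_small_allocations_can_fail)
		current_task.small_allocations_can_fail--;
}

/* The allocator slow path would consult the flag before looping forever. */
static bool small_alloc_may_fail(void)
{
	return current_task.small_allocations_can_fail > 0;
}
```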


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/2] Move away from non-failing small allocations
  2015-03-16 22:38   ` Andrew Morton
@ 2015-03-17  9:07     ` Michal Hocko
  -1 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-17  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Dave Chinner, Mel Gorman, Rik van Riel,
	Wu Fengguang, linux-mm, LKML, Linux API

On Mon 16-03-15 15:38:43, Andrew Morton wrote:
> On Wed, 11 Mar 2015 16:54:52 -0400 Michal Hocko <mhocko@suse.cz> wrote:
> 
> > as per discussion at LSF/MM summit few days back it seems there is a
> > general agreement on moving away from "small allocations do not fail"
> > concept.
> 
> Such a change affects basically every part of the kernel and every
> kernel developer.  I expect most developers will say "it works well
> enough and I'm not getting any bug reports so why should I spend time
> on this?".  It would help if we were to explain the justification very
> clearly.  https://lwn.net/Articles/636017/ is Jon's writeup of the
> conference discussion.

OK, I thought that the description in patch 1/2 was clear about the
motivation. I can try harder of course. Which part do you feel is
missing there? Or was it the cover letter that wasn't specific enough?
 
> Realistically, I don't think this overall effort will be successful -
> we'll add the knob, it won't get enough testing and any attempt to
> alter the default will be us deliberately destabilizing the kernel
> without knowing how badly :(

Without the knob we do not allow users to test this at all, though, and
the transition will _never_ happen. Which is IMHO bad.
 
> I wonder if we can alter the behaviour only for filesystem code, so we
> constrain the new behaviour just to that code where we're having
> problems.  Most/all fs code goes via vfs methods so there's a reasonably
> small set of places where we can call

We are seeing issues with the fs code now because the test cases which
led to the current discussion exercise FS code. The code which does
lock(); kmalloc(GFP_KERNEL) is not limited to filesystems, though. I am
pretty sure we can find other subsystems if we try hard enough.

> static inline void enter_fs_code(struct super_block *sb)
> {
> 	if (sb->my_small_allocations_can_fail)
> 		current->small_allocations_can_fail++;
> }
> 
> that way (or something similar) we can select the behaviour on a per-fs
> basis and the rest of the kernel remains unaffected.  Other subsystems
> can opt in as well.

This is basically leading to GFP_MAYFAIL, which is completely backwards
(the hard requirement should be the exception, not the default rule).
I really do not want to end up stuffing random may_fail annotations
all over the kernel.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-16 21:11             ` Johannes Weiner
@ 2015-03-17 10:25               ` Michal Hocko
  -1 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-17 10:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, akpm, david, mgorman, riel, fengguang.wu, linux-mm,
	linux-kernel

On Mon 16-03-15 17:11:46, Johannes Weiner wrote:
> On Mon, Mar 16, 2015 at 08:46:07AM +0100, Michal Hocko wrote:
> > @@ -707,6 +708,29 @@ sysctl, it will revert to this default behavior.
> >  
> >  ==============================================================
> >  
> > +retry_allocation_attempts
> > +
> > +The page allocator tries hard not to fail small allocation requests.
> > +Currently it retries indefinitely for small allocation requests (<= 32kB).
> > +This works mostly fine, but under extreme low-memory conditions the
> > +system might end up in a deadlock because the looping allocation
> > +request might block further progress for OOM killer victims.
> > +
> > +Even though this hasn't turned out to be a huge problem for many years,
> > +the long-term plan is to move away from this default, but as it is a
> > +long-established behavior we cannot change it immediately.
> > +
> > +This knob should help in the transition. It tells the allocator how
> > +many times to retry when the system is OOM before failing the
> > +allocation. The default value (ULONG_MAX) preserves the old behavior
> > +and is a safe default for production systems which cannot afford any
> > +unexpected downtime. More experimental systems might set it to a small
> > +number (>=1); the higher the value, the less probable allocation
> > +failures become when the OOM condition is transient and could be
> > +resolved without failing the particular allocation.
> 
> This is a negotiation between the page allocator and the various
> requirements of its in-kernel users.  If *we* can't make an educated
> guess with the entire codebase available, how the heck can we expect
> userspace to?
> 
> And just assuming for a second that they actually do a better job than
> us, are they going to send us a report of their workload and machine
> specs and the value that worked for them?  Of course not, why would
> you think they'd suddenly send anything but regression reports?
>
> And we wouldn't get regression reports without changing the default,
> because really, what is the incentive to mess with that knob?  Making
> a lockup you probably never encountered less likely to trigger, while
> adding failures of unknown quantity or quality into the system?
> 
> This is truly insane.  You're taking one magic factor out of a complex
> kernel mechanism and dump it on userspace, which has neither reason
> nor context to meaningfully change the default.  We'd never leave that
> state of transition.  Only when machines do lock up in the wild, at
> least we can tell them they should have set this knob to "like, 50?"
> 
> If we want to address this problem, we are the ones that have to make
> the call.  Pick a value based on a reasonable model, make it the
> default, then deal with the fallout and update our assumptions.
> 
> Once that is done, whether we want to provide a boolean failsafe to
> revert this in the field is another question.

I have no problem with having the behavior enabled/disabled rather than
a number if people think this is a better idea. That is the primary
reason I've posted this to the linux-api mailing list as well. I
definitely do not want to get stuck in a pick-your-number discussion.

> A sysctl certainly doesn't sound appropriate to me because this is not
> a tunable that we expect people to set according to their usecase.  We
> expect our model to work for *everybody*.  A boot flag would be
> marginally better but it still reeks too much of tunable.

I am OK with a boot option as well if the sysctl is considered
inappropriate. It is less flexible, though. Consider regression testing
where the same load is run twice, once with failing allocations and once
without. Why should we force the tester to do a reboot cycle?
 
> Maybe CONFIG_FAILABLE_SMALL_ALLOCS.  Maybe something more euphemistic.
> But I honestly can't think of anything that wouldn't scream "horrible
> leak of implementation details."  The user just shouldn't ever care.

Any config option basically means that distribution users will not get
to test this until distributions change the default, and that won't
happen until the testing coverage and period have been sufficient. See
the chicken-and-egg problem? This basically undermines the whole idea of
voluntary testing. So no, I really do not like this.
 
> Given that there are usually several stages of various testing between
> when a commit gets merged upstream and when it finally makes it into a
> critical production system, maybe we don't need to provide userspace
> control over this at all?

I can still see conservative users never changing this behavior, even
after the rest of the world trusts the new default. They should have a
way to disable it. Many of those are running distribution kernels, so
they really need a way to control the behavior, be it a boot-time option
or a sysctl. Historically we have used sysctls for backward
compatibility and I do not see any reason to be different here.

> So what value do we choose?
> 
> Once we kick the OOM killer we should give the victim some time to
> exit and then try the allocation again.  Looping just ONCE after that
> means we scan all the LRU pages in the system a second time and invoke
> the shrinkers another twelve times, with ratios approaching 1.  If the
> OOM killer doesn't yield an allocatable page after this, I see very
> little point in going on.  After all, we expect all our callers to
> handle errors.

I am OK with the single retry. As shown by the tests, the same load
might end up with fewer allocation failures with higher values, but that
is a detail. Users of !GFP_NOFAIL should be prepared for failures, and
if the failures are too excessive I agree this should be addressed in
the page allocator.
 
> So why not just pass an "oomed" bool to should_alloc_retry() and bail
> on small allocations at that point?  Put it upstream and deal with the
> fallout long before this hits critical infrastructure?  By presumably
> fixing up caller error handling and GFP flags?

This is way too risky IMO. We cannot change a long established behavior
that quickly. I do agree we should allow failing in linux-next and
development trees. So that it is us kernel developers to start testing
first. Then we have zero day testing projects and Fenguang has shown an
interest in this as well. I would also expect/hoped for some testing
internal within major distributions. We are no way close to have this
default behavior in the Linus tree, though.

That is why I've proposed 3 steps 1) voluntary testers, 2) distributions
default 3) upstream default. Why don't you think this is a proper
approach?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-16  7:46           ` Michal Hocko
@ 2015-03-17 11:13             ` Tetsuo Handa
  -1 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-03-17 11:13 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, hannes, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Sun 15-03-15 22:06:54, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > this. I understand that the wording of the changelog might be confusing,
> > > though.
> > > 
> > > It says: "This implementation counts only those retries which involved
> > > OOM killer because we do not want to be too eager to fail the request."
> > > 
> > > Would it be more clear if I changed that to?
> > > "This implementation counts only those retries when the system is
> > > considered OOM because all previous reclaim attempts have resulted
> > > in no progress because we do not want to be too eager to fail the
> > > request."
> > > 
> > > We definitely _want_ to fail GFP_NOFS allocations.
> > 
> > I see. The updated changelog is much more clear.
> 
> Patch with the updated changelog (no other changes)

Now the changelog is clear that "Involved OOM killer" == "__GFP_FS allocation"
and "Considered OOM" == "both __GFP_FS and !__GFP_FS allocation".

One more thing I want to confirm about this patch's changelog.
This patch will generate the same result shown below.

Tetsuo Handa wrote:
> I also tested on XFS. One is Linux 3.19 and the other is Linux 3.19
> with debug printk patch shown above. According to console logs,
> oom_kill_process() is trivially called via pagefault_out_of_memory()
> for the former kernel. Due to giving up !GFP_FS allocations immediately?
> 
> (From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-unpatched.txt.xz )
> ---------- xfs / Linux 3.19 ----------
> [  793.283099] su invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
> [  793.283102] su cpuset=/ mems_allowed=0
> [  793.283104] CPU: 3 PID: 9552 Comm: su Not tainted 3.19.0 #40
> [  793.283159] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> [  793.283161]  0000000000000000 ffff88007ac03bf8 ffffffff816ae9d4 000000000000bebe
> [  793.283162]  ffff880078b0d740 ffff88007ac03c98 ffffffff816ac7ac 0000000000000206
> [  793.283163]  0000000481f30298 ffff880073e55850 ffff88007ac03c88 ffff88007a20bef8
> [  793.283164] Call Trace:
> [  793.283169]  [<ffffffff816ae9d4>] dump_stack+0x45/0x57
> [  793.283171]  [<ffffffff816ac7ac>] dump_header+0x7f/0x1f1
> [  793.283174]  [<ffffffff8114b36b>] oom_kill_process+0x22b/0x390
> [  793.283177]  [<ffffffff810776d0>] ? has_capability_noaudit+0x20/0x30
> [  793.283178]  [<ffffffff8114bb72>] out_of_memory+0x4b2/0x500
> [  793.283179]  [<ffffffff8114bc37>] pagefault_out_of_memory+0x77/0x90
> [  793.283180]  [<ffffffff816aab2c>] mm_fault_error+0x67/0x140
> [  793.283182]  [<ffffffff8105a9f6>] __do_page_fault+0x3f6/0x580
> [  793.283185]  [<ffffffff810aed1d>] ? remove_wait_queue+0x4d/0x60
> [  793.283186]  [<ffffffff81070fcb>] ? do_wait+0x12b/0x240
> [  793.283187]  [<ffffffff8105abb1>] do_page_fault+0x31/0x70
> [  793.283189]  [<ffffffff816b83e8>] page_fault+0x28/0x30
> ---------- xfs / Linux 3.19 ----------

Are all memory allocations caused by page faults __GFP_FS allocations?
If memory allocations caused by page faults are !__GFP_FS allocations
(e.g. 0x2015a == __GFP_HARDWALL | __GFP_COLD | __GFP_IO | __GFP_WAIT |
__GFP_HIGHMEM | __GFP_MOVABLE), this patch will start trivially involving
the OOM killer for !__GFP_FS allocations.

I haven't checked how many processes can be killed via this path, but it
can potentially OOM-kill most of the OOM-killable processes, depending
on how long the OOM condition lasts. It would be better to mention that
many processes might be OOM-killed by page faults due to this change.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-17 11:13             ` Tetsuo Handa
@ 2015-03-17 13:15               ` Michal Hocko
  -1 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-17 13:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

On Tue 17-03-15 20:13:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sun 15-03-15 22:06:54, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > [...]
> > > > this. I understand that the wording of the changelog might be confusing,
> > > > though.
> > > > 
> > > > It says: "This implementation counts only those retries which involved
> > > > OOM killer because we do not want to be too eager to fail the request."
> > > > 
> > > > Would it be more clear if I changed that to?
> > > > "This implementation counts only those retries when the system is
> > > > considered OOM because all previous reclaim attempts have resulted
> > > > in no progress because we do not want to be too eager to fail the
> > > > request."
> > > > 
> > > > We definitely _want_ to fail GFP_NOFS allocations.
> > > 
> > > I see. The updated changelog is much more clear.
> > 
> > Patch with the updated changelog (no other changes)
> 
> Now the changelog is clear that "Involved OOM killer" == "__GFP_FS allocation"
> and "Considered OOM" == "both __GFP_FS and !__GFP_FS allocation".
> 
> One more thing I want to confirm about this patch's changelog.
> This patch will generate the same result shown below.
> 
> Tetsuo Handa wrote:
> > I also tested on XFS. One is Linux 3.19 and the other is Linux 3.19
> > with debug printk patch shown above. According to console logs,
> > oom_kill_process() is trivially called via pagefault_out_of_memory()
> > for the former kernel. Due to giving up !GFP_FS allocations immediately?
> > 
> > (From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-unpatched.txt.xz )
> > ---------- xfs / Linux 3.19 ----------
> > [  793.283099] su invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
> > [  793.283102] su cpuset=/ mems_allowed=0
> > [  793.283104] CPU: 3 PID: 9552 Comm: su Not tainted 3.19.0 #40
> > [  793.283159] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> > [  793.283161]  0000000000000000 ffff88007ac03bf8 ffffffff816ae9d4 000000000000bebe
> > [  793.283162]  ffff880078b0d740 ffff88007ac03c98 ffffffff816ac7ac 0000000000000206
> > [  793.283163]  0000000481f30298 ffff880073e55850 ffff88007ac03c88 ffff88007a20bef8
> > [  793.283164] Call Trace:
> > [  793.283169]  [<ffffffff816ae9d4>] dump_stack+0x45/0x57
> > [  793.283171]  [<ffffffff816ac7ac>] dump_header+0x7f/0x1f1
> > [  793.283174]  [<ffffffff8114b36b>] oom_kill_process+0x22b/0x390
> > [  793.283177]  [<ffffffff810776d0>] ? has_capability_noaudit+0x20/0x30
> > [  793.283178]  [<ffffffff8114bb72>] out_of_memory+0x4b2/0x500
> > [  793.283179]  [<ffffffff8114bc37>] pagefault_out_of_memory+0x77/0x90
> > [  793.283180]  [<ffffffff816aab2c>] mm_fault_error+0x67/0x140
> > [  793.283182]  [<ffffffff8105a9f6>] __do_page_fault+0x3f6/0x580
> > [  793.283185]  [<ffffffff810aed1d>] ? remove_wait_queue+0x4d/0x60
> > [  793.283186]  [<ffffffff81070fcb>] ? do_wait+0x12b/0x240
> > [  793.283187]  [<ffffffff8105abb1>] do_page_fault+0x31/0x70
> > [  793.283189]  [<ffffffff816b83e8>] page_fault+0x28/0x30
> > ---------- xfs / Linux 3.19 ----------
> 
> Are all memory allocations caused by page fault __GFP_FS allocation?

They should be GFP_HIGHUSER_MOVABLE or GFP_KERNEL. There should be no
reason to have GFP_NOFS there because the page fault doesn't come from a
fs path.

> If memory allocations caused by page fault are !__GFP_FS allocation
> (e.g. 0x2015a == __GFP_HARDWALL | __GFP_COLD | __GFP_IO | __GFP_WAIT |
> __GFP_HIGHMEM | __GFP_MOVABLE), this patch will start trivially involving
> OOM killer for !__GFP_FS allocation.
> 
> I haven't checked how many processes can be killed via this path, but it
> can potentially OOM-kill most of the OOM-killable processes, depending
> on how long the OOM condition lasts. It would be better to mention that
> many processes might be OOM-killed by page faults due to this change.

Tasks being killed inside the page fault path are nothing new. The rate
would be higher if small allocations started failing as well, but is
this worth a special mention? Other small allocations would start
failing too...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-17 10:25               ` Michal Hocko
@ 2015-03-17 13:29                 ` Johannes Weiner
  -1 siblings, 0 replies; 63+ messages in thread
From: Johannes Weiner @ 2015-03-17 13:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, akpm, david, mgorman, riel, fengguang.wu, linux-mm,
	linux-kernel

On Tue, Mar 17, 2015 at 11:25:08AM +0100, Michal Hocko wrote:
> On Mon 16-03-15 17:11:46, Johannes Weiner wrote:
> > A sysctl certainly doesn't sound appropriate to me because this is not
> > a tunable that we expect people to set according to their usecase.  We
> > expect our model to work for *everybody*.  A boot flag would be
> > marginally better but it still reeks too much of tunable.
> 
> I am OK with a boot option as well if the sysctl is considered
> inappropriate. It is less flexible though. Consider a regression testing
> where the same load is run 2 times once with failing allocations and
> once without it. Why should we force the tester to do a reboot cycle?

Because we can get rid of the Kconfig more easily once we have transitioned.

> > Maybe CONFIG_FAILABLE_SMALL_ALLOCS.  Maybe something more euphemistic.
> > But I honestly can't think of anything that wouldn't scream "horrible
> > leak of implementation details."  The user just shouldn't ever care.
> 
> Any config option basically means that distribution users will not get
> to test this until distributions change the default and this won't
> happen until the testing coverage and period was sufficient. See the
> chicken and egg problem? This is basically undermining the whole idea
> about the voluntary testing. So no, I really do not like this.

Why would anybody volunteer to test this?  What are you giving users
in exchange for potentially destabilizing their kernels?

> > Given that there are usually several stages of various testing between
> > when a commit gets merged upstream and when it finally makes it into a
> > critical production system, maybe we don't need to provide userspace
> > control over this at all?
> 
> I can still see conservative users not changing this behavior _ever_.
> Even after the rest of the world trusts the new default. They should
> have a way to disable it. Many of those are running distribution kernels
> so they really need a way to control the behavior. Be it a boot time
> option or sysctl. Historically we were using sysctl for backward
> compatibility and I do not see any reason to be different here as well.

Again, this is an implementation detail that we are trying to fix up.
It has nothing to do with userspace, it's not a heuristic.  It's bad
enough that this would be at all selectable from userspace, now you
want to make it permanently configurable?

The problem we have to solve here is finding a value that doesn't
deadlock the allocator, makes error situations stable and behave
predictably, and doesn't regress real workloads out there.

Your proposal tries to avoid immediate regressions at the cost of
keeping the deadlock potential AND fragmenting the test space, which
will make the whole situation even more fragile.  Why would you want
production systems to run code that nobody else is running anymore?
We have a functioning testing pipeline to evaluate kernel changes like
this: private tree -> subsystem tree -> next -> rc -> release ->
stable -> longterm -> vendor.  We propagate risky changes to bigger
and bigger test coverage domains and back them out once they introduce
regressions.  You are trying to bypass this mechanism in an ad-hoc way
with no plan of ever re-uniting the configuration space, but by
splitting the test base in half (or N in your original proposal) you
are setting us up for bugs reported in vendor kernels that didn't get
caught through our primary means of maturing kernel changes.

Furthermore, it makes the code's behavior harder to predict and reason
about, which makes subsequent development prone to errors and yet more
regressions.

You're trying so hard to be defensive about this that you're actually
making everybody worse off.  Prioritizing a single aspect of a change
above everything else will never lead to good solutions.  Engineering
is about making trade-offs and finding the sweet spots.

> > So what value do we choose?
> > 
> > Once we kick the OOM killer we should give the victim some time to
> > exit and then try the allocation again.  Looping just ONCE after that
> > means we scan all the LRU pages in the system a second time and invoke
> > the shrinkers another twelve times, with ratios approaching 1.  If the
> > OOM killer doesn't yield an allocatable page after this, I see very
> > little point in going on.  After all, we expect all our callers to
> > handle errors.
> 
> I am OK with the single retry. As shown with the tests the same load
> might end up with less allocation failures with the higher values but
> that is a detail. Users of !GFP_NOFAIL should be prepared for failures
> and if the failures are too excessive I agree this should be
> addressed in the page allocator.

Well yeah, allocation failures are fully expected to increase with
artificial stress tests.  It doesn't really mean anything.  All we can
do is make an educated guess and start exposing real workloads.

> > So why not just pass an "oomed" bool to should_alloc_retry() and bail
> > on small allocations at that point?  Put it upstream and deal with the
> > fallout long before this hits critical infrastructure?  By presumably
> > fixing up caller error handling and GFP flags?
> 
> This is way too risky IMO. We cannot change a long established behavior
> that quickly. I do agree we should allow failing in linux-next and
> development trees. So that it is us kernel developers to start testing
> first. Then we have zero day testing projects and Fengguang has shown an
> interest in this as well. I would also expect/hoped for some testing
> internal within major distributions. We are no way close to have this
> default behavior in the Linus tree, though.

The age of this behavior has nothing to do with how fast we trigger
bugs and fix them up.

The only problem here is the scale and the unknown impact.  We will
know the impact only by exposure to real workloads, and Andrew made a
suggestion already to keep the scale of the initial change low(er).

> That is why I've proposed 3 steps 1) voluntary testers, 2) distributions
> default 3) upstream default. Why don't you think this is a proper
> approach?

Because nobody will volunteer.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/2] Move away from non-failing small allocations
  2015-03-17  9:07     ` Michal Hocko
@ 2015-03-17 14:06       ` Tetsuo Handa
  -1 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-03-17 14:06 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: hannes, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Mon 16-03-15 15:38:43, Andrew Morton wrote:
> > Realistically, I don't think this overall effort will be successful -
> > we'll add the knob, it won't get enough testing and any attempt to
> > alter the default will be us deliberately destabilizing the kernel
> > without knowing how badly :(
> 
> Without the knob we do not allow users to test this at all though and
> the transition will _never_ happen. Which is IMHO bad.
> 

Even with the knob, very few users will test this. The likely consequence
is end users rushing to the customer support center about obscure bugs.
I'm working at a support center, and such bugs are really annoying.

> > I wonder if we can alter the behaviour only for filesystem code, so we
> > constrain the new behaviour just to that code where we're having
> > problems.  Most/all fs code goes via vfs methods so there's a reasonably
> > small set of places where we can call
> 
> We are seeing issues with the fs code now because the test cases which
> led to the current discussion exercise FS code. The code which does
> lock(); kmalloc(GFP_KERNEL) is not reduced there though. I am pretty sure
> we can find other subsystems if we try hard enough.

I'm expecting patches which avoid the deadlock caused by lock(); kmalloc(GFP_KERNEL).

> > static inline void enter_fs_code(struct super_block *sb)
> > {
> > 	if (sb->my_small_allocations_can_fail)
> > 		current->small_allocations_can_fail++;
> > }
> > 
> > that way (or something similar) we can select the behaviour on a per-fs
> > basis and the rest of the kernel remains unaffected.  Other subsystems
> > can opt in as well.
> 
> This is basically leading to GFP_MAYFAIL which is completely backwards
> (the hard requirement should be an exception not a default rule).
> I really do not want to end up with stuffing random may_fail annotations
> all over the kernel.
> 

I wish that GFP_NOFS / GFP_NOIO regions are annotated with

  static inline void enter_fs_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_fs_code++;
  #endif
  }

  static inline void leave_fs_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_fs_code--;
  #endif
  }

  static inline void enter_io_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_io_code++;
  #endif
  }

  static inline void leave_io_code(void)
  {
  #ifdef CONFIG_DEBUG_GFP_FLAGS
  	current->in_io_code--;
  #endif
  }

so that inappropriate GFP_KERNEL usage inside a GFP_NOFS region is catchable
by doing

  struct page *
  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                          struct zonelist *zonelist, nodemask_t *nodemask)
  {
  	struct zoneref *preferred_zoneref;
  	struct page *page = NULL;
  	unsigned int cpuset_mems_cookie;
  	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
  	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
  	struct alloc_context ac = {
  		.high_zoneidx = gfp_zone(gfp_mask),
  		.nodemask = nodemask,
  		.migratetype = gfpflags_to_migratetype(gfp_mask),
  	};
  	
  	gfp_mask &= gfp_allowed_mask;
 +#ifdef CONFIG_DEBUG_GFP_FLAGS
 +	WARN_ON(current->in_fs_code && (gfp_mask & __GFP_FS));
 +	WARN_ON(current->in_io_code && (gfp_mask & __GFP_IO));
 +#endif
  
  	lockdep_trace_alloc(gfp_mask);
  

It is difficult for non-fs developers to determine whether they need to use
GFP_NOFS rather than GFP_KERNEL in their code. An example is seen at
http://marc.info/?l=linux-security-module&m=138556479607024&w=2 .

Moreover, I don't know how GFP flags are managed when stacked like
"a swap file on ext4 on top of LVM (with snapshots) on a RAID array
connected over iSCSI" (quoted from comments on Jon's writeup), but I
wish that the distinction between GFP_KERNEL / GFP_NOFS / GFP_NOIO
were removed from memory-allocating function callers by doing

  static inline void enter_fs_code(void)
  {
  	current->in_fs_code++;
  }

  static inline void leave_fs_code(void)
  {
  	current->in_fs_code--;
  }

  static inline void enter_io_code(void)
  {
  	current->in_io_code++;
  }

  static inline void leave_io_code(void)
  {
  	current->in_io_code--;
  }

  struct page *
  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                          struct zonelist *zonelist, nodemask_t *nodemask)
  {
  	struct zoneref *preferred_zoneref;
  	struct page *page = NULL;
  	unsigned int cpuset_mems_cookie;
  	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
  	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
  	struct alloc_context ac = {
  		.high_zoneidx = gfp_zone(gfp_mask),
  		.nodemask = nodemask,
  		.migratetype = gfpflags_to_migratetype(gfp_mask),
  	};
  	
  	gfp_mask &= gfp_allowed_mask;
 +	if (current->in_fs_code)
 +		gfp_mask &= ~__GFP_FS;
 +	if (current->in_io_code)
 +		gfp_mask &= ~__GFP_IO;
  
  	lockdep_trace_alloc(gfp_mask);
  

so that the GFP flags passed to memory allocations involved in such
stacking are appropriately masked.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-17 13:29                 ` Johannes Weiner
@ 2015-03-17 14:17                   ` Michal Hocko
  -1 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-17 14:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, akpm, david, mgorman, riel, fengguang.wu, linux-mm,
	linux-kernel

On Tue 17-03-15 09:29:26, Johannes Weiner wrote:
> On Tue, Mar 17, 2015 at 11:25:08AM +0100, Michal Hocko wrote:
> > On Mon 16-03-15 17:11:46, Johannes Weiner wrote:
> > > A sysctl certainly doesn't sound appropriate to me because this is not
> > > a tunable that we expect people to set according to their usecase.  We
> > > expect our model to work for *everybody*.  A boot flag would be
> > > marginally better but it still reeks too much of tunable.
> > 
> > I am OK with a boot option as well if the sysctl is considered
> > inappropriate. It is less flexible though. Consider a regression testing
> > where the same load is run 2 times once with failing allocations and
> > once without it. Why should we force the tester to do a reboot cycle?
> 
> Because we can get rid of the Kconfig more easily once we transitioned.

How? We might be forced to keep the original behavior _for ever_. I do
not see any difference between a runtime, boot-time or compile-time
option, except for the flexibility, which is different for each one of
course. We can argue about which one is the most appropriate, but I feel
strongly that we cannot go and change the semantics right away.

> > > Maybe CONFIG_FAILABLE_SMALL_ALLOCS.  Maybe something more euphemistic.
> > > But I honestly can't think of anything that wouldn't scream "horrible
> > > leak of implementation details."  The user just shouldn't ever care.
> > 
> > Any config option basically means that distribution users will not get
> > to test this until distributions change the default and this won't
> > happen until the testing coverage and period was sufficient. See the
> > chicken and egg problem? This is basically undermining the whole idea
> > about the voluntary testing. So no, I really do not like this.
> 
> Why would anybody volunteer to test this?  What are you giving users
> in exchange for potentially destabilizing their kernels?

More stable kernels long term.

> > > Given that there are usually several stages of various testing between
> > > when a commit gets merged upstream and when it finally makes it into a
> > > critical production system, maybe we don't need to provide userspace
> > > control over this at all?
> > 
> > I can still see conservative users not changing this behavior _ever_.
> > Even after the rest of the world trusts the new default. They should
> > have a way to disable it. Many of those are running distribution kernels
> > so they really need a way to control the behavior. Be it a boot time
> > option or sysctl. Historically we were using sysctl for backward
> > compatibility and I do not see any reason to be different here as well.
> 
> Again, this is an implementation detail that we are trying to fix up.

This is not an implementation detail! This is a change of the
_semantics_ of the allocator. I wouldn't call that an implementation
detail.

> It has nothing to do with userspace, it's not a heuristic.  It's bad
> enough that this would be at all selectable from userspace, now you
> want to make it permanently configurable?
> 
> The problem we have to solve here is finding a value that doesn't
> deadlock the allocator, makes error situations stable and behave
> predictably, and doesn't regress real workloads out there.
> Your proposal tries to avoid immediate regressions at the cost of
> keeping the deadlock potential AND fragmenting the test space, which
> will make the whole situation even more fragile. 

While deadlocks are possible, history shows they are not really that
common, whereas unexpected allocation failures are much more risky
because they would _regress_ a previously working kernel. So I consider
this conservative approach appropriate.

> Why would you want production systems to run code that nobody else is
> running anymore?

I do not understand this.

> We have a functioning testing pipeline to evaluate kernel changes like
> this: private tree -> subsystem tree -> next -> rc -> release ->
> stable -> longterm -> vendor.

This might work for smaller changes, but not when basically the whole
kernel is affected and the potential regression space is hard to
predict and potentially very large.

> We propagate risky changes to bigger
> and bigger test coverage domains and back them out once they introduce
> regressions.

Great, so we end up reverting this in a month or two, when the first
users stumble over a bug, and we are back to square one. Excellent
plan...

> You are trying to bypass this mechanism in an ad-hoc way
> with no plan of ever re-uniting the configuration space, but by
> splitting the test base in half (or N in your original proposal) you
> are setting us up for bugs reported in vendor kernels that didn't get
> caught through our primary means of maturing kernel changes.
> 
> Furthermore, it makes the code's behavior harder to predict and reason
> about, which makes subsequent development prone to errors and yet more
> regressions.

How come? !GFP_NOFAIL allocations _have_ to check for allocation
failures regardless of the underlying allocator implementation.

> You're trying so hard to be defensive about this that you're actually
> making everybody worse off.  Prioritizing a single aspect of a change
> above everything else will never lead to good solutions.  Engineering
> is about making trade-offs and finding the sweet spots.

OK, so I am really wondering what you are proposing as an alternative.
Simply starting to fail allocations right away is hazardous and
irresponsible, and it is not going to fly because we would quickly end
up reverting the change. That will not help us change the current
non-failing semantics, which will become more and more of a PITA over
time because it pushes us into a corner, is deadlock prone, and doesn't
allow callers to define proper failure strategies.

> > > So what value do we choose?
> > > 
> > > Once we kick the OOM killer we should give the victim some time to
> > > exit and then try the allocation again.  Looping just ONCE after that
> > > means we scan all the LRU pages in the system a second time and invoke
> > > the shrinkers another twelve times, with ratios approaching 1.  If the
> > > OOM killer doesn't yield an allocatable page after this, I see very
> > > little point in going on.  After all, we expect all our callers to
> > > handle errors.
> > 
> > I am OK with the single retry. As shown by the tests, the same load
> > might end up with fewer allocation failures at higher values, but
> > that is a detail. Users of !GFP_NOFAIL should be prepared for
> > failures, and if the failures are too excessive I agree this should
> > be addressed in the page allocator.
> 
> Well yeah, allocation failures are fully expected to increase with
> artificial stress tests.  It doesn't really mean anything.  All we can
> do is make an educated guess and start exposing real workloads.
> 
> > > So why not just pass an "oomed" bool to should_alloc_retry() and bail
> > > on small allocations at that point?  Put it upstream and deal with the
> > > fallout long before this hits critical infrastructure?  By presumably
> > > fixing up caller error handling and GFP flags?
> > 
> > This is way too risky IMO. We cannot change such long-established
> > behavior that quickly. I do agree we should allow failing in
> > linux-next and development trees, so that it is us kernel developers
> > who start testing first. Then we have the zero-day testing projects,
> > and Fengguang has shown an interest in this as well. I would also
> > expect/hope for some internal testing within major distributions. We
> > are nowhere close to having this default behavior in Linus' tree,
> > though.
> 
> The age of this behavior has nothing to do with how fast we trigger
> bugs and fix them up.
> 
> The only problem here is the scale and the unknown impact.  We will
> know the impact only by exposure to real workloads, and Andrew made a
> suggestion already to keep the scale of the initial change low(er).
> 
> > That is why I've proposed 3 steps: 1) voluntary testers,
> > 2) distribution default, 3) upstream default. Why don't you think
> > this is a proper approach?
> 
> Because nobody will volunteer.

I have heard otherwise, while your claim is unfounded.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread


* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-17 14:17                   ` Michal Hocko
@ 2015-03-17 17:26                     ` Johannes Weiner
  -1 siblings, 0 replies; 63+ messages in thread
From: Johannes Weiner @ 2015-03-17 17:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, akpm, david, mgorman, riel, fengguang.wu, linux-mm,
	linux-kernel

On Tue, Mar 17, 2015 at 03:17:29PM +0100, Michal Hocko wrote:
> On Tue 17-03-15 09:29:26, Johannes Weiner wrote:
> > On Tue, Mar 17, 2015 at 11:25:08AM +0100, Michal Hocko wrote:
> > > On Mon 16-03-15 17:11:46, Johannes Weiner wrote:
> > > > A sysctl certainly doesn't sound appropriate to me because this is not
> > > > a tunable that we expect people to set according to their usecase.  We
> > > > expect our model to work for *everybody*.  A boot flag would be
> > > > marginally better but it still reeks too much of tunable.
> > > 
> > > I am OK with a boot option as well if the sysctl is considered
> > > inappropriate. It is less flexible though. Consider a regression testing
> > > where the same load is run 2 times once with failing allocations and
> > > once without it. Why should we force the tester to do a reboot cycle?
> > 
> > Because we can get rid of the Kconfig more easily once we transitioned.
> 
> How? We might be forced to keep the original behavior _forever_. I do
> not see any difference between a runtime, boot-time or compile-time
> option, except for the flexibility, which of course differs for each
> one. We can argue about which one is the most appropriate, but I feel
> strongly that we cannot go and change the semantics right away.

Sure, why not add another slab allocator while you're at it.  How many
times do we have to repeat the same mistakes?  If the old model sucks,
then it needs to be fixed or replaced.  Don't just offer another one
that sucks in different ways and ask the user to pick their poison,
with a promise that we might improve the newer model until it's
suitable to ditch the old one.

This is nothing more than us failing and giving up trying to actually
solve our problems.

> > > > Given that there are usually several stages of various testing between
> > > > when a commit gets merged upstream and when it finally makes it into a
> > > > critical production system, maybe we don't need to provide userspace
> > > > control over this at all?
> > > 
> > > I can still see conservative users not changing this behavior _ever_.
> > > Even after the rest of the world trusts the new default. They should
> > > have a way to disable it. Many of those are running distribution kernels
> > > so they really need a way to control the behavior. Be it a boot time
> > > option or sysctl. Historically we were using sysctl for backward
> > > compatibility and I do not see any reason to be different here as well.
> > 
> > Again, this is an implementation detail that we are trying to fix up.
> 
> This is not an implementation detail! This is about a change of the
> _semantics_ of the allocator. I wouldn't call that an implementation
> detail.

We can make the allocator robust through improving reclaim and the OOM
killer.  This "nr of retries" is 100% an implementation detail of this
single stupid function.

On a higher level, allowing the page allocator to return NULL is an
implementation detail of the operating system, userspace doesn't care
how the allocator and the callers communicate as long as the callers
can compensate for the allocator changing.  Involving userspace in
this decision is simply crazy talk.  They have no incentive to
partake.  MM people have to coordinate with other kernel developers to
deal with allocation NULLs without regressing userspace.  Maybe they
can fail the allocations without any problems, maybe they want to wait
for other events that they have more insight into than the allocator.
This is what Dave meant when he said that we should provide mechanism
and leave policy to the callsites.

It's 100% a kernel implementation detail that has NOTHING to do with
userspace.  Zilch.  It's about how the allocator implements the OOM
mechanism and how the allocation sites implement the OOM policy.

> > It has nothing to do with userspace, it's not a heuristic.  It's bad
> > enough that this would be at all selectable from userspace, now you
> > want to make it permanently configurable?
> > 
> > The problem we have to solve here is finding a value that doesn't
> > deadlock the allocator, makes error situations stable and behave
> > predictably, and doesn't regress real workloads out there.
> > Your proposal tries to avoid immediate regressions at the cost of
> > keeping the deadlock potential AND fragmenting the test space, which
> > will make the whole situation even more fragile. 
> 
> While deadlocks are possible, history shows they are not really that
> common, whereas unexpected allocation failures are much more risky
> because they would _regress_ a previously working kernel. So I
> consider this conservative approach appropriate.
> 
> > Why would you want production systems to run code that nobody else is
> > running anymore?
> 
> I do not understand this.

Can you please read the entire email before replying?  What I meant by
this is explained following this question.  You explicitly asked for
permanently segregating the behavior of upstream kernels from that of
critical production systems.

> > We have a functioning testing pipeline to evaluate kernel changes like
> > this: private tree -> subsystem tree -> next -> rc -> release ->
> > stable -> longterm -> vendor.
> 
> This might work for smaller changes, but not when basically the whole
> kernel is affected and the potential regression space is hard to
> predict and potentially very large.

Hence Andrew's suggestion to partition the callers and do the
transition incrementally.

> > We propagate risky changes to bigger
> > and bigger test coverage domains and back them out once they introduce
> > regressions.
> 
> Great, so we end up reverting this in a month or two, when the first
> users stumble over a bug, and we are back to square one. Excellent
> plan...

No, we're not.  We now have data on the missing pieces.  We need to
update our initial assumptions, evaluate our caller requirements,
update the way we perform reclaim, and fine-tune how we determine OOM
situations - maybe we just need some smart waits.  All this would
actually improve the kernel.

That whole "nr of retries" is stupid in the first place.  The amount
of work that is retried is completely implementation dependent and
changes all the time.  We can probably wait for much more sensible
events.  For example, if the things we do in a single loop give up
prematurely, then maybe instead of just adding more loops, we could
add a timeout-wait for the OOM victim to exit.  Change the congestion
throttling.  Whatever.  Anything is better than making the iterations
of a variable loop configurable to userspace.  But what needs to be
done depends on the way real applications actually regress.  Are
allocations failing right before the OOM victim exited and we should
have waited for it instead?  Are there in-flight writebacks we could
have waited for and we need to adjust our throttling in vmscan.c?
Because that throttling has only been tuned to save CPU cycles during
our endless reclaim, not actually to reliably make LRU reclaim trail
dirty page laundering.  There is so much room for optimizations that
would leave us with a better functioning system across the map, than
throwing braindead retrying at the problem.  But we need the data.

Those endless retry loops have masked reliability problems in the
underlying reclaim and OOM code.  We cannot address them without
exposure.  And we likely won't be needing this single magic number
once the implementation is better and a single sequence of robust
reclaim and OOM kills is enough to determine that we are thoroughly
out of memory and there is no point in retrying inside the allocator.
Whatever is left in terms of OOM policy should be the responsibility
of the caller.

> > You are trying to bypass this mechanism in an ad-hoc way
> > with no plan of ever re-uniting the configuration space, but by
> > splitting the test base in half (or N in your original proposal) you
> > are setting us up for bugs reported in vendor kernels that didn't get
> > caught through our primary means of maturing kernel changes.
> > 
> > Furthermore, it makes the code's behavior harder to predict and reason
> > about, which makes subsequent development prone to errors and yet more
> > regressions.
> 
> How come? !GFP_NOFAIL allocations _have_ to check for allocation
> failures regardless of the underlying allocator implementation.

Can you please think a bit longer about these emails before replying?

If you split the configuration space into kernels that endlessly retry
and those that do not, you can introduce new deadlocks to the nofail
kernels which don't get caught in the canfail kernels.  If you weaken
the code that executes in each loop, you can regress robustness in the
canfail kernels which is not caught in the nofail kernels.  You'll hit
deadlocks in the production environments that were not existent in the
canfail testing setups, and experience from production environments
won't translate to upstream fixes very well.

> > You're trying so hard to be defensive about this that you're actually
> > making everybody worse off.  Prioritizing a single aspect of a change
> > above everything else will never lead to good solutions.  Engineering
> > is about making trade-offs and finding the sweet spots.
> 
> OK, so I am really wondering what you are proposing as an alternative.
> Simply starting to fail allocations right away is hazardous and
> irresponsible, and it is not going to fly because we would quickly end
> up reverting the change. That will not help us change the current
> non-failing semantics, which will become more and more of a PITA over
> time because it pushes us into a corner, is deadlock prone, and
> doesn't allow callers to define proper failure strategies.

Maybe run a smarter test than an artificial stress load for starters,
and see if this actually matters for an array of more realistic mmtests
and/or filesystem tests.  And then analyse those failures instead of
blindly bumping the nr_retries knob.

And I agree with Andrew that we could probably be selective INSIDE THE
KERNEL about which callers are taking the plunge.  The only reason to
be careful with this change is the scale, it has nothing to do with
long-standing behavior.  That's just handwaving.  Make it opt-in on a
kernel code level, not on a userspace level, so that we have those
responsible for the callsite code be aware of this change and can
think of the consequences up front.  Let XFS people think about
failing small allocations in their context: which of those are allowed
to propagate to userspace and which aren't?  If we regress userspace
because allocation failures leak by accident, we know the caller needs
to be fixed.  If we regress a workload by failing failable allocations
earlier than before, we know that the page allocator should try
harder/smarter.  This is the advantage of having an actual model: you
can figure out who is violating it and fix the problem where it occurs
instead of papering it over.

Then let ext4 people think about it and ease them into it.  Let them
know what is coming and what they should be prepared for, and then we
can work with them in fixing up any issues.  Once the big ticket items
are done we can flip the rest and deal with that fallout separately.

There is an existing path to make and evaluate such changes and you
haven't made a case why we should deviate from that.  We didn't ask
users to choose between fine-grained locking or the big kernel lock,
either, did we?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
@ 2015-03-17 17:26                     ` Johannes Weiner
  0 siblings, 0 replies; 63+ messages in thread
From: Johannes Weiner @ 2015-03-17 17:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, akpm, david, mgorman, riel, fengguang.wu, linux-mm,
	linux-kernel

On Tue, Mar 17, 2015 at 03:17:29PM +0100, Michal Hocko wrote:
> On Tue 17-03-15 09:29:26, Johannes Weiner wrote:
> > On Tue, Mar 17, 2015 at 11:25:08AM +0100, Michal Hocko wrote:
> > > On Mon 16-03-15 17:11:46, Johannes Weiner wrote:
> > > > A sysctl certainly doesn't sound appropriate to me because this is not
> > > > a tunable that we expect people to set according to their usecase.  We
> > > > expect our model to work for *everybody*.  A boot flag would be
> > > > marginally better but it still reeks too much of tunable.
> > > 
> > > I am OK with a boot option as well if the sysctl is considered
> > > inappropriate. It is less flexible though. Consider a regression testing
> > > where the same load is run 2 times once with failing allocations and
> > > once without it. Why should we force the tester to do a reboot cycle?
> > 
> > Because we can get rid of the Kconfig more easily once we transitioned.
> 
> How? We might be forced to keep the original behavior _for ever_. I do
> not see any difference between runtime, boottime or compiletime option.
> Except for the flexibility which is different for each one of course. We
> can argue about which one is the most appropriate of course but I feel
> strongly we cannot go and change the semantic right away.

Sure, why not add another slab allocator while you're at it.  How many
times do we have to repeat the same mistakes?  If the old model sucks,
then it needs to be fixed or replaced.  Don't just offer another one
that sucks in different ways and ask the user to pick their poison,
with a promise that we might improve the newer model until it's
suitable to ditch the old one.

This is nothing more than us failing and giving up trying to actually
solve our problems.

> > > > Given that there are usually several stages of various testing between
> > > > when a commit gets merged upstream and when it finally makes it into a
> > > > critical production system, maybe we don't need to provide userspace
> > > > control over this at all?
> > > 
> > > I can still see conservative users not changing this behavior _ever_.
> > > Even after the rest of the world trusts the new default. They should
> > > have a way to disable it. Many of those are running distribution kernels
> > > so they really need a way to control the behavior. Be it a boot time
> > > option or sysctl. Historically we were using sysctl for backward
> > > compatibility and I do not see any reason to be different here as well.
> > 
> > Again, this is an implementation detail that we are trying to fix up.
> 
> This is not an implementation detail! This is about change of the
> _semantic_ of the allocator. I wouldn't call it an implementation
> detail.

We can make the allocator robust through improving reclaim and the OOM
killer.  This "nr of retries" is 100% an implementation detail of this
single stupid function.

On a higher level, allowing the page allocator to return NULL is an
implementation detail of the operating system, userspace doesn't care
how the allocator and the callers communicate as long as the callers
can compensate for the allocator changing.  Involving userspace in
this decision is simply crazy talk.  They have no incentive to
partake.  MM people have to coordinate with other kernel developers to
deal with allocation NULLs without regressing userspace.  Maybe they
can fail the allocations without any problems, maybe they want to wait
for other events that they have more insight into than the allocator.
This is what Dave meant when he said that we should provide mechanism
and leave policy to the callsites.

It's 100% a kernel implementation detail that has NOTHING to do with
userspace.  Zilch.  It's about how the allocator implements the OOM
mechanism and how the allocation sites implement the OOM policy.

> > It has nothing to do with userspace, it's not a heuristic.  It's bad
> > enough that this would be at all selectable from userspace, now you
> > want to make it permanently configurable?
> > 
> > The problem we have to solve here is finding a value that doesn't
> > deadlock the allocator, makes error situations stable and behave
> > predictably, and doesn't regress real workloads out there.
> > Your proposal tries to avoid immediate regressions at the cost of
> > keeping the deadlock potential AND fragmenting the test space, which
> > will make the whole situation even more fragile. 
> 
> While the deadlocks are possible the history shows they are not really
> that common. While unexpected allocation failures are much more risky
> because they would _regress_ previously working kernel. So I see this
> conservative approach appropriate.
> 
> > Why would you want production systems to run code that nobody else is
> > running anymore?
> 
> I do not understand this.

Can you please read the entire email before replying?  What I meant by
this is explained following this question.  You explicitely asked for
permanently segregating the behavior of upstream kernels from that of
critical production systems.

> > We have a functioning testing pipeline to evaluate kernel changes like
> > this: private tree -> subsystem tree -> next -> rc -> release ->
> > stable -> longterm -> vendor.
> 
> This might work for smaller changes not when basically the whole kernel
> is affected and the potential regression space is hard to predict and
> potentially very large.

Hence Andrew's suggestion to partition the callers and do the
transition incrementally.

> > We propagate risky changes to bigger
> > and bigger test coverage domains and back them out once they introduce
> > regressions.
> 
> Great so we end up reverting this in a month or two when the first users
> stumble over a bug and we are back to square one. Excellent plan...

No, we're not.  We now have data on the missing pieces.  We need to
update our initial assumptions, re-evaluate caller requirements,
update the way we perform reclaim, and fine-tune how we determine OOM
situations - maybe we just need some smart waits.  All of this would
actually improve the kernel.

That whole "nr of retries" is stupid in the first place.  The amount
of work that is retried is completely implementation dependent and
changes all the time.  We can probably wait for much more sensible
events.  For example, if the things we do in a single loop give up
prematurely, then maybe instead of just adding more loops, we could
add a timeout-wait for the OOM victim to exit.  Change the congestion
throttling.  Whatever.  Anything is better than making the iterations
of a variable loop configurable to userspace.  But what needs to be
done depends on the way real applications actually regress.  Are
allocations failing right before the OOM victim exited and we should
have waited for it instead?  Are there in-flight writebacks we could
have waited for and we need to adjust our throttling in vmscan.c?
Because that throttling has only been tuned to save CPU cycles during
our endless reclaim, not actually to reliably make LRU reclaim trail
dirty page laundering.  There is so much room for optimizations that
would leave us with a better functioning system across the map, than
throwing braindead retrying at the problem.  But we need the data.

Those endless retry loops have masked reliability problems in the
underlying reclaim and OOM code.  We can not address them without
exposure.  And we likely won't be needing this single magic number
once the implementation is better and a single sequence of robust
reclaim and OOM kills is enough to determine that we are thoroughly
out of memory and there is no point in retrying inside the allocator.
Whatever is left in terms of OOM policy should be the responsibility
of the caller.

> > You are trying to bypass this mechanism in an ad-hoc way
> > with no plan of ever re-uniting the configuration space, but by
> > splitting the test base in half (or N in your original proposal) you
> > are setting us up for bugs reported in vendor kernels that didn't get
> > caught through our primary means of maturing kernel changes.
> > 
> > Furthermore, it makes the code's behavior harder to predict and reason
> > about, which makes subsequent development prone to errors and yet more
> > regressions.
> 
> How come? !GFP_NOFAIL allocations _have_ to check for allocation
> failures regardless the underlying allocator implementation.

Can you please think a bit longer about these emails before replying?

If you split the configuration space into kernels that endlessly retry
and those that do not, you can introduce new deadlocks to the nofail
kernels which don't get caught in the canfail kernels.  If you weaken
the code that executes in each loop, you can regress robustness in the
canfail kernels which is not caught in the nofail kernels.  You'll hit
deadlocks in the production environments that were not existent in the
canfail testing setups, and experience from production environments
won't translate to upstream fixes very well.

> > You're trying so hard to be defensive about this that you're actually
> > making everybody worse off.  Prioritizing a single aspect of a change
> > above everything else will never lead to good solutions.  Engineering
> > is about making trade-offs and finding the sweet spots.
> 
> OK, so I am really wondering what you are proposing as an alternative.
> Simply start failing allocations right away is hazardous and
> irresponsible and not going to fly because we would quickly end up
> reverting the change. Which will not help us to change the current
> non-failing semantic which will be more and more PITA over the time
> because it pushes us into the corner, it is deadlock prone and doesn't
> allow callers to define proper fail strategies.

Maybe run a smarter test than an artificial stress for starters, see
if this actually matters for an array of more realistic mmtests and/or
filesystem tests.  And then analyse those failures instead of bumping
the nr_retries knob blindly.

And I agree with Andrew that we could probably be selective INSIDE THE
KERNEL about which callers are taking the plunge.  The only reason to
be careful with this change is its scale; it has nothing to do with
long-standing behavior.  That's just handwaving.  Make it opt-in at the
kernel code level, not the userspace level, so that those responsible
for the callsite code are aware of this change and can think through
the consequences up front.  Let XFS people think about
failing small allocations in their context: which of those are allowed
to propagate to userspace and which aren't?  If we regress userspace
because allocation failures leak by accident, we know the caller needs
to be fixed.  If we regress a workload by failing failable allocations
earlier than before, we know that the page allocator should try
harder/smarter.  This is the advantage of having an actual model: you
can figure out who is violating it and fix the problem where it occurs
instead of papering it over.

Then let ext4 people think about it and ease them into it.  Let them
know what is coming and what they should be prepared for, and then we
can work with them in fixing up any issues.  Once the big ticket items
are done we can flip the rest and deal with that fallout separately.

There is an existing path to make and evaluate such changes and you
haven't made a case why we should deviate from that.  We didn't ask
users to choose between fine-grained locking or the big kernel lock,
either, did we?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-17 17:26                     ` Johannes Weiner
@ 2015-03-17 19:41                       ` Michal Hocko
  -1 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-17 19:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, akpm, david, mgorman, riel, fengguang.wu, linux-mm,
	linux-kernel

On Tue 17-03-15 13:26:28, Johannes Weiner wrote:
> On Tue, Mar 17, 2015 at 03:17:29PM +0100, Michal Hocko wrote:
> > On Tue 17-03-15 09:29:26, Johannes Weiner wrote:
> > > On Tue, Mar 17, 2015 at 11:25:08AM +0100, Michal Hocko wrote:
> > > > On Mon 16-03-15 17:11:46, Johannes Weiner wrote:
> > > > > A sysctl certainly doesn't sound appropriate to me because this is not
> > > > > a tunable that we expect people to set according to their usecase.  We
> > > > > expect our model to work for *everybody*.  A boot flag would be
> > > > > marginally better but it still reeks too much of tunable.
> > > > 
> > > > I am OK with a boot option as well if the sysctl is considered
> > > > inappropriate. It is less flexible though. Consider a regression testing
> > > > where the same load is run 2 times once with failing allocations and
> > > > once without it. Why should we force the tester to do a reboot cycle?
> > > 
> > > Because we can get rid of the Kconfig more easily once we transitioned.
> > 
> > How? We might be forced to keep the original behavior _for ever_. I do
> > not see any difference between runtime, boottime or compiletime option.
> > Except for the flexibility which is different for each one of course. We
> > can argue about which one is the most appropriate of course but I feel
> > strongly we cannot go and change the semantic right away.
> 
> Sure, why not add another slab allocator while you're at it.  How many
> times do we have to repeat the same mistakes?  If the old model sucks,
> then it needs to be fixed or replaced.  Don't just offer another one
> that sucks in different ways and ask the user to pick their poison,
> with a promise that we might improve the newer model until it's
> suitable to ditch the old one.
> 
> This is nothing more than us failing and giving up trying to actually
> solve our problems.

I probably failed to communicate the primary intention here. The point
of the knob is _not_ to move the responsibility to userspace, although
I would agree that the knob as proposed might look like that, and that
is my fault.

The primary motivation is to actually help _solve_ our long-standing
problem. The default policy of non-failing allocations is simply wrong
and we should move away from it. We have a way to _explicitly_ request
such behavior. Are we in agreement on this part?

The problem, as I see it, is that such a change cannot be pushed to
Linus' tree without extensive testing, because there are thousands of
code paths which have never been exercised. We have basically two
options here. Either carry a non-upstream patch (e.g. sitting in mmotm
and linux-next) and have developers do their testing. This will surely
help to catch a lot of the fallout and fix it right away, but we will
miss those who run Linus-based trees and would be willing to help test
with workloads we never dreamed of.
The other option is to push the experimental code to the Linus tree
(and distribution kernels) and allow people to turn it on to help with
testing.

I am not ignoring the rest of the email, I just want to make sure we are
on the same page before we go into a potentially lengthy discussion just
to find out we are talking past each other.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-17 19:41                       ` Michal Hocko
@ 2015-03-18  9:10                         ` Vlastimil Babka
  -1 siblings, 0 replies; 63+ messages in thread
From: Vlastimil Babka @ 2015-03-18  9:10 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: Tetsuo Handa, akpm, david, mgorman, riel, fengguang.wu, linux-mm,
	linux-kernel

On 03/17/2015 08:41 PM, Michal Hocko wrote:
> On Tue 17-03-15 13:26:28, Johannes Weiner wrote:
>> On Tue, Mar 17, 2015 at 03:17:29PM +0100, Michal Hocko wrote:
>>> On Tue 17-03-15 09:29:26, Johannes Weiner wrote:
>>>> On Tue, Mar 17, 2015 at 11:25:08AM +0100, Michal Hocko wrote:
>>>>> On Mon 16-03-15 17:11:46, Johannes Weiner wrote:
>>>>>> A sysctl certainly doesn't sound appropriate to me because this is not
>>>>>> a tunable that we expect people to set according to their usecase.  We
>>>>>> expect our model to work for *everybody*.  A boot flag would be
>>>>>> marginally better but it still reeks too much of tunable.
>>>>>
>>>>> I am OK with a boot option as well if the sysctl is considered
>>>>> inappropriate. It is less flexible though. Consider a regression testing
>>>>> where the same load is run 2 times once with failing allocations and
>>>>> once without it. Why should we force the tester to do a reboot cycle?
>>>>
>>>> Because we can get rid of the Kconfig more easily once we transitioned.
>>>
>>> How? We might be forced to keep the original behavior _for ever_. I do
>>> not see any difference between runtime, boottime or compiletime option.
>>> Except for the flexibility which is different for each one of course. We
>>> can argue about which one is the most appropriate of course but I feel
>>> strongly we cannot go and change the semantic right away.
>>
>> Sure, why not add another slab allocator while you're at it.  How many
>> times do we have to repeat the same mistakes?  If the old model sucks,
>> then it needs to be fixed or replaced.  Don't just offer another one
>> that sucks in different ways and ask the user to pick their poison,
>> with a promise that we might improve the newer model until it's
>> suitable to ditch the old one.
>>
>> This is nothing more than us failing and giving up trying to actually
>> solve our problems.
>
> I probably fail to communicate the primary intention here. The point
> of the knob is _not_ to move the responsibility to userspace. Although
> I would agree that the knob as proposed might look like that and that is
> my fault.
>
> The primary motivation is to actually help _solving_ our long standing
> problem. Default non-failing allocations policy is simply wrong and we
> should move away from it. We have a way to _explicitly_ request such a
> behavior. Are we in agreement on this part?
>
> The problem, as I see it, is that such a change cannot be pushed to
> Linus tree without extensive testing because there are thousands of code
> paths which never got exercised. We have basically two options here.
> Either have a non-upstream patch (e.g. sitting in mmotm and linux-next)
> and have developers to do their testing. This will surely help to
> catch a lot of fallouts and fix them right away. But we will miss those
> who are using Linus based trees and would be willing to help to test
> in their loads which we never dreamed of.
> The other option would be pushing an experimental code to the Linus
> tree (and distribution kernels) and allow people to turn it on to help
> testing.
>
> I am not ignoring the rest of the email, I just want to make sure we are
> on the same page before we go into a potentially lengthy discussion just
> to find out we are talking past each other.
>
> [...]

After reading this discussion, my impression is: as I understand your 
motivation, the knob is supposed to expose code that has broken handling 
of small allocation failures, because the handling was never exercised 
(and thus found to be broken) before. The steps you are proposing are to 
allow this to be tested by those who understand that it might break 
their machines, until those broken allocation sites are either fixed or 
converted to __GFP_NOFAIL. We want the change of implicit nofail 
behavior to happen, as then we limit the potential deadlocks to 
explicitly annotated allocation sites, which simplifies efforts to 
prevent the deadlocks (e.g. with reserves).

AFAIU, Johannes is worried that the knob adds the possibility that 
allocations will fail prematurely, even though further retrying would 
allow them to succeed without introducing a deadlock. The probability 
of this is hard to predict even inside MM, yet we assume that userspace 
will set the value. This might discourage some of the volunteers who 
would be willing to test the new behavior, since they could see extra 
spurious failures. He would like to see this be as reliable as 
possible, failing an allocation only when it is absolutely certain that 
nothing else can be done, and not depend on a magically set value from 
userspace. He also believes that we can still improve on the "what can 
be done" part.

I'll add that I think if we do improve reclaim etc. and make 
allocation failures rarer, then the whole testing effort will have a 
much lower chance of finding the places where allocation failures are 
not handled properly. Also, Michal says that catching those depends on 
running all "their loads which we never dreamed of". In that case, if 
our goal is to fix all broken allocation sites with some quantifiable 
probability, I'm afraid we might really be better off with some form of 
fault injection, which will trigger the failures with whatever 
probability we set, and not depend on corner-case low memory conditions
manifesting just at the time the workload is at one of the broken
allocation sites.
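Fault injection of this kind already exists in the kernel as
CONFIG_FAIL_PAGE_ALLOC. A rough sketch of how a tester might drive it
(knob names per Documentation/fault-injection/fault-injection.txt; the
values chosen here are illustrative, not recommendations, and debugfs
is assumed to be mounted at /sys/kernel/debug):

```shell
# Sketch: inject failures into small, sleeping page allocations via the
# kernel's existing fault-injection framework.  Requires a kernel built
# with CONFIG_FAULT_INJECTION_DEBUG_FS and CONFIG_FAIL_PAGE_ALLOC;
# run as root.
F=/sys/kernel/debug/fail_page_alloc

echo N  > $F/ignore-gfp-wait    # also inject into __GFP_WAIT (sleeping) allocations
echo N  > $F/ignore-gfp-highmem # and into highmem-capable ones
echo 0  > $F/min-order          # target order-0 (small) allocations
echo 1  > $F/probability        # fail ~1% of candidate allocations
echo 1  > $F/interval
echo -1 > $F/times              # no limit on the number of failures
echo 1  > $F/verbose            # log a stack trace for each injection
```

This triggers allocation failures with a chosen probability regardless
of the actual memory situation, which is exactly the property Vlastimil
describes.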

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-17 13:15               ` Michal Hocko
@ 2015-03-18 11:33                 ` Tetsuo Handa
  -1 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-03-18 11:33 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, hannes, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

Michal Hocko wrote:
> > Tetsuo Handa wrote:
> > > I also tested on XFS. One is Linux 3.19 and the other is Linux 3.19
> > > with debug printk patch shown above. According to console logs,
> > > oom_kill_process() is trivially called via pagefault_out_of_memory()
> > > for the former kernel. Due to giving up !GFP_FS allocations immediately?
> > >
> > > (From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-unpatched.txt.xz )
> > > ---------- xfs / Linux 3.19 ----------
> > > [  793.283099] su invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
> > > [  793.283102] su cpuset=/ mems_allowed=0
> > > [  793.283104] CPU: 3 PID: 9552 Comm: su Not tainted 3.19.0 #40
> > > [  793.283159] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> > > [  793.283161]  0000000000000000 ffff88007ac03bf8 ffffffff816ae9d4 000000000000bebe
> > > [  793.283162]  ffff880078b0d740 ffff88007ac03c98 ffffffff816ac7ac 0000000000000206
> > > [  793.283163]  0000000481f30298 ffff880073e55850 ffff88007ac03c88 ffff88007a20bef8
> > > [  793.283164] Call Trace:
> > > [  793.283169]  [<ffffffff816ae9d4>] dump_stack+0x45/0x57
> > > [  793.283171]  [<ffffffff816ac7ac>] dump_header+0x7f/0x1f1
> > > [  793.283174]  [<ffffffff8114b36b>] oom_kill_process+0x22b/0x390
> > > [  793.283177]  [<ffffffff810776d0>] ? has_capability_noaudit+0x20/0x30
> > > [  793.283178]  [<ffffffff8114bb72>] out_of_memory+0x4b2/0x500
> > > [  793.283179]  [<ffffffff8114bc37>] pagefault_out_of_memory+0x77/0x90
> > > [  793.283180]  [<ffffffff816aab2c>] mm_fault_error+0x67/0x140
> > > [  793.283182]  [<ffffffff8105a9f6>] __do_page_fault+0x3f6/0x580
> > > [  793.283185]  [<ffffffff810aed1d>] ? remove_wait_queue+0x4d/0x60
> > > [  793.283186]  [<ffffffff81070fcb>] ? do_wait+0x12b/0x240
> > > [  793.283187]  [<ffffffff8105abb1>] do_page_fault+0x31/0x70
> > > [  793.283189]  [<ffffffff816b83e8>] page_fault+0x28/0x30
> > > ---------- xfs / Linux 3.19 ----------
> >
> > Are all memory allocations caused by page fault __GFP_FS allocation?
> 
> They should be GFP_HIGHUSER_MOVABLE or GFP_KERNEL. There should be no
> reason to have GFP_NOFS there because the page fault doesn't come from a
> fs path.

Excuse me, but are you sure? I am seeing 0x2015a (i.e. !__GFP_FS) allocation
failures from page faults. SystemTap also reports that 0x2015a is used from
the page fault path.

----------
[root@localhost ~]# stap -p4 -d xfs -m pagefault -g -DSTP_NO_OVERLOAD -e '
global traces_bt[65536];
probe begin { printf("Probe start!\n"); }
probe kernel.function("__alloc_pages_nodemask") {
  if ($gfp_mask == 0x2015a && execname() != "stapio") {
    bt = backtrace();
    if (traces_bt[bt]++ == 0) {
      printf("%s (%u) order:%u gfp:0x%x\n", execname(), tid(), $order, $gfp_mask);
      print_stack(bt);
      printf("\n\n");
    }
  }
}
probe end { delete traces_bt; }'
pagefault.ko
[root@localhost ~]# staprun pagefault.ko
Probe start!
rsyslogd (1852) order:0 gfp:0x2015a
 0xffffffff81130030 : __alloc_pages_nodemask+0x0/0x9a0 [kernel]
 0xffffffff81170d87 : alloc_pages_current+0xa7/0x170 [kernel]
 0xffffffff81126d07 : __page_cache_alloc+0xb7/0xd0 [kernel]
 0xffffffff811287a5 : filemap_fault+0x1b5/0x440 [kernel]
 0xffffffff811502ff : __do_fault+0x3f/0xc0 [kernel]
 0xffffffff811518e1 : handle_mm_fault+0x5e1/0x13b0 [kernel]
 0xffffffff810463ef : __do_page_fault+0x18f/0x430 [kernel]
 0xffffffff8104676c : do_page_fault+0xc/0x10 [kernel]
 0xffffffff814d67a2 : page_fault+0x22/0x30 [kernel]
----------
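For reference, mode 0x2015a can be decoded against the gfp flag bit
values of that era (taken from include/linux/gfp.h around 3.19/4.0; a
quick illustrative decoder, not kernel code - the note about XFS
clearing __GFP_FS in the mapping's gfp mask is my reading of
xfs_setup_inode()):

```python
# Decode a gfp_mask against the 3.19-era ___GFP_* bit values.
GFP_BITS = {
    0x01: "__GFP_DMA",           0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",         0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",          0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",            0x80: "__GFP_FS",
    0x100: "__GFP_COLD",         0x200: "__GFP_NOWARN",
    0x400: "__GFP_REPEAT",       0x800: "__GFP_NOFAIL",
    0x1000: "__GFP_NORETRY",     0x2000: "__GFP_MEMALLOC",
    0x4000: "__GFP_COMP",        0x8000: "__GFP_ZERO",
    0x10000: "__GFP_NOMEMALLOC", 0x20000: "__GFP_HARDWALL",
    0x40000: "__GFP_THISNODE",   0x80000: "__GFP_RECLAIMABLE",
    0x200000: "__GFP_NOTRACK",
}

def decode(mask):
    """Return the names of all flag bits set in mask."""
    return [name for bit, name in sorted(GFP_BITS.items()) if mask & bit]

flags = decode(0x2015a)
print(flags)
# __GFP_FS (0x80) is not among them: this page-fault allocation really
# is !__GFP_FS, since the filesystem masked __GFP_FS out of the
# mapping's gfp mask.
print("__GFP_FS" in flags)  # False
```

So the mask contains __GFP_WAIT and __GFP_IO but neither __GFP_FS nor
__GFP_HARDWALL, which is why these page-fault allocations behave like
GFP_NOFS rather than GFP_HIGHUSER_MOVABLE.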

So, your patch introduces a trigger to involve the OOM killer for !__GFP_FS
allocations. I myself think that we should trigger the OOM killer for
!__GFP_FS allocations in order to make forward progress in case the OOM
victim is blocked. What is the reason we did not involve the OOM killer for
!__GFP_FS allocations?

Below is an example from http://I-love.SAKURA.ne.jp/tmp/serial-20150318.txt.xz
which is Linux 4.0-rc4 + your patch applied, with sysctl_nr_alloc_retry == 1,
which has fallen into the infinite "XFS: possible memory allocation deadlock in
xfs_buf_allocate_memory (mode:0x250)" retry trap (an OOM deadlock) while
running the multiple memory-stressing processes described at
http://www.spinics.net/lists/linux-ext4/msg47216.html .

----------
[  584.766247] Out of memory: Kill process 27800 (a.out) score 17 or sacrifice child
[  584.766248] Killed process 27800 (a.out) total-vm:69516kB, anon-rss:33236kB, file-rss:4kB
(...snipped...)
[  587.097942] XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
(...snipped...)
[  891.677310] a.out           D ffff880069c3fb78     0 27800      1 0x00100084
[  891.679239]  ffff880069c3fb78 ffff880057b2f570 ffff88007cfaf3b0 0000000000000000
[  891.681368]  ffff88007fffdb08 0000000000000000 ffff880069c3c010 ffff88007cfaf3b0
[  891.683519]  ffff88007bde5dc4 00000000ffffffff ffff88007bde5dc8 ffff880069c3fb98
[  891.685654] Call Trace:
[  891.686350]  [<ffffffff814d1aee>] schedule+0x3e/0x90
[  891.687645]  [<ffffffff814d1d0e>] schedule_preempt_disabled+0xe/0x10
[  891.689289]  [<ffffffff814d2c42>] __mutex_lock_slowpath+0x92/0x100
[  891.690898]  [<ffffffff81190c16>] ? unlazy_walk+0xe6/0x150
[  891.692333]  [<ffffffff814d2cd3>] mutex_lock+0x23/0x40
[  891.693671]  [<ffffffff8119145d>] lookup_slow+0x3d/0xc0
[  891.695036]  [<ffffffff811946c5>] link_path_walk+0x375/0x910
[  891.696523]  [<ffffffff81194d28>] path_init+0xc8/0x460
[  891.697864]  [<ffffffff811970c2>] path_openat+0x72/0x680
[  891.699280]  [<ffffffff81177f72>] ? fallback_alloc+0x192/0x200
[  891.700852]  [<ffffffff811771d8>] ? kmem_getpages+0x58/0x110
[  891.702334]  [<ffffffff8119771a>] do_filp_open+0x4a/0xa0
[  891.703769]  [<ffffffff811a382d>] ? __alloc_fd+0xcd/0x140
[  891.705200]  [<ffffffff81183d45>] do_sys_open+0x145/0x240
[  891.706650]  [<ffffffff81183e7e>] SyS_open+0x1e/0x20
[  891.707976]  [<ffffffff814d4d32>] system_call_fastpath+0x12/0x17
(...snipped...)
[  899.777423] init: page allocation failure: order:0, mode:0x2015a
[  899.777424] CPU: 2 PID: 1 Comm: init Tainted: G            E   4.0.0-rc4+ #13
[  899.777425] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  899.777426]  0000000000000000 ffff88007d07ba98 ffffffff814d0ee5 0000000000000001
[  899.777426]  000000000002015a ffff88007d07bb28 ffffffff8112f2ba ffff88007fffdb28
[  899.777427]  ffff88007d07bab8 0000000000000020 000000000002015a 0000000000000000
[  899.777428] Call Trace:
[  899.777430]  [<ffffffff814d0ee5>] dump_stack+0x48/0x5b
[  899.777431]  [<ffffffff8112f2ba>] warn_alloc_failed+0xea/0x130
[  899.777432]  [<ffffffff81130699>] __alloc_pages_nodemask+0x669/0x9a0
[  899.777434]  [<ffffffff81170d87>] alloc_pages_current+0xa7/0x170
[  899.777435]  [<ffffffff81126d07>] __page_cache_alloc+0xb7/0xd0
[  899.777436]  [<ffffffff811287a5>] filemap_fault+0x1b5/0x440
[  899.777437]  [<ffffffff811502ff>] __do_fault+0x3f/0xc0
[  899.777438]  [<ffffffff811518e1>] handle_mm_fault+0x5e1/0x13b0
[  899.777441]  [<ffffffff8108098a>] ? set_next_entity+0x2a/0x60
[  899.777442]  [<ffffffff810463ef>] __do_page_fault+0x18f/0x430
[  899.777443]  [<ffffffff8104676c>] do_page_fault+0xc/0x10
[  899.777445]  [<ffffffff814d67a2>] page_fault+0x22/0x30
(...snipped...)
[ 1013.096701] XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
----------

We have a mutex_lock() which prevents the effectively-GFP_NOFAIL allocation
in xfs_buf_allocate_memory() from making forward progress while the OOM
victim is blocked on that mutex. As long as there are GFP_NOFAIL users,
we need some heuristic mechanism for detecting such stalls.

While your patch seems to shorten the duration of !__GFP_FS allocations,
I don't see the I/O layer making forward progress, because the system
stalls as if it were retrying !__GFP_FS allocations forever rather than
returning an I/O error to the caller. Maybe something in the I/O layer is
stalling due to use of the same watermark threshold for GFP_NOIO /
GFP_NOFS / GFP_KERNEL allocations, though I didn't check the details...
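
For readers decoding the two masks in the logs above: assuming the gfp flag
bit values from a 3.19/4.0-era include/linux/gfp.h (the bit assignments were
renumbered in later kernels), mode:0x250 is GFP_NOFS | __GFP_NOWARN and
mode:0x2015a also lacks __GFP_FS. A quick sanity check:

```python
# Decode the gfp_mask values seen in the logs above.
# Flag bit values are assumed from a 3.19/4.0-era include/linux/gfp.h;
# later kernels renumbered these bits.
GFP_FLAGS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x100: "__GFP_COLD",
    0x200: "__GFP_NOWARN",
    0x20000: "__GFP_HARDWALL",
}

def decode(mask):
    """Return the set of known flag names present in mask."""
    return {name for bit, name in GFP_FLAGS.items() if mask & bit}

# mode:0x250 from xfs_buf_allocate_memory: GFP_NOFS | __GFP_NOWARN,
# i.e. __GFP_WAIT | __GFP_IO | __GFP_NOWARN, no __GFP_FS.
print(decode(0x250))
# mode:0x2015a from the filemap_fault failure: also lacks __GFP_FS.
print(decode(0x2015a))
```

Neither mask contains __GFP_FS, which is why neither allocation is allowed
to trigger the OOM killer on its own.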

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-17 19:41                       ` Michal Hocko
@ 2015-03-18 11:35                         ` Tetsuo Handa
  -1 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-03-18 11:35 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: akpm, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

I'm not opposed to having fundamental solutions. But since, as you know,
the fundamental solution will take many years to complete, I'm asking for
an interim workaround which we can use now.

Michal Hocko wrote:
> The problem, as I see it, is that such a change cannot be pushed to
> Linus tree without extensive testing because there are thousands of code
> paths which never got exercised. We have basically two options here.

Your options are based on your proposal.
We can have different options based on Johannes's and my proposal.

> Either have a non-upstream patch (e.g. sitting in mmotm and linux-next)
> and have developers to do their testing. This will surely help to
> catch a lot of fallouts and fix them right away. But we will miss those
> who are using Linus based trees and would be willing to help to test
> in their loads which we never dreamed of.
> The other option would be pushing an experimental code to the Linus
> tree (and distribution kernels) and allow people to turn it on to help
> testing.

The third option is to purge the majority of the code paths which never get
exercised, by replacing kmalloc() with kmalloc_nofail() where the
requested size is known to be <= PAGE_SIZE bytes.
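
To be clear, kmalloc_nofail() is a proposed helper, not an existing kernel
API; semantically it would behave like kmalloc(size, gfp | __GFP_NOFAIL),
i.e. retry until the allocation succeeds instead of ever returning NULL.
A minimal userspace model of that semantic:

```python
# Model of the proposed (hypothetical) kmalloc_nofail() semantic:
# retry the allocation until it succeeds instead of returning
# NULL/None to the caller. 'alloc' stands in for a kmalloc() that
# may fail; this illustrates the semantics, it is not kernel code.
def kmalloc_nofail(alloc, *args):
    while True:
        obj = alloc(*args)
        if obj is not None:
            return obj
        # The real kernel would reclaim memory, and possibly invoke
        # the OOM killer, before retrying here.

# Example: an allocator that fails twice before succeeding.
attempts = iter([None, None, "page"])
result = kmalloc_nofail(lambda: next(attempts))
print(result)  # prints: page
```

The point of the conversion is that every caller of such a helper no
longer has an (untested) failure path at all.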

The third option becomes possible if we "allow triggering the OOM killer
for both __GFP_FS and !__GFP_FS allocations" and "introduce an OOM-killer
timeout", so that the OOM deadlocks which we are already observing can be
handled.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-18  9:10                         ` Vlastimil Babka
@ 2015-03-18 12:04                           ` Michal Hocko
  -1 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-18 12:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Johannes Weiner, Tetsuo Handa, akpm, david, mgorman, riel,
	fengguang.wu, linux-mm, linux-kernel

On Wed 18-03-15 10:10:39, Vlastimil Babka wrote:
> On 03/17/2015 08:41 PM, Michal Hocko wrote:
> >On Tue 17-03-15 13:26:28, Johannes Weiner wrote:
> >>On Tue, Mar 17, 2015 at 03:17:29PM +0100, Michal Hocko wrote:
> >>>On Tue 17-03-15 09:29:26, Johannes Weiner wrote:
> >>>>On Tue, Mar 17, 2015 at 11:25:08AM +0100, Michal Hocko wrote:
> >>>>>On Mon 16-03-15 17:11:46, Johannes Weiner wrote:
> >>>>>>A sysctl certainly doesn't sound appropriate to me because this is not
> >>>>>>a tunable that we expect people to set according to their usecase.  We
> >>>>>>expect our model to work for *everybody*.  A boot flag would be
> >>>>>>marginally better but it still reeks too much of tunable.
> >>>>>
> >>>>>I am OK with a boot option as well if the sysctl is considered
> >>>>>inappropriate. It is less flexible though. Consider a regression testing
> >>>>>where the same load is run 2 times once with failing allocations and
> >>>>>once without it. Why should we force the tester to do a reboot cycle?
> >>>>
> >>>>Because we can get rid of the Kconfig more easily once we transitioned.
> >>>
> >>>How? We might be forced to keep the original behavior _for ever_. I do
> >>>not see any difference between runtime, boottime or compiletime option.
> >>>Except for the flexibility which is different for each one of course. We
> >>>can argue about which one is the most appropriate of course but I feel
> >>>strongly we cannot go and change the semantic right away.
> >>
> >>Sure, why not add another slab allocator while you're at it.  How many
> >>times do we have to repeat the same mistakes?  If the old model sucks,
> >>then it needs to be fixed or replaced.  Don't just offer another one
> >>that sucks in different ways and ask the user to pick their poison,
> >>with a promise that we might improve the newer model until it's
> >>suitable to ditch the old one.
> >>
> >>This is nothing more than us failing and giving up trying to actually
> >>solve our problems.
> >
> >I probably fail to communicate the primary intention here. The point
> >of the knob is _not_ to move the responsibility to userspace. Although
> >I would agree that the knob as proposed might look like that and that is
> >my fault.
> >
> >The primary motivation is to actually help _solving_ our long standing
> >problem. Default non-failing allocations policy is simply wrong and we
> >should move away from it. We have a way to _explicitly_ request such a
> >behavior. Are we in agreement on this part?
> >
> >The problem, as I see it, is that such a change cannot be pushed to
> >Linus tree without extensive testing because there are thousands of code
> >paths which never got exercised. We have basically two options here.
> >Either have a non-upstream patch (e.g. sitting in mmotm and linux-next)
> >and have developers to do their testing. This will surely help to
> >catch a lot of fallouts and fix them right away. But we will miss those
> >who are using Linus based trees and would be willing to help to test
> >in their loads which we never dreamed of.
> >The other option would be pushing an experimental code to the Linus
> >tree (and distribution kernels) and allow people to turn it on to help
> >testing.
> >
> >I am not ignoring the rest of the email, I just want to make sure we are
> >on the same page before we go into a potentially lengthy discussion just
> >to find out we are talking past each other.
> >
> >[...]
> 
> After reading this discussion, my impression is: as I understand your
> motivation, the knob is supposed to expose code that has broken handling of
> small allocation failures, because the handling was never exercised (and
> thus found to be broken) before. The steps you are proposing are to allow
> this to be tested by those who understand that it might break their
> machines, until those broken allocation sites are either fixed or converted
> to __GFP_NOFAIL. We want the change of implicit nofail behavior to happen,
> as then we limit the potential deadlocks to explicitly annotated allocation
> sites, which simplifies efforts to prevent the deadlocks (e.g. with
> reserves).

Exactly.
 
> AFAIU, Johannes is worried that the knob adds some possibility that
> allocations will fail prematurely, even though further trying would allow it
> to succeed and would not introduce a deadlock. 

OK, that definitely wasn't the intention, and I have realized that the
knob as proposed is a bad way to go forward. I should have gone with an
on/off knob which would only tell whether the legacy mode (don't fail
small allocations by default) is enabled or disabled.

But this would still leave the configuration space for the testing issues
mentioned by Johannes. So I would like to hear what people think about
the following points:
1) Should we move away from the non-failing small allocation default at all?
2) If yes, do we want a transition plan, or do we simply step in that
   direction and wait for what pops out?

> The probability of this is
> hard to predict even inside MM, yet we assume that userspace will set the
> value. This might discourage some of the volunteers that would be willing to
> test the new behavior, since they could get extra spurious failures. He
> would like to see this to be as reliable as possible, failing allocation
> only when it's absolutely certain that nothing else can be done, and not
> depend on a magically set value from userspace. He also believes that we can
> still improve on the what "can be done" part.

And I agree with that. I am sorry if that wasn't clear from my previous
emails. I just believe that whatever improvements we come up with, we
should still make the (non)failing semantics clear.
 
> I'll add that I think if we do improve the reclaim etc, and make allocations
> failures rarer, then the whole testing effort will have much lower chance of
> finding the places where allocation failures are not handled properly. Also
> Michal says that catching those depend on running all "their loads which we
> never dreamed of". In that case, if our goal is to fix all broken allocation
> sites with some quantifiable probability, I'm afraid we might be really
> better off with some form of fault injection, which will trigger the
> failures with the probability we set, and not depend on corner case
> low memory conditions manifesting just at the time the workload is at
> one of the broken allocation sites.

Fault injection is certainly another, orthogonal way to go IMO.
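
The kernel has had such a facility for a long time (CONFIG_FAILSLAB and
CONFIG_FAIL_PAGE_ALLOC, see Documentation/fault-injection/). Conceptually
it injects a failure with a configurable probability at each allocation
site, independent of actual memory pressure; a sketch of the idea (the
names here are illustrative, not the kernel's):

```python
import random

# Probabilistic fault injection: fail a configurable fraction of
# allocations regardless of memory pressure, so that rarely-exercised
# error paths get hit predictably often.
class FaultInjector:
    def __init__(self, probability, seed=None):
        self.probability = probability
        self.rng = random.Random(seed)

    def should_fail(self):
        return self.rng.random() < self.probability

def kmalloc(size, injector):
    """Stand-in allocator: returns None when a fault is injected."""
    if injector.should_fail():
        return None
    return bytearray(size)

injector = FaultInjector(probability=0.1, seed=42)
failures = sum(kmalloc(64, injector) is None for _ in range(10000))
print(failures)  # roughly 1000 of the 10000 allocations fail
```

Unlike waiting for a real low-memory corner case, this hits the broken
allocation sites at a rate we choose.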

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-18 11:33                 ` Tetsuo Handa
@ 2015-03-18 12:23                   ` Michal Hocko
  -1 siblings, 0 replies; 63+ messages in thread
From: Michal Hocko @ 2015-03-18 12:23 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

On Wed 18-03-15 20:33:03, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > Tetsuo Handa wrote:
> > > > I also tested on XFS. One is Linux 3.19 and the other is Linux 3.19
> > > > with debug printk patch shown above. According to console logs,
> > > > oom_kill_process() is trivially called via pagefault_out_of_memory()
> > > > for the former kernel. Due to giving up !GFP_FS allocations immediately?
> > > >
> > > > (From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-unpatched.txt.xz )
> > > > ---------- xfs / Linux 3.19 ----------
> > > > [  793.283099] su invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
> > > > [  793.283102] su cpuset=/ mems_allowed=0
> > > > [  793.283104] CPU: 3 PID: 9552 Comm: su Not tainted 3.19.0 #40
> > > > [  793.283159] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> > > > [  793.283161]  0000000000000000 ffff88007ac03bf8 ffffffff816ae9d4 000000000000bebe
> > > > [  793.283162]  ffff880078b0d740 ffff88007ac03c98 ffffffff816ac7ac 0000000000000206
> > > > [  793.283163]  0000000481f30298 ffff880073e55850 ffff88007ac03c88 ffff88007a20bef8
> > > > [  793.283164] Call Trace:
> > > > [  793.283169]  [<ffffffff816ae9d4>] dump_stack+0x45/0x57
> > > > [  793.283171]  [<ffffffff816ac7ac>] dump_header+0x7f/0x1f1
> > > > [  793.283174]  [<ffffffff8114b36b>] oom_kill_process+0x22b/0x390
> > > > [  793.283177]  [<ffffffff810776d0>] ? has_capability_noaudit+0x20/0x30
> > > > [  793.283178]  [<ffffffff8114bb72>] out_of_memory+0x4b2/0x500
> > > > [  793.283179]  [<ffffffff8114bc37>] pagefault_out_of_memory+0x77/0x90
> > > > [  793.283180]  [<ffffffff816aab2c>] mm_fault_error+0x67/0x140
> > > > [  793.283182]  [<ffffffff8105a9f6>] __do_page_fault+0x3f6/0x580
> > > > [  793.283185]  [<ffffffff810aed1d>] ? remove_wait_queue+0x4d/0x60
> > > > [  793.283186]  [<ffffffff81070fcb>] ? do_wait+0x12b/0x240
> > > > [  793.283187]  [<ffffffff8105abb1>] do_page_fault+0x31/0x70
> > > > [  793.283189]  [<ffffffff816b83e8>] page_fault+0x28/0x30
> > > > ---------- xfs / Linux 3.19 ----------
> > >
> > > Are all memory allocations caused by page fault __GFP_FS allocation?
> > 
> > They should be GFP_HIGHUSER_MOVABLE or GFP_KERNEL. There should be no
> > reason to have GFP_NOFS there because the page fault doesn't come from a
> > fs path.
> 
> Excuse me, but are you sure? I am seeing 0x2015a (!__GFP_NOFS) allocation
> failures from page fault. SystemTap also reports that 0x2015a is used from
> page fault.
> 
> ----------
> [root@localhost ~]# stap -p4 -d xfs -m pagefault -g -DSTP_NO_OVERLOAD -e '
> global traces_bt[65536];
> probe begin { printf("Probe start!\n"); }
> probe kernel.function("__alloc_pages_nodemask") {
>   if ($gfp_mask == 0x2015a && execname() != "stapio") {
>     bt = backtrace();
>     if (traces_bt[bt]++ == 0) {
>       printf("%s (%u) order:%u gfp:0x%x\n", execname(), tid(), $order, $gfp_mask);
>       print_stack(bt);
>       printf("\n\n");
>     }
>   }
> }
> probe end { delete traces_bt; }'
> pagefault.ko
> [root@localhost ~]# staprun pagefault.ko
> Probe start!
> rsyslogd (1852) order:0 gfp:0x2015a
>  0xffffffff81130030 : __alloc_pages_nodemask+0x0/0x9a0 [kernel]
>  0xffffffff81170d87 : alloc_pages_current+0xa7/0x170 [kernel]
>  0xffffffff81126d07 : __page_cache_alloc+0xb7/0xd0 [kernel]
>  0xffffffff811287a5 : filemap_fault+0x1b5/0x440 [kernel]
>  0xffffffff811502ff : __do_fault+0x3f/0xc0 [kernel]
>  0xffffffff811518e1 : handle_mm_fault+0x5e1/0x13b0 [kernel]
>  0xffffffff810463ef : __do_page_fault+0x18f/0x430 [kernel]
>  0xffffffff8104676c : do_page_fault+0xc/0x10 [kernel]
>  0xffffffff814d67a2 : page_fault+0x22/0x30 [kernel]

Hmm, interesting. This seems to be the page_cache_read path. I really fail
to see why we are considering mapping_gfp_mask here. We are not holding
any fs locks in this path AFAICS. Moreover, we are doing a GFP_KERNEL
allocation a few lines below. I guess this is something to be fixed. I
will look into this.
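For reference, the mask Tetsuo quotes can be decoded against the 3.19-era gfp bit values. This is a quick sketch; the bit constants are copied from include/linux/gfp.h of that era and should be treated as an assumption for any other kernel version:

```python
# Decode a 3.19-era gfp_mask into flag names. Bit values are taken from
# include/linux/gfp.h as of Linux 3.19 (assumption for other versions).
GFP_BITS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x100: "__GFP_COLD",
    0x200: "__GFP_NOWARN",
    0x400: "__GFP_REPEAT",
    0x800: "__GFP_NOFAIL",
    0x1000: "__GFP_NORETRY",
    0x2000: "__GFP_MEMALLOC",
    0x4000: "__GFP_COMP",
    0x8000: "__GFP_ZERO",
    0x10000: "__GFP_NOMEMALLOC",
    0x20000: "__GFP_HARDWALL",
}

def decode_gfp(mask):
    """Return the names of all flag bits set in mask, lowest bit first."""
    return [name for bit, name in sorted(GFP_BITS.items()) if mask & bit]

print(decode_gfp(0x2015a))
# Note: __GFP_FS (0x80) is NOT in the list, so 0x2015a behaves like a
# GFP_NOFS-style allocation for reclaim purposes.
```

The decode yields __GFP_HIGHMEM, __GFP_MOVABLE, __GFP_WAIT, __GFP_IO, __GFP_COLD and __GFP_HARDWALL, i.e. GFP_HIGHUSER_MOVABLE with __GFP_FS cleared and __GFP_COLD added, which would be consistent with the allocation coming through page_cache_alloc with a mapping_gfp_mask that drops __GFP_FS.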

> ----------
> 
> So, your patch introduces a trigger to involve OOM killer for !__GFP_FS
> allocation. I myself think that we should trigger OOM killer for !__GFP_FS
> allocation in order to make forward progress in case the OOM victim is blocked.
> What is the reason we did not involve OOM killer for !__GFP_FS allocation?

Because the reclaim context for these allocations is very restricted. We
might have a lot of cache which needs to be written back before it can
be reclaimed. If we triggered the OOM killer from this path we would see
a lot of premature OOM kills.

> Below is an example from http://I-love.SAKURA.ne.jp/tmp/serial-20150318.txt.xz
> which is Linux 4.0-rc4 + your patch applied, with sysctl_nr_alloc_retry == 1,
> which has fallen into the infinite "XFS: possible memory allocation deadlock in
> xfs_buf_allocate_memory (mode:0x250)" retry trap (an OOM deadlock) caused by
> running multiple memory-stressing processes as described at
> http://www.spinics.net/lists/linux-ext4/msg47216.html .
> 
> [  584.766247] Out of memory: Kill process 27800 (a.out) score 17 or sacrifice child
> [  584.766248] Killed process 27800 (a.out) total-vm:69516kB, anon-rss:33236kB, file-rss:4kB
> (...snipped...)
> [  587.097942] XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
> (...snipped...)
> [  891.677310] a.out           D ffff880069c3fb78     0 27800      1 0x00100084
> [  891.679239]  ffff880069c3fb78 ffff880057b2f570 ffff88007cfaf3b0 0000000000000000
> [  891.681368]  ffff88007fffdb08 0000000000000000 ffff880069c3c010 ffff88007cfaf3b0
> [  891.683519]  ffff88007bde5dc4 00000000ffffffff ffff88007bde5dc8 ffff880069c3fb98
> [  891.685654] Call Trace:
> [  891.686350]  [<ffffffff814d1aee>] schedule+0x3e/0x90
> [  891.687645]  [<ffffffff814d1d0e>] schedule_preempt_disabled+0xe/0x10
> [  891.689289]  [<ffffffff814d2c42>] __mutex_lock_slowpath+0x92/0x100
> [  891.690898]  [<ffffffff81190c16>] ? unlazy_walk+0xe6/0x150
> [  891.692333]  [<ffffffff814d2cd3>] mutex_lock+0x23/0x40
> [  891.693671]  [<ffffffff8119145d>] lookup_slow+0x3d/0xc0
> [  891.695036]  [<ffffffff811946c5>] link_path_walk+0x375/0x910
> [  891.696523]  [<ffffffff81194d28>] path_init+0xc8/0x460
> [  891.697864]  [<ffffffff811970c2>] path_openat+0x72/0x680
> [  891.699280]  [<ffffffff81177f72>] ? fallback_alloc+0x192/0x200
> [  891.700852]  [<ffffffff811771d8>] ? kmem_getpages+0x58/0x110
> [  891.702334]  [<ffffffff8119771a>] do_filp_open+0x4a/0xa0
> [  891.703769]  [<ffffffff811a382d>] ? __alloc_fd+0xcd/0x140
> [  891.705200]  [<ffffffff81183d45>] do_sys_open+0x145/0x240
> [  891.706650]  [<ffffffff81183e7e>] SyS_open+0x1e/0x20
> [  891.707976]  [<ffffffff814d4d32>] system_call_fastpath+0x12/0x17
> (...snipped...)
> [  899.777423] init: page allocation failure: order:0, mode:0x2015a
> [  899.777424] CPU: 2 PID: 1 Comm: init Tainted: G            E   4.0.0-rc4+ #13
> [  899.777425] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> [  899.777426]  0000000000000000 ffff88007d07ba98 ffffffff814d0ee5 0000000000000001
> [  899.777426]  000000000002015a ffff88007d07bb28 ffffffff8112f2ba ffff88007fffdb28
> [  899.777427]  ffff88007d07bab8 0000000000000020 000000000002015a 0000000000000000
> [  899.777428] Call Trace:
> [  899.777430]  [<ffffffff814d0ee5>] dump_stack+0x48/0x5b
> [  899.777431]  [<ffffffff8112f2ba>] warn_alloc_failed+0xea/0x130
> [  899.777432]  [<ffffffff81130699>] __alloc_pages_nodemask+0x669/0x9a0
> [  899.777434]  [<ffffffff81170d87>] alloc_pages_current+0xa7/0x170
> [  899.777435]  [<ffffffff81126d07>] __page_cache_alloc+0xb7/0xd0
> [  899.777436]  [<ffffffff811287a5>] filemap_fault+0x1b5/0x440
> [  899.777437]  [<ffffffff811502ff>] __do_fault+0x3f/0xc0
> [  899.777438]  [<ffffffff811518e1>] handle_mm_fault+0x5e1/0x13b0
> [  899.777441]  [<ffffffff8108098a>] ? set_next_entity+0x2a/0x60
> [  899.777442]  [<ffffffff810463ef>] __do_page_fault+0x18f/0x430
> [  899.777443]  [<ffffffff8104676c>] do_page_fault+0xc/0x10
> [  899.777445]  [<ffffffff814d67a2>] page_fault+0x22/0x30
> (...snipped...)
> [ 1013.096701] XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
> ----------
> 
> We have a mutex_lock() which effectively prevented the GFP_NOFAIL allocation
> at xfs_buf_allocate_memory() from making forward progress when the OOM
> victim is blocked at mutex_lock(). As long as there are GFP_NOFAIL users,
> we need some heuristic mechanism for detecting stalls.

One of those is to give the GFP_NOFAIL user access to memory reserves after
it has failed to make any progress over several OOM attempts. If the
caller is using GFP_NOFAIL appropriately then we should be slightly
better off. XFS people refused to replace their open-coded GFP_NOFAIL
because they have plans to implement failure strategies, so they didn't
consider the change worth it.
 
> While your patch seems to shorten the duration of !__GFP_FS allocations,
> I can't feel that the I/O layer is making forward progress because the
> system is stalling as if forever retrying !__GFP_FS allocations rather
> than returning an I/O error to the caller.

Yes and this is unfixable from the MM layer IMO.

> Maybe somewhere in the I/O layer is
> stalling due to use of the same watermark threshold for GFP_NOIO /
> GFP_NOFS / GFP_KERNEL allocations, though I didn't check for details...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread


* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-18  9:10                         ` Vlastimil Babka
@ 2015-03-18 12:36                           ` Tetsuo Handa
  -1 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-03-18 12:36 UTC (permalink / raw)
  To: vbabka, mhocko, hannes
  Cc: akpm, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

Vlastimil Babka wrote:
> I'll add that I think if we do improve the reclaim etc., and make
> allocation failures rarer, then the whole testing effort will have a much
> lower chance of finding the places where allocation failures are not
> handled properly. Also Michal says that catching those depends on running
> all "their loads which we never dreamed of". In that case, if our goal
> is to fix all broken allocation sites with some quantifiable
> probability, I'm afraid we might really be better off with some form of
> fault injection, which will trigger the failures with the probability we
> set, and not depend on corner-case low memory conditions manifesting
> just at the time the workload is at one of the broken allocation sites.
> 

I think we can use SystemTap-based fault injection, which injects a
failure only once per unique backtrace without putting the system under
OOM conditions, as I demonstrated at https://lkml.org/lkml/2014/12/25/64 .

Since SystemTap can generate backtraces without garbage lines,
we can uniquely identify each backtrace and inject only once per backtrace,
making it possible to test every memory allocation caller.
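The once-per-backtrace policy these scripts rely on can be modeled in a few lines of plain Python (illustration only; the real scripts key on SystemTap's backtrace() string):

```python
# Model of "inject a failure at most once per unique backtrace":
# the first time a call path is seen, the allocation would be forced
# to fail; subsequent hits of the same path pass through untouched.
seen = {}  # backtrace -> number of times observed

def should_inject(backtrace):
    """Return True only on the first occurrence of this backtrace."""
    count = seen.get(backtrace, 0)
    seen[backtrace] = count + 1
    return count == 0

bt = ("__kmalloc", "do_sys_open", "system_call")
print(should_inject(bt))  # True: first hit on this path, inject a failure
print(should_inject(bt))  # False: already tested, leave it alone
```

This is why garbage-free backtraces matter: if the same call path produced slightly different backtrace strings on each hit, the dedup table would treat them as distinct paths and inject repeatedly.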

Steps for installation and testing are described below.

---------- installation start ----------
wget https://sourceware.org/systemtap/ftp/releases/systemtap-2.7.tar.gz
echo 'e0c3c36955323ae59be07a26a9563474  systemtap-2.7.tar.gz' | md5sum --check -
tar -zxf systemtap-2.7.tar.gz
cd systemtap-2.7
./configure --prefix=$HOME/systemtap.tmp
make -s
make -s install
---------- installation end ----------

---------- preparation (optional) start ----------
Start kdump service and set /proc/sys/kernel/panic_on_oops to 1
as root user so that we can obtain vmcore upon kernel oops.
---------- preparation (optional) end ----------

---------- testing start ----------
Run

$HOME/systemtap.tmp/bin/staprun fault_injection.ko

and operate as you like, and see whether your system can survive or not.
---------- testing end ----------

The fault_injection.ko module is generated by the commands shown below.
The scripts shown below check only sleepable allocations. If you
replace %{ __GFP_WAIT %} with 0, you can check atomic allocations instead.

---------- For testing __kmalloc() failure ----------
$HOME/systemtap.tmp/bin/stap -p4 -m fault_injection -g -DSTP_NO_OVERLOAD -e '
global traces_bt[65536];
probe begin { printf("Probe start!\n"); }
probe kernel.function("__kmalloc") {
  if (($flags & %{ __GFP_NOFAIL | __GFP_WAIT %} ) == %{ __GFP_WAIT %} && execname() != "stapio") {
    bt = backtrace();
    if (traces_bt[bt]++ == 0) {
      printf("%s (%u) size:%u gfp:0x%x\n", execname(), tid(), $size, $flags);
      print_stack(bt);
      printf("\n\n");
      $size = 1 << 30;
    }
  }
}
probe end { delete traces_bt; }'
---------- For testing __kmalloc() failure ----------

As the example shown below demonstrates, we will be able to selectively
test specific subsystems by setting a per-task_struct marker.

---------- For testing __alloc_pages_nodemask() failure except page fault ----------
$HOME/systemtap.tmp/bin/stap -p4 -m fault_injection -g -DSTP_NO_OVERLOAD -e '
global traces_bt[65536];
global in_page_fault%;
probe begin { printf("Probe start!\n"); }
probe kernel.function("__alloc_pages_nodemask") {
  if (($gfp_mask & %{ __GFP_NOFAIL | __GFP_WAIT %} ) == %{ __GFP_WAIT %} &&
      in_page_fault[tid()] == 0 && execname() != "stapio") {
    bt = backtrace();
    if (traces_bt[bt]++ == 0) {
      printf("%s (%u) order:%u gfp:0x%x\n", execname(), tid(), $order, $gfp_mask);
      print_stack(bt);
      printf("\n\n");
      $order = 1 << 30;
      $gfp_mask = $gfp_mask | %{ __GFP_NORETRY %};
    }
  }
}
probe kernel.function("handle_mm_fault") {
  in_page_fault[tid()]++;
}
probe kernel.function("handle_mm_fault").return {
  in_page_fault[tid()]--;
}
probe end { delete traces_bt; delete in_page_fault; }'
---------- For testing __alloc_pages_nodemask() failure except page fault ----------

^ permalink raw reply	[flat|nested] 63+ messages in thread


* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
  2015-03-18 12:23                   ` Michal Hocko
@ 2015-03-19 11:03                     ` Tetsuo Handa
  -1 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-03-19 11:03 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, hannes, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

Michal Hocko wrote:
> > So, your patch introduces a trigger to involve OOM killer for !__GFP_FS
> > allocation. I myself think that we should trigger OOM killer for !__GFP_FS
> > allocation in order to make forward progress in case the OOM victim is blocked.
> > What is the reason we did not involve OOM killer for !__GFP_FS allocation?
> 
> Because the reclaim context for these allocations is very restricted. We
> might have a lot of cache which needs to be written down before it will
> be reclaimed. If we triggered OOM from this path we would see a lot of
> pre-mature OOM killers triggered.

I see. I was worried that the reason was related to possible deadlocks.

Not giving up waiting for cache which _needs to be_ written back before
it can be reclaimed (sysctl_nr_alloc_retry == ULONG_MAX) is causing the
system lockups we are seeing, isn't it?

Giving up waiting for cache which _needs to be_ written back before
it can be reclaimed (sysctl_nr_alloc_retry == 1) is also causing the
many premature page allocation failures I'm seeing, isn't it?

    /*
     * If we fail to make progress by freeing individual
     * pages, but the allocation wants us to keep going,
     * start OOM killing tasks.
     */
    if (!did_some_progress) {
            page = __alloc_pages_may_oom(gfp_mask, order, ac,
                                            &did_some_progress);
            if (page)
                    goto got_pg;
            if (!did_some_progress)
                    goto nopage;

            nr_retries++;
    }
    /* Wait for some write requests to complete then retry */
    wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
    goto retry;

If we can somehow tell that there is no more cache which _can be_
written back before it is reclaimed, we don't need to use
sysctl_nr_alloc_retry and can trigger the OOM killer right away, right?
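Reading the quoted loop together with the knob, a toy model shows how the two sysctl settings diverge. This is plain Python, not kernel code; the exact placement of the retry-count check relative to the quoted snippet is my assumption about the patch:

```python
def alloc_with_retries(nr_alloc_retry, max_steps=1000):
    """Toy model of the quoted slowpath when reclaim makes no progress
    and the OOM victim is blocked, so killing it frees nothing."""
    nr_retries = 0
    for _ in range(max_steps):
        did_some_progress = False       # direct reclaim freed nothing
        if not did_some_progress:
            page = None                 # __alloc_pages_may_oom(): no memory freed
            oom_selected_victim = True  # a victim exists, so OOM reports "progress"
            if page is not None:
                return "allocated"
            if not oom_selected_victim:
                return "failed"         # nopage
            nr_retries += 1
            if nr_retries > nr_alloc_retry:
                return "failed"         # assumed placement of the knob's check
        # wait_iff_congested(...); goto retry
    return "still looping"              # models the ULONG_MAX default

print(alloc_with_retries(1))      # gives up after two OOM attempts
print(alloc_with_retries(10**9))  # never exits on its own (bounded here)
```

With nr_alloc_retry == 1 the allocation fails after a couple of OOM attempts, matching the premature 0x2015a failures quoted above; with the ULONG_MAX default the loop never exits while the victim stays blocked, matching the lockups.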



Today's stress testing found another problem which your patch does not
address. If a page fault causes a SIGBUS signal while the first OOM victim
cannot be terminated due to a mutex_lock() dependency, the process which
triggered the page fault will likely be killed. If that process is the
global init, a kernel panic is triggered, as shown below.

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150319-1.txt.xz )
----------
[ 1277.833918] a.out           D ffff880066503bc8     0  8374   5141 0x00000080
[ 1277.835930]  ffff880066503bc8 ffff88007d1180d0 ffff8800664fd070 ffff880066503bc8
[ 1277.838052]  ffffffff8122a0eb ffff88007b8edc20 ffff880066500010 ffff88007fc93740
[ 1277.840102]  7fffffffffffffff ffff880066503d20 0000000000000002 ffff880066503be8
[ 1277.842128] Call Trace:
[ 1277.842766]  [<ffffffff8122a0eb>] ? blk_peek_request+0x8b/0x2a0
[ 1277.844222]  [<ffffffff814d1aee>] schedule+0x3e/0x90
[ 1277.845459]  [<ffffffff814d3dfd>] schedule_timeout+0x12d/0x1a0
[ 1277.846885]  [<ffffffff810b5b16>] ? ktime_get+0x46/0xb0
[ 1277.848189]  [<ffffffff814d0faa>] io_schedule_timeout+0xaa/0x130
[ 1277.849702]  [<ffffffff8108c610>] ? prepare_to_wait+0x60/0x90
[ 1277.851173]  [<ffffffff814d1d90>] ? bit_wait_io_timeout+0x80/0x80
[ 1277.852661]  [<ffffffff814d1dc6>] bit_wait_io+0x36/0x50
[ 1277.853946]  [<ffffffff814d2125>] __wait_on_bit+0x65/0x90
[ 1277.855005] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 1277.855083] ata4: EH complete
[ 1277.855174] sd 4:0:0:0: [sdb] tag#27 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855175] sd 4:0:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855177] sd 4:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855178] sd 4:0:0:0: [sdb] tag#21 CDB: Read(10) 28 00 05 33 29 37 00 00 08 00
[ 1277.855179] sd 4:0:0:0: [sdb] tag#27 CDB: Read(10) 28 00 05 33 53 6f 00 00 08 00
[ 1277.855179] sd 4:0:0:0: [sdb] tag#18 CDB: Write(10) 2a 00 05 34 be 67 00 00 08 00
[ 1277.855183] blk_update_request: I/O error, dev sdb, sector 87249775
[ 1277.855256] sd 4:0:0:0: [sdb] tag#22 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855257] sd 4:0:0:0: [sdb] tag#28 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855258] sd 4:0:0:0: [sdb] tag#22 CDB: Read(10) 28 00 05 33 4e 2f 00 00 08 00
[ 1277.855259] sd 4:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855260] blk_update_request: I/O error, dev sdb, sector 87248431
[ 1277.855260] sd 4:0:0:0: [sdb] tag#28 CDB: Write(10) 2a 00 05 33 5a 27 00 00 18 00
[ 1277.855261] sd 4:0:0:0: [sdb] tag#19 CDB: Read(10) 28 00 05 34 ba a7 00 00 08 00
[ 1277.855261] blk_update_request: I/O error, dev sdb, sector 87251495
[ 1277.855262] blk_update_request: I/O error, dev sdb, sector 87341735
[ 1277.855319] sd 4:0:0:0: [sdb] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855320] sd 4:0:0:0: [sdb] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855320] sd 4:0:0:0: [sdb] tag#20 CDB: Write(10) 2a 00 05 33 4d 37 00 00 08 00
[ 1277.855321] sd 4:0:0:0: [sdb] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855322] blk_update_request: I/O error, dev sdb, sector 87248183
[ 1277.855322] sd 4:0:0:0: [sdb] tag#29 CDB: Write(10) 2a 00 05 34 c1 2f 00 00 10 00
[ 1277.855323] sd 4:0:0:0: [sdb] tag#24 CDB: Read(10) 28 00 05 33 52 9f 00 00 08 00
[ 1277.855323] blk_update_request: I/O error, dev sdb, sector 87343407
[ 1277.855324] blk_update_request: I/O error, dev sdb, sector 87249567
[ 1277.855373] sd 4:0:0:0: [sdb] tag#30 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855374] blk_update_request: I/O error, dev sdb, sector 87343167
[ 1277.855374] sd 4:0:0:0: [sdb] tag#30 CDB: Read(10) 28 00 05 33 23 af 00 00 08 00
[ 1277.855375] blk_update_request: I/O error, dev sdb, sector 87237551
[ 1277.855376] blk_update_request: I/O error, dev sdb, sector 87250175
[ 1277.855728] Buffer I/O error on dev sdb1, logical block 1069738, lost async page write
[ 1277.855736] Buffer I/O error on dev sdb1, logical block 10917863, lost async page write
[ 1277.855739] Buffer I/O error on dev sdb1, logical block 10917864, lost async page write
[ 1277.855741] Buffer I/O error on dev sdb1, logical block 10917885, lost async page write
[ 1277.855744] Buffer I/O error on dev sdb1, logical block 11933189, lost async page write
[ 1277.855749] Buffer I/O error on dev sdb1, logical block 10917840, lost async page write
[ 1277.855768] Buffer I/O error on dev sdb1, logical block 10917829, lost async page write
[ 1277.856003] Buffer I/O error on dev sdb1, logical block 10906429, lost async page write
[ 1277.856008] Buffer I/O error on dev sdb1, logical block 10906430, lost async page write
[ 1277.856011] Buffer I/O error on dev sdb1, logical block 10906431, lost async page write
[ 1277.856847] XFS (sdb1): metadata I/O error: block 0x50080d0 ("xlog_iodone") error 5 numblks 64
[ 1277.856850] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa00f31a9
[ 1277.857054] XFS (sdb1): Log I/O Error Detected.  Shutting down filesystem
[ 1277.857055] XFS (sdb1): Please umount the filesystem and rectify the problem(s)
[ 1277.858225] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 1 0 0 1426725983 e pipe failed
[ 1277.858320] Core dump to |/usr/libexec/abrt-hook-ccpp 7 16777216 4995 0 0 1426725983 e pipe failed
[ 1277.858385] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000007

[ 1277.858387] CPU: 1 PID: 1 Comm: init Tainted: G            E   4.0.0-rc4+ #15
[ 1277.858388] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 1277.858390]  ffff880037baf740 ffff88007d07bc38 ffffffff814d0ee5 000000000000fffe
[ 1277.858391]  ffffffff81701f18 ffff88007d07bcb8 ffffffff814d0c6c ffffffff00000010
[ 1277.858392]  ffff88007d07bcc8 ffff88007d07bc68 0000000000000008 ffff88007d07bcb8
[ 1277.858392] Call Trace:
[ 1277.858398]  [<ffffffff814d0ee5>] dump_stack+0x48/0x5b
[ 1277.858400]  [<ffffffff814d0c6c>] panic+0xbb/0x1fa
[ 1277.858403]  [<ffffffff81055871>] do_exit+0xb51/0xb90
[ 1277.858404]  [<ffffffff81055901>] do_group_exit+0x51/0xc0
[ 1277.858406]  [<ffffffff81061dd2>] get_signal+0x222/0x590
[ 1277.858408]  [<ffffffff81002496>] do_signal+0x36/0x710
[ 1277.858411]  [<ffffffff810461d0>] ? mm_fault_error+0xd0/0x160
[ 1277.858413]  [<ffffffff8104661b>] ? __do_page_fault+0x3bb/0x430
[ 1277.858414]  [<ffffffff81002bb8>] do_notify_resume+0x48/0x60
[ 1277.858416]  [<ffffffff814d59a7>] retint_signal+0x41/0x7a
----------

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150319-2.txt.xz )
----------
[ 2822.642453] scsi_io_completion: 21 callbacks suppressed
[ 2822.644049] sd 4:0:0:0: [sdb] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.646453] sd 4:0:0:0: [sdb] tag#10 CDB: Read(10) 28 00 05 32 28 ff 00 00 08 00
[ 2822.648630] blk_update_request: 21 callbacks suppressed
[ 2822.648631] blk_update_request: I/O error, dev sdb, sector 87173375
[ 2822.648663] sd 4:0:0:0: [sdb] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648665] sd 4:0:0:0: [sdb] tag#11 CDB: Write(10) 2a 00 05 32 25 6f 00 00 08 00
[ 2822.648665] blk_update_request: I/O error, dev sdb, sector 87172463
[ 2822.648676] sd 4:0:0:0: [sdb] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648677] sd 4:0:0:0: [sdb] tag#12 CDB: Read(10) 28 00 05 32 1e 8f 00 00 08 00
[ 2822.648678] blk_update_request: I/O error, dev sdb, sector 87170703
[ 2822.648700] sd 4:0:0:0: [sdb] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648701] sd 4:0:0:0: [sdb] tag#13 CDB: Write(10) 2a 00 05 32 35 17 00 00 08 00
[ 2822.648701] blk_update_request: I/O error, dev sdb, sector 87176471
[ 2822.648711] sd 4:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648712] sd 4:0:0:0: [sdb] tag#14 CDB: Write(10) 2a 00 05 32 8c 77 00 00 08 00
[ 2822.648713] blk_update_request: I/O error, dev sdb, sector 87198839
[ 2822.648722] sd 4:0:0:0: [sdb] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648723] sd 4:0:0:0: [sdb] tag#15 CDB: Read(10) 28 00 05 0b fb 8f 00 00 08 00
[ 2822.648723] blk_update_request: I/O error, dev sdb, sector 84671375
[ 2822.648742] sd 4:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648743] sd 4:0:0:0: [sdb] tag#16 CDB: Read(10) 28 00 05 33 3e f7 00 00 08 00
[ 2822.648744] blk_update_request: I/O error, dev sdb, sector 87244535
[ 2822.648753] sd 4:0:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648754] sd 4:0:0:0: [sdb] tag#17 CDB: Write(10) 2a 00 05 d1 eb 77 00 00 08 00
[ 2822.648755] blk_update_request: I/O error, dev sdb, sector 97643383
[ 2822.648759] sd 4:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648760] sd 4:0:0:0: [sdb] tag#19 CDB: Read(10) 28 00 05 32 f9 7f 00 00 08 00
[ 2822.648760] blk_update_request: I/O error, dev sdb, sector 87226751
[ 2822.648778] sd 4:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648780] sd 4:0:0:0: [sdb] tag#18 CDB: Read(10) 28 00 05 32 e3 37 00 00 08 00
[ 2822.648780] blk_update_request: I/O error, dev sdb, sector 87221047
[ 2822.649830] buffer_io_error: 122 callbacks suppressed
[ 2822.649832] Buffer I/O error on dev sdb1, logical block 10896550, lost async page write
[ 2822.649850] Buffer I/O error on dev sdb1, logical block 10897051, lost async page write
[ 2822.649864] Buffer I/O error on dev sdb1, logical block 10899847, lost async page write
[ 2822.649878] Buffer I/O error on dev sdb1, logical block 12205415, lost async page write
[ 2822.649893] Buffer I/O error on dev sdb1, logical block 10903400, lost async page write
[ 2822.649900] Buffer I/O error on dev sdb1, logical block 10905034, lost async page write
[ 2822.649902] Buffer I/O error on dev sdb1, logical block 10905077, lost async page write
[ 2822.649908] Buffer I/O error on dev sdb1, logical block 10900244, lost async page write
[ 2822.649910] Buffer I/O error on dev sdb1, logical block 10901263, lost async page write
[ 2822.649915] Buffer I/O error on dev sdb1, logical block 10899976, lost async page write
[ 2822.649920] XFS (sdb1): metadata I/O error: block 0x50046c8 ("xlog_iodone") error 5 numblks 64
[ 2822.649924] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa00f31a9
[ 2822.650440] XFS (sdb1): Log I/O Error Detected.  Shutting down filesystem
[ 2822.650440] XFS (sdb1): Please umount the filesystem and rectify the problem(s)
[ 2822.650444] XFS (sdb1): metadata I/O error: block 0x5004701 ("xlog_iodone") error 5 numblks 64
[ 2822.650445] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa00f31a9
[ 2822.650446] XFS (sdb1): metadata I/O error: block 0x5004741 ("xlog_iodone") error 5 numblks 64
[ 2822.650447] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa00f31a9
[ 2822.650819] XFS (sdb1): xfs_log_force: error -5 returned.
[ 2822.676108] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 2233 0 0 1426728845 e pipe failed
[ 2822.676268] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 1847 0 0 1426728845 e pipe failed
[ 2822.761872] XFS (sdb1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[ 2822.814996] XFS (sdb1): xfs_log_force: error -5 returned.
[ 2823.210912] audit: *NO* daemon at audit_pid=1847
[ 2823.212289] audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=320
[ 2823.214238] audit: auditd disappeared
[ 2823.215419] audit: type=1701 audit(1426728846.212:69): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2196 comm="master" exe="/usr/libexec/postfix/master" sig=7
[ 2823.219662] audit: type=1701 audit(1426728846.214:70): auid=4294967295 uid=89 gid=89 ses=4294967295 pid=9984 comm="pickup" exe="/usr/libexec/postfix/pickup" sig=7
[ 2823.228854] audit: type=1701 audit(1426728846.229:71): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1880 comm="rsyslogd" exe="/sbin/rsyslogd" sig=7
[ 2823.232849] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 1877 0 0 1426728846 e pipe failed
[ 2823.240671] audit: type=1701 audit(1426728846.241:72): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2265 comm="smbd" exe="/usr/sbin/smbd" sig=7
[ 2823.244547] Core dump to |/usr/libexec/abrt-hook-ccpp 7 16777216 2265 0 0 1426728846 e pipe failed
[ 2823.247697] audit: type=1701 audit(1426728846.248:73): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2242 comm="smbd" exe="/usr/sbin/smbd" sig=7
[ 2823.252653] Core dump to |/usr/libexec/abrt-hook-ccpp 7 16777216 2242 0 0 1426728846 e pipe failed
[ 2823.263635] audit: type=1701 audit(1426728846.264:74): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1801 comm="dhclient" exe="/sbin/dhclient" sig=7
[ 2823.267442] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 1801 0 0 1426728846 e pipe failed
[ 2848.437629] audit: type=1701 audit(1426728871.443:75): auid=0 uid=0 gid=0 ses=5 pid=10052 comm="bash" exe="/bin/bash" sig=7
[ 2848.444223] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 10052 0 0 1426728871 e pipe failed
[ 2848.449033] audit: type=1701 audit(1426728871.454:76): auid=0 uid=0 gid=0 ses=5 pid=9958 comm="login" exe="/bin/login" sig=7
[ 2848.455734] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 9958 0 0 1426728871 e pipe failed
[ 2848.460683] audit: type=1701 audit(1426728871.466:77): auid=4294967295 uid=81 gid=81 ses=4294967295 pid=2048 comm="dbus-daemon" exe="/bin/dbus-daemon" sig=7
[ 2848.464454] audit: type=1701 audit(1426728871.470:78): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1 comm="init" exe="/sbin/init" sig=7
[ 2848.464577] audit: type=1701 audit(1426728871.470:79): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=9986 comm="console-kit-dae" exe="/usr/sbin/console-kit-daemon" sig=7
[ 2848.464578] audit: type=1701 audit(1426728871.470:80): auid=4294967295 uid=70 gid=70 ses=4294967295 pid=2060 comm="avahi-daemon" exe="/usr/sbin/avahi-daemon" sig=7
[ 2848.465160] audit: type=1701 audit(1426728871.470:81): auid=4294967295 uid=70 gid=70 ses=4294967295 pid=2061 comm="avahi-daemon" exe="/usr/sbin/avahi-daemon" sig=7
[ 2848.476649] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 9986 0 0 1426728871 e pipe failed
[ 2848.481923] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 1 0 0 1426728871 e pipe failed
[ 2848.484090] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000007
[ 2848.484090] 
[ 2848.486554] CPU: 2 PID: 1 Comm: init Tainted: G            E   4.0.0-rc4+ #15
[ 2848.488377] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 2848.491120]  ffff880037be4ac0 ffff88007d07bc38 ffffffff814d0ee5 000000000000fffe
[ 2848.493549]  ffffffff81701f18 ffff88007d07bcb8 ffffffff814d0c6c ffffffff00000010
[ 2848.495723]  ffff88007d07bcc8 ffff88007d07bc68 0000000000000008 ffff88007d07bcb8
[ 2848.498043] Call Trace:
[ 2848.498738]  [<ffffffff814d0ee5>] dump_stack+0x48/0x5b
[ 2848.500114]  [<ffffffff814d0c6c>] panic+0xbb/0x1fa
[ 2848.501394]  [<ffffffff81055871>] do_exit+0xb51/0xb90
[ 2848.502710]  [<ffffffff81055901>] do_group_exit+0x51/0xc0
[ 2848.504128]  [<ffffffff81061dd2>] get_signal+0x222/0x590
[ 2848.505509]  [<ffffffff81002496>] do_signal+0x36/0x710
[ 2848.506848]  [<ffffffff810461d0>] ? mm_fault_error+0xd0/0x160
[ 2848.508421]  [<ffffffff8104661b>] ? __do_page_fault+0x3bb/0x430
[ 2848.509966]  [<ffffffff81002bb8>] do_notify_resume+0x48/0x60
[ 2848.511435]  [<ffffffff814d59a7>] retint_signal+0x41/0x7a
----------

An innocent (possibly critical) process can be killed unexpectedly, rather
than returning -ENOMEM to system calls or NULL to e.g. kmalloc() users.
This is much worse than choosing a second OOM victim upon timeout.

Why not change each caller to use either __GFP_NOFAIL or __GFP_NORETRY,
rather than introduce a global sysctl_nr_alloc_retry which unconditionally
allows small allocations to fail?
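For illustration only, the difference between per-callsite annotations and a
global retry cap can be modelled in userspace C (all names here are invented
stand-ins, not real kernel code): a __GFP_NOFAIL-like caller loops forever, a
__GFP_NORETRY-like caller fails after a single reclaim pass, and every other
caller would be governed by the proposed global knob.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags mirroring the kernel's gfp semantics in name only. */
#define SIM_NOFAIL  0x1  /* never give up, like __GFP_NOFAIL   */
#define SIM_NORETRY 0x2  /* fail after one pass, like __GFP_NORETRY */

/* Global knob modelling the proposed sysctl_nr_alloc_retry. */
static unsigned long sim_nr_alloc_retry = 3;

/* Stand-in for "did direct reclaim free anything"; driven by a
 * countdown so the behaviour is deterministic. */
static int pages_reclaimable;

static bool reclaim_progress(void)
{
	if (pages_reclaimable > 0) {
		pages_reclaimable--;
		return true;
	}
	return false;
}

/* Returns true if the "allocation" succeeds, false if it fails. */
static bool sim_alloc(unsigned flags)
{
	unsigned long nr_retries = 0;

	for (;;) {
		if (reclaim_progress())
			return true;	/* got memory */
		if (flags & SIM_NOFAIL)
			continue;	/* retry forever (can hang here) */
		if (flags & SIM_NORETRY)
			return false;	/* single attempt only */
		if (++nr_retries >= sim_nr_alloc_retry)
			return false;	/* global cap reached: fail */
	}
}
```

Under this model, setting the knob to ULONG_MAX reproduces the "small
allocations never fail" behaviour for every unannotated caller, which is why
annotating each callsite individually sidesteps the global knob entirely.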



I also found yet another problem. A kernel worker thread seems to be stalling
forever at bdi_writeback_workfn() => xfs_vm_writepage() =>
xfs_buf_allocate_memory() => alloc_pages_current() => shrink_inactive_list()
=> congestion_wait(), while kswapd0 seems to be stalling forever at
shrink_inactive_list() => xfs_vm_writepage() => xlog_grant_head_wait().

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150319-3.txt.xz )
----------
[ 1392.529392] sysrq: SysRq : Show Blocked State
[ 1392.532232]   task                        PC stack   pid father
[ 1392.535229] kswapd0         D ffff88007c3a3728     0    40      2 0x00000000
[ 1392.538916]  ffff88007c3a3728 ffff88007c5514b0 ffff88007cddac00 0000000000000008
[ 1392.543034]  0000000000000286 ffff88007ac35a42 ffff88007c3a0010 ffff88007c11fdc0
[ 1392.547234]  ffff880037af61c0 00000000000009cc 00000000000147a0 ffff88007c3a3748
[ 1392.551350] Call Trace:
[ 1392.552673]  [<ffffffff814d1aee>] schedule+0x3e/0x90
[ 1392.555285]  [<ffffffffa00f4377>] xlog_grant_head_wait+0xb7/0x1c0 [xfs]
[ 1392.558611]  [<ffffffffa00f4546>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[ 1392.561938]  [<ffffffffa00f4642>] xfs_log_reserve+0xe2/0x220 [xfs]
[ 1392.565083]  [<ffffffffa00efc85>] xfs_trans_reserve+0x1e5/0x220 [xfs]
[ 1392.568312]  [<ffffffffa00efe9a>] ? _xfs_trans_alloc+0x3a/0xa0 [xfs]
[ 1392.570965]  [<ffffffffa00ca76a>] xfs_setfilesize_trans_alloc+0x4a/0xb0 [xfs]
[ 1392.572670]  [<ffffffffa00ccb15>] xfs_vm_writepage+0x4a5/0x5a0 [xfs]
[ 1392.574173]  [<ffffffff81139aac>] shrink_page_list+0x43c/0x9d0
[ 1392.575566]  [<ffffffff8113a6c5>] shrink_inactive_list+0x275/0x500
[ 1392.577084]  [<ffffffff812565c0>] ? radix_tree_gang_lookup_tag+0x90/0xd0
[ 1392.578657]  [<ffffffff8113b2e1>] shrink_lruvec+0x641/0x730
[ 1392.580093]  [<ffffffff8108098a>] ? set_next_entity+0x2a/0x60
[ 1392.581516]  [<ffffffff810aeaac>] ? lock_timer_base+0x3c/0x70
[ 1392.582915]  [<ffffffff810aed93>] ? try_to_del_timer_sync+0x53/0x70
[ 1392.584408]  [<ffffffff8113b8c5>] shrink_zone+0x75/0x1b0
[ 1392.585669]  [<ffffffff8113c19e>] kswapd+0x4de/0x9a0
[ 1392.586967]  [<ffffffff8113bcc0>] ? zone_reclaim+0x2c0/0x2c0
[ 1392.588363]  [<ffffffff8113bcc0>] ? zone_reclaim+0x2c0/0x2c0
[ 1392.589738]  [<ffffffff8106f0ae>] kthread+0xce/0xf0
[ 1392.590967]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1392.592629]  [<ffffffff814d4c88>] ret_from_fork+0x58/0x90
[ 1392.593909]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1392.595582] kworker/u16:29  D ffff88007b28e8e8     0   440      2 0x00000000
[ 1392.597428] Workqueue: writeback bdi_writeback_workfn (flush-8:0)
[ 1392.598998]  ffff88007b28e8e8 ffff88007c9288c0 ffff88007b27aa80 ffff88007b28e8e8
[ 1392.600983]  ffffffff810aed93 ffff88007fc93740 ffff88007b28c010 ffff88007d1b8000
[ 1392.603045]  000000010010ac98 ffff88007d1b8000 000000010010ac34 ffff88007b28e908
[ 1392.605052] Call Trace:
[ 1392.605698]  [<ffffffff810aed93>] ? try_to_del_timer_sync+0x53/0x70
[ 1392.607218]  [<ffffffff814d1aee>] schedule+0x3e/0x90
[ 1392.608421]  [<ffffffff814d3dc9>] schedule_timeout+0xf9/0x1a0
[ 1392.609785]  [<ffffffff810aefd0>] ? add_timer_on+0xd0/0xd0
[ 1392.611093]  [<ffffffff814d0faa>] io_schedule_timeout+0xaa/0x130
[ 1392.612599]  [<ffffffff811447af>] congestion_wait+0x7f/0x100
[ 1392.614043]  [<ffffffff8108c170>] ? woken_wake_function+0x20/0x20
[ 1392.615530]  [<ffffffff8113a8f4>] shrink_inactive_list+0x4a4/0x500
[ 1392.617051]  [<ffffffff8113b2e1>] shrink_lruvec+0x641/0x730
[ 1392.618446]  [<ffffffff8114d850>] ? list_lru_count_one+0x20/0x30
[ 1392.619926]  [<ffffffff8113b8c5>] shrink_zone+0x75/0x1b0
[ 1392.621223]  [<ffffffff8113c7f3>] do_try_to_free_pages+0x193/0x340
[ 1392.622680]  [<ffffffff8113caf7>] try_to_free_pages+0xb7/0x140
[ 1392.624074]  [<ffffffff811305bf>] __alloc_pages_nodemask+0x58f/0x9a0
[ 1392.625567]  [<ffffffff81170d87>] alloc_pages_current+0xa7/0x170
[ 1392.626992]  [<ffffffffa00d2a05>] xfs_buf_allocate_memory+0x1a5/0x290 [xfs]
[ 1392.628650]  [<ffffffffa00d3fe0>] xfs_buf_get_map+0x130/0x180 [xfs]
[ 1392.630281]  [<ffffffffa00d4060>] xfs_buf_read_map+0x30/0x100 [xfs]
[ 1392.631815]  [<ffffffffa00fffb9>] xfs_trans_read_buf_map+0xd9/0x300 [xfs]
[ 1392.633666]  [<ffffffffa00aa029>] xfs_btree_read_buf_block+0x79/0xc0 [xfs]
[ 1392.635398]  [<ffffffffa00aa274>] xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
[ 1392.637112]  [<ffffffffa00aac8f>] xfs_btree_lookup+0xcf/0x4b0 [xfs]
[ 1392.638594]  [<ffffffff81179573>] ? kmem_cache_alloc+0x163/0x1d0
[ 1392.640033]  [<ffffffffa0092e39>] xfs_alloc_lookup_eq+0x19/0x20 [xfs]
[ 1392.641556]  [<ffffffffa0093155>] xfs_alloc_fixup_trees+0x2a5/0x340 [xfs]
[ 1392.643156]  [<ffffffffa0094b8d>] xfs_alloc_ag_vextent_near+0x9ad/0xb60 [xfs]
[ 1392.644850]  [<ffffffffa0095c1d>] ? xfs_alloc_fix_freelist+0x3dd/0x470 [xfs]
[ 1392.646615]  [<ffffffffa00950a5>] xfs_alloc_ag_vextent+0xd5/0x100 [xfs]
[ 1392.648262]  [<ffffffffa0095f64>] xfs_alloc_vextent+0x2b4/0x600 [xfs]
[ 1392.649855]  [<ffffffffa00a4c48>] xfs_bmap_btalloc+0x388/0x750 [xfs]
[ 1392.651412]  [<ffffffffa00a86e8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
[ 1392.652917]  [<ffffffffa00a5034>] xfs_bmap_alloc+0x24/0x40 [xfs]
[ 1392.654343]  [<ffffffffa00a7742>] xfs_bmapi_write+0x5a2/0xa20 [xfs]
[ 1392.655846]  [<ffffffffa009e8d0>] ? xfs_bmap_last_offset+0x50/0xc0 [xfs]
[ 1392.657431]  [<ffffffffa00df6fe>] xfs_iomap_write_allocate+0x14e/0x380 [xfs]
[ 1392.659110]  [<ffffffffa00cbb79>] xfs_map_blocks+0x139/0x220 [xfs]
[ 1392.660576]  [<ffffffffa00cc7f6>] xfs_vm_writepage+0x186/0x5a0 [xfs]
[ 1392.662193]  [<ffffffff81130b47>] __writepage+0x17/0x40
[ 1392.663635]  [<ffffffff81131d57>] write_cache_pages+0x247/0x510
[ 1392.665121]  [<ffffffff81130b30>] ? set_page_dirty+0x60/0x60
[ 1392.666503]  [<ffffffff81132071>] generic_writepages+0x51/0x80
[ 1392.667966]  [<ffffffffa00cba03>] xfs_vm_writepages+0x53/0x70 [xfs]
[ 1392.669445]  [<ffffffff811320c0>] do_writepages+0x20/0x40
[ 1392.670730]  [<ffffffff811b1569>] __writeback_single_inode+0x49/0x2e0
[ 1392.672267]  [<ffffffff8108c80f>] ? wake_up_bit+0x2f/0x40
[ 1392.673681]  [<ffffffff811b1c3a>] writeback_sb_inodes+0x28a/0x4e0
[ 1392.675167]  [<ffffffff811b1f2e>] __writeback_inodes_wb+0x9e/0xd0
[ 1392.676661]  [<ffffffff811b215b>] wb_writeback+0x1fb/0x2c0
[ 1392.678000]  [<ffffffff811b22a1>] wb_do_writeback+0x81/0x1f0
[ 1392.679492]  [<ffffffff81086f3b>] ? pick_next_task_fair+0x40b/0x550
[ 1392.681076]  [<ffffffff811b2480>] bdi_writeback_workfn+0x70/0x200
[ 1392.682564]  [<ffffffff81076961>] ? dequeue_task+0x61/0x90
[ 1392.683902]  [<ffffffff8106976a>] process_one_work+0x13a/0x420
[ 1392.685287]  [<ffffffff81069b73>] worker_thread+0x123/0x4f0
[ 1392.686610]  [<ffffffff81069a50>] ? process_one_work+0x420/0x420
[ 1392.688045]  [<ffffffff81069a50>] ? process_one_work+0x420/0x420
[ 1392.689466]  [<ffffffff8106f0ae>] kthread+0xce/0xf0
[ 1392.690630]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1392.692329]  [<ffffffff814d4c88>] ret_from_fork+0x58/0x90
[ 1392.693617]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
(...snipped...)
[ 1714.888104] sysrq: SysRq : Show Blocked State
[ 1714.890786]   task                        PC stack   pid father
[ 1714.894315] kswapd0         D ffff88007c3a3728     0    40      2 0x00000000
[ 1714.898427]  ffff88007c3a3728 ffff88007c5514b0 ffff88007cddac00 0000000000000008
[ 1714.902908]  0000000000000286 ffff88007ac35a42 ffff88007c3a0010 ffff88007c11fdc0
[ 1714.907032]  ffff880037af61c0 00000000000009cc 00000000000147a0 ffff88007c3a3748
[ 1714.909212] Call Trace:
[ 1714.909909]  [<ffffffff814d1aee>] schedule+0x3e/0x90
[ 1714.911283]  [<ffffffffa00f4377>] xlog_grant_head_wait+0xb7/0x1c0 [xfs]
[ 1714.913067]  [<ffffffffa00f4546>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[ 1714.914850]  [<ffffffffa00f4642>] xfs_log_reserve+0xe2/0x220 [xfs]
[ 1714.916496]  [<ffffffffa00efc85>] xfs_trans_reserve+0x1e5/0x220 [xfs]
[ 1714.918216]  [<ffffffffa00efe9a>] ? _xfs_trans_alloc+0x3a/0xa0 [xfs]
[ 1714.919904]  [<ffffffffa00ca76a>] xfs_setfilesize_trans_alloc+0x4a/0xb0 [xfs]
[ 1714.921796]  [<ffffffffa00ccb15>] xfs_vm_writepage+0x4a5/0x5a0 [xfs]
[ 1714.923484]  [<ffffffff81139aac>] shrink_page_list+0x43c/0x9d0
[ 1714.925037]  [<ffffffff8113a6c5>] shrink_inactive_list+0x275/0x500
[ 1714.926706]  [<ffffffff812565c0>] ? radix_tree_gang_lookup_tag+0x90/0xd0
[ 1714.928511]  [<ffffffff8113b2e1>] shrink_lruvec+0x641/0x730
[ 1714.930028]  [<ffffffff8108098a>] ? set_next_entity+0x2a/0x60
[ 1714.931562]  [<ffffffff810aeaac>] ? lock_timer_base+0x3c/0x70
[ 1714.933089]  [<ffffffff810aed93>] ? try_to_del_timer_sync+0x53/0x70
[ 1714.934781]  [<ffffffff8113b8c5>] shrink_zone+0x75/0x1b0
[ 1714.936197]  [<ffffffff8113c19e>] kswapd+0x4de/0x9a0
[ 1714.937549]  [<ffffffff8113bcc0>] ? zone_reclaim+0x2c0/0x2c0
[ 1714.939067]  [<ffffffff8113bcc0>] ? zone_reclaim+0x2c0/0x2c0
[ 1714.940593]  [<ffffffff8106f0ae>] kthread+0xce/0xf0
[ 1714.941908]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1714.943773]  [<ffffffff814d4c88>] ret_from_fork+0x58/0x90
[ 1714.945216]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1714.947069] kworker/u16:29  D ffff88007b28e8e8     0   440      2 0x00000000
[ 1714.949024] Workqueue: writeback bdi_writeback_workfn (flush-8:0)
[ 1714.950742]  ffff88007b28e8e8 ffff88007c9288c0 ffff88007b27aa80 ffff88007b28e8e8
[ 1714.953192]  ffffffff810aed93 0000000000000002 ffff88007b28c010 ffff88007d1b8000
[ 1714.955407]  0000000100159847 ffff88007d1b8000 00000001001597e3 ffff88007b28e908
[ 1714.957577] Call Trace:
[ 1714.958269]  [<ffffffff810aed93>] ? try_to_del_timer_sync+0x53/0x70
[ 1714.959931]  [<ffffffff814d1aee>] schedule+0x3e/0x90
[ 1714.961291]  [<ffffffff814d3dc9>] schedule_timeout+0xf9/0x1a0
[ 1714.962838]  [<ffffffff810aefd0>] ? add_timer_on+0xd0/0xd0
[ 1714.964305]  [<ffffffff814d0faa>] io_schedule_timeout+0xaa/0x130
[ 1714.965903]  [<ffffffff811447af>] congestion_wait+0x7f/0x100
[ 1714.967906]  [<ffffffff8108c170>] ? woken_wake_function+0x20/0x20
[ 1714.969604]  [<ffffffff8113a8f4>] shrink_inactive_list+0x4a4/0x500
[ 1714.971251]  [<ffffffff8113b2e1>] shrink_lruvec+0x641/0x730
[ 1714.972743]  [<ffffffff8114d850>] ? list_lru_count_one+0x20/0x30
[ 1714.974338]  [<ffffffff8113b8c5>] shrink_zone+0x75/0x1b0
[ 1714.975801]  [<ffffffff8113c7f3>] do_try_to_free_pages+0x193/0x340
[ 1714.977471]  [<ffffffff8113caf7>] try_to_free_pages+0xb7/0x140
[ 1714.979048]  [<ffffffff811305bf>] __alloc_pages_nodemask+0x58f/0x9a0
[ 1714.980734]  [<ffffffff81170d87>] alloc_pages_current+0xa7/0x170
[ 1714.982347]  [<ffffffffa00d2a05>] xfs_buf_allocate_memory+0x1a5/0x290 [xfs]
[ 1714.984239]  [<ffffffffa00d3fe0>] xfs_buf_get_map+0x130/0x180 [xfs]
[ 1714.985925]  [<ffffffffa00d4060>] xfs_buf_read_map+0x30/0x100 [xfs]
[ 1714.987611]  [<ffffffffa00fffb9>] xfs_trans_read_buf_map+0xd9/0x300 [xfs]
[ 1714.989421]  [<ffffffffa00aa029>] xfs_btree_read_buf_block+0x79/0xc0 [xfs]
[ 1714.991257]  [<ffffffffa00aa274>] xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
[ 1714.993342]  [<ffffffffa00aac8f>] xfs_btree_lookup+0xcf/0x4b0 [xfs]
[ 1714.995038]  [<ffffffff81179573>] ? kmem_cache_alloc+0x163/0x1d0
[ 1714.996655]  [<ffffffffa0092e39>] xfs_alloc_lookup_eq+0x19/0x20 [xfs]
[ 1714.998429]  [<ffffffffa0093155>] xfs_alloc_fixup_trees+0x2a5/0x340 [xfs]
[ 1715.000242]  [<ffffffffa0094b8d>] xfs_alloc_ag_vextent_near+0x9ad/0xb60 [xfs]
[ 1715.002137]  [<ffffffffa0095c1d>] ? xfs_alloc_fix_freelist+0x3dd/0x470 [xfs]
[ 1715.004003]  [<ffffffffa00950a5>] xfs_alloc_ag_vextent+0xd5/0x100 [xfs]
[ 1715.005761]  [<ffffffffa0095f64>] xfs_alloc_vextent+0x2b4/0x600 [xfs]
[ 1715.007482]  [<ffffffffa00a4c48>] xfs_bmap_btalloc+0x388/0x750 [xfs]
[ 1715.009177]  [<ffffffffa00a86e8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
[ 1715.010896]  [<ffffffffa00a5034>] xfs_bmap_alloc+0x24/0x40 [xfs]
[ 1715.012551]  [<ffffffffa00a7742>] xfs_bmapi_write+0x5a2/0xa20 [xfs]
[ 1715.014248]  [<ffffffffa009e8d0>] ? xfs_bmap_last_offset+0x50/0xc0 [xfs]
[ 1715.016071]  [<ffffffffa00df6fe>] xfs_iomap_write_allocate+0x14e/0x380 [xfs]
[ 1715.017965]  [<ffffffffa00cbb79>] xfs_map_blocks+0x139/0x220 [xfs]
[ 1715.019631]  [<ffffffffa00cc7f6>] xfs_vm_writepage+0x186/0x5a0 [xfs]
[ 1715.021345]  [<ffffffff81130b47>] __writepage+0x17/0x40
[ 1715.022762]  [<ffffffff81131d57>] write_cache_pages+0x247/0x510
[ 1715.024370]  [<ffffffff81130b30>] ? set_page_dirty+0x60/0x60
[ 1715.025885]  [<ffffffff81132071>] generic_writepages+0x51/0x80
[ 1715.027614]  [<ffffffffa00cba03>] xfs_vm_writepages+0x53/0x70 [xfs]
[ 1715.029179]  [<ffffffff811320c0>] do_writepages+0x20/0x40
[ 1715.030559]  [<ffffffff811b1569>] __writeback_single_inode+0x49/0x2e0
[ 1715.032154]  [<ffffffff8108c80f>] ? wake_up_bit+0x2f/0x40
[ 1715.033498]  [<ffffffff811b1c3a>] writeback_sb_inodes+0x28a/0x4e0
[ 1715.035025]  [<ffffffff811b1f2e>] __writeback_inodes_wb+0x9e/0xd0
[ 1715.036551]  [<ffffffff811b215b>] wb_writeback+0x1fb/0x2c0
[ 1715.037916]  [<ffffffff811b22a1>] wb_do_writeback+0x81/0x1f0
[ 1715.039317]  [<ffffffff81086f3b>] ? pick_next_task_fair+0x40b/0x550
[ 1715.040897]  [<ffffffff811b2480>] bdi_writeback_workfn+0x70/0x200
[ 1715.042415]  [<ffffffff81076961>] ? dequeue_task+0x61/0x90
[ 1715.043798]  [<ffffffff8106976a>] process_one_work+0x13a/0x420
[ 1715.045260]  [<ffffffff81069b73>] worker_thread+0x123/0x4f0
[ 1715.046665]  [<ffffffff81069a50>] ? process_one_work+0x420/0x420
[ 1715.048154]  [<ffffffff81069a50>] ? process_one_work+0x420/0x420
[ 1715.049659]  [<ffffffff8106f0ae>] kthread+0xce/0xf0
[ 1715.050880]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1715.052608]  [<ffffffff814d4c88>] ret_from_fork+0x58/0x90
[ 1715.053958]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
----------

I do want hints like http://www.spinics.net/lists/linux-mm/msg81409.html for
guessing whether forward progress is being made. I'm fine with enabling
such hints via CONFIG_DEBUG_something.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
@ 2015-03-19 11:03                     ` Tetsuo Handa
  0 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-03-19 11:03 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, hannes, david, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

Michal Hocko wrote:
> > So, your patch introduces a trigger to involve OOM killer for !__GFP_FS
> > allocation. I myself think that we should trigger OOM killer for !__GFP_FS
> > allocation in order to make forward progress in case the OOM victim is blocked.
> > What is the reason we did not involve OOM killer for !__GFP_FS allocation?
> 
> Because the reclaim context for these allocations is very restricted. We
> might have a lot of cache which needs to be written down before it will
> be reclaimed. If we triggered OOM from this path we would see a lot of
> pre-mature OOM killers triggered.

I see. I was worried that the reason was related to possible deadlocks.

Not giving up waiting for cache which _needs to be_ written down before
it can be reclaimed (sysctl_nr_alloc_retry == ULONG_MAX) is causing the
system lockups we are seeing, isn't it?

Giving up waiting for cache which _needs to be_ written down before
it can be reclaimed (sysctl_nr_alloc_retry == 1) is also causing
a lot of premature page allocation failures I'm seeing, isn't it?

    /*
     * If we fail to make progress by freeing individual
     * pages, but the allocation wants us to keep going,
     * start OOM killing tasks.
     */
    if (!did_some_progress) {
            page = __alloc_pages_may_oom(gfp_mask, order, ac,
                                            &did_some_progress);
            if (page)
                    goto got_pg;
            if (!did_some_progress)
                    goto nopage;

            nr_retries++;
    }
    /* Wait for some write requests to complete then retry */
    wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
    goto retry;

If we can somehow tell that there is no more cache which _can be_
written down before it is reclaimed, we don't need sysctl_nr_alloc_retry
at all, and we can trigger the OOM killer immediately, right?
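That idea can be modelled, very roughly, as a predicate (all names below are
invented for illustration; the real kernel would have to derive this from
per-zone vmstat counters, and doing so race-free is the hard part):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of "tell whether retrying can still help".
 * Once there is neither clean reclaimable cache nor dirty cache that
 * writeback could still turn into clean cache, further retries cannot
 * make progress, so the OOM killer could be invoked without consulting
 * any retry counter. */
struct reclaim_state_sim {
	long clean_file_pages;	/* reclaimable right now */
	long dirty_file_pages;	/* reclaimable once writeback completes */
};

static bool no_progress_possible(const struct reclaim_state_sim *s)
{
	return s->clean_file_pages == 0 && s->dirty_file_pages == 0;
}
```

With such a predicate the allocator would fail (or OOM-kill) exactly when
reclaim is provably stuck, instead of after an arbitrary number of retries.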



Today's stress testing found another problem your patch does not handle.
If a page fault raises SIGBUS while the first OOM victim cannot be
terminated due to a mutex_lock() dependency, the process which triggered
the page fault will likely be killed. If that process is the global init,
a kernel panic is triggered, as shown below.

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150319-1.txt.xz )
----------
[ 1277.833918] a.out           D ffff880066503bc8     0  8374   5141 0x00000080
[ 1277.835930]  ffff880066503bc8 ffff88007d1180d0 ffff8800664fd070 ffff880066503bc8
[ 1277.838052]  ffffffff8122a0eb ffff88007b8edc20 ffff880066500010 ffff88007fc93740
[ 1277.840102]  7fffffffffffffff ffff880066503d20 0000000000000002 ffff880066503be8
[ 1277.842128] Call Trace:
[ 1277.842766]  [<ffffffff8122a0eb>] ? blk_peek_request+0x8b/0x2a0
[ 1277.844222]  [<ffffffff814d1aee>] schedule+0x3e/0x90
[ 1277.845459]  [<ffffffff814d3dfd>] schedule_timeout+0x12d/0x1a0
[ 1277.846885]  [<ffffffff810b5b16>] ? ktime_get+0x46/0xb0
[ 1277.848189]  [<ffffffff814d0faa>] io_schedule_timeout+0xaa/0x130
[ 1277.849702]  [<ffffffff8108c610>] ? prepare_to_wait+0x60/0x90
[ 1277.851173]  [<ffffffff814d1d90>] ? bit_wait_io_timeout+0x80/0x80
[ 1277.852661]  [<ffffffff814d1dc6>] bit_wait_io+0x36/0x50
[ 1277.853946]  [<ffffffff814d2125>] __wait_on_bit+0x65/0x90
[ 1277.855005] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 1277.855083] ata4: EH complete
[ 1277.855174] sd 4:0:0:0: [sdb] tag#27 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855175] sd 4:0:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855177] sd 4:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855178] sd 4:0:0:0: [sdb] tag#21 CDB: Read(10) 28 00 05 33 29 37 00 00 08 00
[ 1277.855179] sd 4:0:0:0: [sdb] tag#27 CDB: Read(10) 28 00 05 33 53 6f 00 00 08 00
[ 1277.855179] sd 4:0:0:0: [sdb] tag#18 CDB: Write(10) 2a 00 05 34 be 67 00 00 08 00
[ 1277.855183] blk_update_request: I/O error, dev sdb, sector 87249775
[ 1277.855256] sd 4:0:0:0: [sdb] tag#22 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855257] sd 4:0:0:0: [sdb] tag#28 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855258] sd 4:0:0:0: [sdb] tag#22 CDB: Read(10) 28 00 05 33 4e 2f 00 00 08 00
[ 1277.855259] sd 4:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855260] blk_update_request: I/O error, dev sdb, sector 87248431
[ 1277.855260] sd 4:0:0:0: [sdb] tag#28 CDB: Write(10) 2a 00 05 33 5a 27 00 00 18 00
[ 1277.855261] sd 4:0:0:0: [sdb] tag#19 CDB: Read(10) 28 00 05 34 ba a7 00 00 08 00
[ 1277.855261] blk_update_request: I/O error, dev sdb, sector 87251495
[ 1277.855262] blk_update_request: I/O error, dev sdb, sector 87341735
[ 1277.855319] sd 4:0:0:0: [sdb] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855320] sd 4:0:0:0: [sdb] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855320] sd 4:0:0:0: [sdb] tag#20 CDB: Write(10) 2a 00 05 33 4d 37 00 00 08 00
[ 1277.855321] sd 4:0:0:0: [sdb] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855322] blk_update_request: I/O error, dev sdb, sector 87248183
[ 1277.855322] sd 4:0:0:0: [sdb] tag#29 CDB: Write(10) 2a 00 05 34 c1 2f 00 00 10 00
[ 1277.855323] sd 4:0:0:0: [sdb] tag#24 CDB: Read(10) 28 00 05 33 52 9f 00 00 08 00
[ 1277.855323] blk_update_request: I/O error, dev sdb, sector 87343407
[ 1277.855324] blk_update_request: I/O error, dev sdb, sector 87249567
[ 1277.855373] sd 4:0:0:0: [sdb] tag#30 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 1277.855374] blk_update_request: I/O error, dev sdb, sector 87343167
[ 1277.855374] sd 4:0:0:0: [sdb] tag#30 CDB: Read(10) 28 00 05 33 23 af 00 00 08 00
[ 1277.855375] blk_update_request: I/O error, dev sdb, sector 87237551
[ 1277.855376] blk_update_request: I/O error, dev sdb, sector 87250175
[ 1277.855728] Buffer I/O error on dev sdb1, logical block 1069738, lost async page write
[ 1277.855736] Buffer I/O error on dev sdb1, logical block 10917863, lost async page write
[ 1277.855739] Buffer I/O error on dev sdb1, logical block 10917864, lost async page write
[ 1277.855741] Buffer I/O error on dev sdb1, logical block 10917885, lost async page write
[ 1277.855744] Buffer I/O error on dev sdb1, logical block 11933189, lost async page write
[ 1277.855749] Buffer I/O error on dev sdb1, logical block 10917840, lost async page write
[ 1277.855768] Buffer I/O error on dev sdb1, logical block 10917829, lost async page write
[ 1277.856003] Buffer I/O error on dev sdb1, logical block 10906429, lost async page write
[ 1277.856008] Buffer I/O error on dev sdb1, logical block 10906430, lost async page write
[ 1277.856011] Buffer I/O error on dev sdb1, logical block 10906431, lost async page write
[ 1277.856847] XFS (sdb1): metadata I/O error: block 0x50080d0 ("xlog_iodone") error 5 numblks 64
[ 1277.856850] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa00f31a9
[ 1277.857054] XFS (sdb1): Log I/O Error Detected.  Shutting down filesystem
[ 1277.857055] XFS (sdb1): Please umount the filesystem and rectify the problem(s)
[ 1277.858225] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 1 0 0 1426725983 e pipe failed
[ 1277.858320] Core dump to |/usr/libexec/abrt-hook-ccpp 7 16777216 4995 0 0 1426725983 e pipe failed
[ 1277.858385] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000007

[ 1277.858387] CPU: 1 PID: 1 Comm: init Tainted: G            E   4.0.0-rc4+ #15
[ 1277.858388] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 1277.858390]  ffff880037baf740 ffff88007d07bc38 ffffffff814d0ee5 000000000000fffe
[ 1277.858391]  ffffffff81701f18 ffff88007d07bcb8 ffffffff814d0c6c ffffffff00000010
[ 1277.858392]  ffff88007d07bcc8 ffff88007d07bc68 0000000000000008 ffff88007d07bcb8
[ 1277.858392] Call Trace:
[ 1277.858398]  [<ffffffff814d0ee5>] dump_stack+0x48/0x5b
[ 1277.858400]  [<ffffffff814d0c6c>] panic+0xbb/0x1fa
[ 1277.858403]  [<ffffffff81055871>] do_exit+0xb51/0xb90
[ 1277.858404]  [<ffffffff81055901>] do_group_exit+0x51/0xc0
[ 1277.858406]  [<ffffffff81061dd2>] get_signal+0x222/0x590
[ 1277.858408]  [<ffffffff81002496>] do_signal+0x36/0x710
[ 1277.858411]  [<ffffffff810461d0>] ? mm_fault_error+0xd0/0x160
[ 1277.858413]  [<ffffffff8104661b>] ? __do_page_fault+0x3bb/0x430
[ 1277.858414]  [<ffffffff81002bb8>] do_notify_resume+0x48/0x60
[ 1277.858416]  [<ffffffff814d59a7>] retint_signal+0x41/0x7a
----------

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150319-2.txt.xz )
----------
[ 2822.642453] scsi_io_completion: 21 callbacks suppressed
[ 2822.644049] sd 4:0:0:0: [sdb] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.646453] sd 4:0:0:0: [sdb] tag#10 CDB: Read(10) 28 00 05 32 28 ff 00 00 08 00
[ 2822.648630] blk_update_request: 21 callbacks suppressed
[ 2822.648631] blk_update_request: I/O error, dev sdb, sector 87173375
[ 2822.648663] sd 4:0:0:0: [sdb] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648665] sd 4:0:0:0: [sdb] tag#11 CDB: Write(10) 2a 00 05 32 25 6f 00 00 08 00
[ 2822.648665] blk_update_request: I/O error, dev sdb, sector 87172463
[ 2822.648676] sd 4:0:0:0: [sdb] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648677] sd 4:0:0:0: [sdb] tag#12 CDB: Read(10) 28 00 05 32 1e 8f 00 00 08 00
[ 2822.648678] blk_update_request: I/O error, dev sdb, sector 87170703
[ 2822.648700] sd 4:0:0:0: [sdb] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648701] sd 4:0:0:0: [sdb] tag#13 CDB: Write(10) 2a 00 05 32 35 17 00 00 08 00
[ 2822.648701] blk_update_request: I/O error, dev sdb, sector 87176471
[ 2822.648711] sd 4:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648712] sd 4:0:0:0: [sdb] tag#14 CDB: Write(10) 2a 00 05 32 8c 77 00 00 08 00
[ 2822.648713] blk_update_request: I/O error, dev sdb, sector 87198839
[ 2822.648722] sd 4:0:0:0: [sdb] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648723] sd 4:0:0:0: [sdb] tag#15 CDB: Read(10) 28 00 05 0b fb 8f 00 00 08 00
[ 2822.648723] blk_update_request: I/O error, dev sdb, sector 84671375
[ 2822.648742] sd 4:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648743] sd 4:0:0:0: [sdb] tag#16 CDB: Read(10) 28 00 05 33 3e f7 00 00 08 00
[ 2822.648744] blk_update_request: I/O error, dev sdb, sector 87244535
[ 2822.648753] sd 4:0:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648754] sd 4:0:0:0: [sdb] tag#17 CDB: Write(10) 2a 00 05 d1 eb 77 00 00 08 00
[ 2822.648755] blk_update_request: I/O error, dev sdb, sector 97643383
[ 2822.648759] sd 4:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648760] sd 4:0:0:0: [sdb] tag#19 CDB: Read(10) 28 00 05 32 f9 7f 00 00 08 00
[ 2822.648760] blk_update_request: I/O error, dev sdb, sector 87226751
[ 2822.648778] sd 4:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2822.648780] sd 4:0:0:0: [sdb] tag#18 CDB: Read(10) 28 00 05 32 e3 37 00 00 08 00
[ 2822.648780] blk_update_request: I/O error, dev sdb, sector 87221047
[ 2822.649830] buffer_io_error: 122 callbacks suppressed
[ 2822.649832] Buffer I/O error on dev sdb1, logical block 10896550, lost async page write
[ 2822.649850] Buffer I/O error on dev sdb1, logical block 10897051, lost async page write
[ 2822.649864] Buffer I/O error on dev sdb1, logical block 10899847, lost async page write
[ 2822.649878] Buffer I/O error on dev sdb1, logical block 12205415, lost async page write
[ 2822.649893] Buffer I/O error on dev sdb1, logical block 10903400, lost async page write
[ 2822.649900] Buffer I/O error on dev sdb1, logical block 10905034, lost async page write
[ 2822.649902] Buffer I/O error on dev sdb1, logical block 10905077, lost async page write
[ 2822.649908] Buffer I/O error on dev sdb1, logical block 10900244, lost async page write
[ 2822.649910] Buffer I/O error on dev sdb1, logical block 10901263, lost async page write
[ 2822.649915] Buffer I/O error on dev sdb1, logical block 10899976, lost async page write
[ 2822.649920] XFS (sdb1): metadata I/O error: block 0x50046c8 ("xlog_iodone") error 5 numblks 64
[ 2822.649924] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa00f31a9
[ 2822.650440] XFS (sdb1): Log I/O Error Detected.  Shutting down filesystem
[ 2822.650440] XFS (sdb1): Please umount the filesystem and rectify the problem(s)
[ 2822.650444] XFS (sdb1): metadata I/O error: block 0x5004701 ("xlog_iodone") error 5 numblks 64
[ 2822.650445] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa00f31a9
[ 2822.650446] XFS (sdb1): metadata I/O error: block 0x5004741 ("xlog_iodone") error 5 numblks 64
[ 2822.650447] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa00f31a9
[ 2822.650819] XFS (sdb1): xfs_log_force: error -5 returned.
[ 2822.676108] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 2233 0 0 1426728845 e pipe failed
[ 2822.676268] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 1847 0 0 1426728845 e pipe failed
[ 2822.761872] XFS (sdb1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[ 2822.814996] XFS (sdb1): xfs_log_force: error -5 returned.
[ 2823.210912] audit: *NO* daemon at audit_pid=1847
[ 2823.212289] audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=320
[ 2823.214238] audit: auditd disappeared
[ 2823.215419] audit: type=1701 audit(1426728846.212:69): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2196 comm="master" exe="/usr/libexec/postfix/master" sig=7
[ 2823.219662] audit: type=1701 audit(1426728846.214:70): auid=4294967295 uid=89 gid=89 ses=4294967295 pid=9984 comm="pickup" exe="/usr/libexec/postfix/pickup" sig=7
[ 2823.228854] audit: type=1701 audit(1426728846.229:71): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1880 comm="rsyslogd" exe="/sbin/rsyslogd" sig=7
[ 2823.232849] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 1877 0 0 1426728846 e pipe failed
[ 2823.240671] audit: type=1701 audit(1426728846.241:72): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2265 comm="smbd" exe="/usr/sbin/smbd" sig=7
[ 2823.244547] Core dump to |/usr/libexec/abrt-hook-ccpp 7 16777216 2265 0 0 1426728846 e pipe failed
[ 2823.247697] audit: type=1701 audit(1426728846.248:73): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2242 comm="smbd" exe="/usr/sbin/smbd" sig=7
[ 2823.252653] Core dump to |/usr/libexec/abrt-hook-ccpp 7 16777216 2242 0 0 1426728846 e pipe failed
[ 2823.263635] audit: type=1701 audit(1426728846.264:74): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1801 comm="dhclient" exe="/sbin/dhclient" sig=7
[ 2823.267442] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 1801 0 0 1426728846 e pipe failed
[ 2848.437629] audit: type=1701 audit(1426728871.443:75): auid=0 uid=0 gid=0 ses=5 pid=10052 comm="bash" exe="/bin/bash" sig=7
[ 2848.444223] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 10052 0 0 1426728871 e pipe failed
[ 2848.449033] audit: type=1701 audit(1426728871.454:76): auid=0 uid=0 gid=0 ses=5 pid=9958 comm="login" exe="/bin/login" sig=7
[ 2848.455734] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 9958 0 0 1426728871 e pipe failed
[ 2848.460683] audit: type=1701 audit(1426728871.466:77): auid=4294967295 uid=81 gid=81 ses=4294967295 pid=2048 comm="dbus-daemon" exe="/bin/dbus-daemon" sig=7
[ 2848.464454] audit: type=1701 audit(1426728871.470:78): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1 comm="init" exe="/sbin/init" sig=7
[ 2848.464577] audit: type=1701 audit(1426728871.470:79): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=9986 comm="console-kit-dae" exe="/usr/sbin/console-kit-daemon" sig=7
[ 2848.464578] audit: type=1701 audit(1426728871.470:80): auid=4294967295 uid=70 gid=70 ses=4294967295 pid=2060 comm="avahi-daemon" exe="/usr/sbin/avahi-daemon" sig=7
[ 2848.465160] audit: type=1701 audit(1426728871.470:81): auid=4294967295 uid=70 gid=70 ses=4294967295 pid=2061 comm="avahi-daemon" exe="/usr/sbin/avahi-daemon" sig=7
[ 2848.476649] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 9986 0 0 1426728871 e pipe failed
[ 2848.481923] Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 1 0 0 1426728871 e pipe failed
[ 2848.484090] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000007
[ 2848.484090] 
[ 2848.486554] CPU: 2 PID: 1 Comm: init Tainted: G            E   4.0.0-rc4+ #15
[ 2848.488377] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 2848.491120]  ffff880037be4ac0 ffff88007d07bc38 ffffffff814d0ee5 000000000000fffe
[ 2848.493549]  ffffffff81701f18 ffff88007d07bcb8 ffffffff814d0c6c ffffffff00000010
[ 2848.495723]  ffff88007d07bcc8 ffff88007d07bc68 0000000000000008 ffff88007d07bcb8
[ 2848.498043] Call Trace:
[ 2848.498738]  [<ffffffff814d0ee5>] dump_stack+0x48/0x5b
[ 2848.500114]  [<ffffffff814d0c6c>] panic+0xbb/0x1fa
[ 2848.501394]  [<ffffffff81055871>] do_exit+0xb51/0xb90
[ 2848.502710]  [<ffffffff81055901>] do_group_exit+0x51/0xc0
[ 2848.504128]  [<ffffffff81061dd2>] get_signal+0x222/0x590
[ 2848.505509]  [<ffffffff81002496>] do_signal+0x36/0x710
[ 2848.506848]  [<ffffffff810461d0>] ? mm_fault_error+0xd0/0x160
[ 2848.508421]  [<ffffffff8104661b>] ? __do_page_fault+0x3bb/0x430
[ 2848.509966]  [<ffffffff81002bb8>] do_notify_resume+0x48/0x60
[ 2848.511435]  [<ffffffff814d59a7>] retint_signal+0x41/0x7a
----------

Innocent (possibly critical) processes can be killed unexpectedly, instead of
returning -ENOMEM to system calls or NULL to e.g. kmalloc() users.
This is much worse than choosing a second OOM victim upon timeout.

Why not change each caller to use either __GFP_NOFAIL or __GFP_NORETRY,
rather than introduce a global sysctl_nr_alloc_retry which unconditionally
allows small allocations to fail?



I also found yet another problem. A kernel worker thread seems to be stalled
forever at bdi_writeback_workfn() => xfs_vm_writepage() =>
xfs_buf_allocate_memory() => alloc_pages_current() => shrink_inactive_list()
=> congestion_wait(), while kswapd0 seems to be stalled forever at
shrink_inactive_list() => xfs_vm_writepage() => xlog_grant_head_wait().

(From http://I-love.SAKURA.ne.jp/tmp/serial-20150319-3.txt.xz )
----------
[ 1392.529392] sysrq: SysRq : Show Blocked State
[ 1392.532232]   task                        PC stack   pid father
[ 1392.535229] kswapd0         D ffff88007c3a3728     0    40      2 0x00000000
[ 1392.538916]  ffff88007c3a3728 ffff88007c5514b0 ffff88007cddac00 0000000000000008
[ 1392.543034]  0000000000000286 ffff88007ac35a42 ffff88007c3a0010 ffff88007c11fdc0
[ 1392.547234]  ffff880037af61c0 00000000000009cc 00000000000147a0 ffff88007c3a3748
[ 1392.551350] Call Trace:
[ 1392.552673]  [<ffffffff814d1aee>] schedule+0x3e/0x90
[ 1392.555285]  [<ffffffffa00f4377>] xlog_grant_head_wait+0xb7/0x1c0 [xfs]
[ 1392.558611]  [<ffffffffa00f4546>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[ 1392.561938]  [<ffffffffa00f4642>] xfs_log_reserve+0xe2/0x220 [xfs]
[ 1392.565083]  [<ffffffffa00efc85>] xfs_trans_reserve+0x1e5/0x220 [xfs]
[ 1392.568312]  [<ffffffffa00efe9a>] ? _xfs_trans_alloc+0x3a/0xa0 [xfs]
[ 1392.570965]  [<ffffffffa00ca76a>] xfs_setfilesize_trans_alloc+0x4a/0xb0 [xfs]
[ 1392.572670]  [<ffffffffa00ccb15>] xfs_vm_writepage+0x4a5/0x5a0 [xfs]
[ 1392.574173]  [<ffffffff81139aac>] shrink_page_list+0x43c/0x9d0
[ 1392.575566]  [<ffffffff8113a6c5>] shrink_inactive_list+0x275/0x500
[ 1392.577084]  [<ffffffff812565c0>] ? radix_tree_gang_lookup_tag+0x90/0xd0
[ 1392.578657]  [<ffffffff8113b2e1>] shrink_lruvec+0x641/0x730
[ 1392.580093]  [<ffffffff8108098a>] ? set_next_entity+0x2a/0x60
[ 1392.581516]  [<ffffffff810aeaac>] ? lock_timer_base+0x3c/0x70
[ 1392.582915]  [<ffffffff810aed93>] ? try_to_del_timer_sync+0x53/0x70
[ 1392.584408]  [<ffffffff8113b8c5>] shrink_zone+0x75/0x1b0
[ 1392.585669]  [<ffffffff8113c19e>] kswapd+0x4de/0x9a0
[ 1392.586967]  [<ffffffff8113bcc0>] ? zone_reclaim+0x2c0/0x2c0
[ 1392.588363]  [<ffffffff8113bcc0>] ? zone_reclaim+0x2c0/0x2c0
[ 1392.589738]  [<ffffffff8106f0ae>] kthread+0xce/0xf0
[ 1392.590967]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1392.592629]  [<ffffffff814d4c88>] ret_from_fork+0x58/0x90
[ 1392.593909]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1392.595582] kworker/u16:29  D ffff88007b28e8e8     0   440      2 0x00000000
[ 1392.597428] Workqueue: writeback bdi_writeback_workfn (flush-8:0)
[ 1392.598998]  ffff88007b28e8e8 ffff88007c9288c0 ffff88007b27aa80 ffff88007b28e8e8
[ 1392.600983]  ffffffff810aed93 ffff88007fc93740 ffff88007b28c010 ffff88007d1b8000
[ 1392.603045]  000000010010ac98 ffff88007d1b8000 000000010010ac34 ffff88007b28e908
[ 1392.605052] Call Trace:
[ 1392.605698]  [<ffffffff810aed93>] ? try_to_del_timer_sync+0x53/0x70
[ 1392.607218]  [<ffffffff814d1aee>] schedule+0x3e/0x90
[ 1392.608421]  [<ffffffff814d3dc9>] schedule_timeout+0xf9/0x1a0
[ 1392.609785]  [<ffffffff810aefd0>] ? add_timer_on+0xd0/0xd0
[ 1392.611093]  [<ffffffff814d0faa>] io_schedule_timeout+0xaa/0x130
[ 1392.612599]  [<ffffffff811447af>] congestion_wait+0x7f/0x100
[ 1392.614043]  [<ffffffff8108c170>] ? woken_wake_function+0x20/0x20
[ 1392.615530]  [<ffffffff8113a8f4>] shrink_inactive_list+0x4a4/0x500
[ 1392.617051]  [<ffffffff8113b2e1>] shrink_lruvec+0x641/0x730
[ 1392.618446]  [<ffffffff8114d850>] ? list_lru_count_one+0x20/0x30
[ 1392.619926]  [<ffffffff8113b8c5>] shrink_zone+0x75/0x1b0
[ 1392.621223]  [<ffffffff8113c7f3>] do_try_to_free_pages+0x193/0x340
[ 1392.622680]  [<ffffffff8113caf7>] try_to_free_pages+0xb7/0x140
[ 1392.624074]  [<ffffffff811305bf>] __alloc_pages_nodemask+0x58f/0x9a0
[ 1392.625567]  [<ffffffff81170d87>] alloc_pages_current+0xa7/0x170
[ 1392.626992]  [<ffffffffa00d2a05>] xfs_buf_allocate_memory+0x1a5/0x290 [xfs]
[ 1392.628650]  [<ffffffffa00d3fe0>] xfs_buf_get_map+0x130/0x180 [xfs]
[ 1392.630281]  [<ffffffffa00d4060>] xfs_buf_read_map+0x30/0x100 [xfs]
[ 1392.631815]  [<ffffffffa00fffb9>] xfs_trans_read_buf_map+0xd9/0x300 [xfs]
[ 1392.633666]  [<ffffffffa00aa029>] xfs_btree_read_buf_block+0x79/0xc0 [xfs]
[ 1392.635398]  [<ffffffffa00aa274>] xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
[ 1392.637112]  [<ffffffffa00aac8f>] xfs_btree_lookup+0xcf/0x4b0 [xfs]
[ 1392.638594]  [<ffffffff81179573>] ? kmem_cache_alloc+0x163/0x1d0
[ 1392.640033]  [<ffffffffa0092e39>] xfs_alloc_lookup_eq+0x19/0x20 [xfs]
[ 1392.641556]  [<ffffffffa0093155>] xfs_alloc_fixup_trees+0x2a5/0x340 [xfs]
[ 1392.643156]  [<ffffffffa0094b8d>] xfs_alloc_ag_vextent_near+0x9ad/0xb60 [xfs]
[ 1392.644850]  [<ffffffffa0095c1d>] ? xfs_alloc_fix_freelist+0x3dd/0x470 [xfs]
[ 1392.646615]  [<ffffffffa00950a5>] xfs_alloc_ag_vextent+0xd5/0x100 [xfs]
[ 1392.648262]  [<ffffffffa0095f64>] xfs_alloc_vextent+0x2b4/0x600 [xfs]
[ 1392.649855]  [<ffffffffa00a4c48>] xfs_bmap_btalloc+0x388/0x750 [xfs]
[ 1392.651412]  [<ffffffffa00a86e8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
[ 1392.652917]  [<ffffffffa00a5034>] xfs_bmap_alloc+0x24/0x40 [xfs]
[ 1392.654343]  [<ffffffffa00a7742>] xfs_bmapi_write+0x5a2/0xa20 [xfs]
[ 1392.655846]  [<ffffffffa009e8d0>] ? xfs_bmap_last_offset+0x50/0xc0 [xfs]
[ 1392.657431]  [<ffffffffa00df6fe>] xfs_iomap_write_allocate+0x14e/0x380 [xfs]
[ 1392.659110]  [<ffffffffa00cbb79>] xfs_map_blocks+0x139/0x220 [xfs]
[ 1392.660576]  [<ffffffffa00cc7f6>] xfs_vm_writepage+0x186/0x5a0 [xfs]
[ 1392.662193]  [<ffffffff81130b47>] __writepage+0x17/0x40
[ 1392.663635]  [<ffffffff81131d57>] write_cache_pages+0x247/0x510
[ 1392.665121]  [<ffffffff81130b30>] ? set_page_dirty+0x60/0x60
[ 1392.666503]  [<ffffffff81132071>] generic_writepages+0x51/0x80
[ 1392.667966]  [<ffffffffa00cba03>] xfs_vm_writepages+0x53/0x70 [xfs]
[ 1392.669445]  [<ffffffff811320c0>] do_writepages+0x20/0x40
[ 1392.670730]  [<ffffffff811b1569>] __writeback_single_inode+0x49/0x2e0
[ 1392.672267]  [<ffffffff8108c80f>] ? wake_up_bit+0x2f/0x40
[ 1392.673681]  [<ffffffff811b1c3a>] writeback_sb_inodes+0x28a/0x4e0
[ 1392.675167]  [<ffffffff811b1f2e>] __writeback_inodes_wb+0x9e/0xd0
[ 1392.676661]  [<ffffffff811b215b>] wb_writeback+0x1fb/0x2c0
[ 1392.678000]  [<ffffffff811b22a1>] wb_do_writeback+0x81/0x1f0
[ 1392.679492]  [<ffffffff81086f3b>] ? pick_next_task_fair+0x40b/0x550
[ 1392.681076]  [<ffffffff811b2480>] bdi_writeback_workfn+0x70/0x200
[ 1392.682564]  [<ffffffff81076961>] ? dequeue_task+0x61/0x90
[ 1392.683902]  [<ffffffff8106976a>] process_one_work+0x13a/0x420
[ 1392.685287]  [<ffffffff81069b73>] worker_thread+0x123/0x4f0
[ 1392.686610]  [<ffffffff81069a50>] ? process_one_work+0x420/0x420
[ 1392.688045]  [<ffffffff81069a50>] ? process_one_work+0x420/0x420
[ 1392.689466]  [<ffffffff8106f0ae>] kthread+0xce/0xf0
[ 1392.690630]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1392.692329]  [<ffffffff814d4c88>] ret_from_fork+0x58/0x90
[ 1392.693617]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
(...snipped...)
[ 1714.888104] sysrq: SysRq : Show Blocked State
[ 1714.890786]   task                        PC stack   pid father
[ 1714.894315] kswapd0         D ffff88007c3a3728     0    40      2 0x00000000
[ 1714.898427]  ffff88007c3a3728 ffff88007c5514b0 ffff88007cddac00 0000000000000008
[ 1714.902908]  0000000000000286 ffff88007ac35a42 ffff88007c3a0010 ffff88007c11fdc0
[ 1714.907032]  ffff880037af61c0 00000000000009cc 00000000000147a0 ffff88007c3a3748
[ 1714.909212] Call Trace:
[ 1714.909909]  [<ffffffff814d1aee>] schedule+0x3e/0x90
[ 1714.911283]  [<ffffffffa00f4377>] xlog_grant_head_wait+0xb7/0x1c0 [xfs]
[ 1714.913067]  [<ffffffffa00f4546>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[ 1714.914850]  [<ffffffffa00f4642>] xfs_log_reserve+0xe2/0x220 [xfs]
[ 1714.916496]  [<ffffffffa00efc85>] xfs_trans_reserve+0x1e5/0x220 [xfs]
[ 1714.918216]  [<ffffffffa00efe9a>] ? _xfs_trans_alloc+0x3a/0xa0 [xfs]
[ 1714.919904]  [<ffffffffa00ca76a>] xfs_setfilesize_trans_alloc+0x4a/0xb0 [xfs]
[ 1714.921796]  [<ffffffffa00ccb15>] xfs_vm_writepage+0x4a5/0x5a0 [xfs]
[ 1714.923484]  [<ffffffff81139aac>] shrink_page_list+0x43c/0x9d0
[ 1714.925037]  [<ffffffff8113a6c5>] shrink_inactive_list+0x275/0x500
[ 1714.926706]  [<ffffffff812565c0>] ? radix_tree_gang_lookup_tag+0x90/0xd0
[ 1714.928511]  [<ffffffff8113b2e1>] shrink_lruvec+0x641/0x730
[ 1714.930028]  [<ffffffff8108098a>] ? set_next_entity+0x2a/0x60
[ 1714.931562]  [<ffffffff810aeaac>] ? lock_timer_base+0x3c/0x70
[ 1714.933089]  [<ffffffff810aed93>] ? try_to_del_timer_sync+0x53/0x70
[ 1714.934781]  [<ffffffff8113b8c5>] shrink_zone+0x75/0x1b0
[ 1714.936197]  [<ffffffff8113c19e>] kswapd+0x4de/0x9a0
[ 1714.937549]  [<ffffffff8113bcc0>] ? zone_reclaim+0x2c0/0x2c0
[ 1714.939067]  [<ffffffff8113bcc0>] ? zone_reclaim+0x2c0/0x2c0
[ 1714.940593]  [<ffffffff8106f0ae>] kthread+0xce/0xf0
[ 1714.941908]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1714.943773]  [<ffffffff814d4c88>] ret_from_fork+0x58/0x90
[ 1714.945216]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1714.947069] kworker/u16:29  D ffff88007b28e8e8     0   440      2 0x00000000
[ 1714.949024] Workqueue: writeback bdi_writeback_workfn (flush-8:0)
[ 1714.950742]  ffff88007b28e8e8 ffff88007c9288c0 ffff88007b27aa80 ffff88007b28e8e8
[ 1714.953192]  ffffffff810aed93 0000000000000002 ffff88007b28c010 ffff88007d1b8000
[ 1714.955407]  0000000100159847 ffff88007d1b8000 00000001001597e3 ffff88007b28e908
[ 1714.957577] Call Trace:
[ 1714.958269]  [<ffffffff810aed93>] ? try_to_del_timer_sync+0x53/0x70
[ 1714.959931]  [<ffffffff814d1aee>] schedule+0x3e/0x90
[ 1714.961291]  [<ffffffff814d3dc9>] schedule_timeout+0xf9/0x1a0
[ 1714.962838]  [<ffffffff810aefd0>] ? add_timer_on+0xd0/0xd0
[ 1714.964305]  [<ffffffff814d0faa>] io_schedule_timeout+0xaa/0x130
[ 1714.965903]  [<ffffffff811447af>] congestion_wait+0x7f/0x100
[ 1714.967906]  [<ffffffff8108c170>] ? woken_wake_function+0x20/0x20
[ 1714.969604]  [<ffffffff8113a8f4>] shrink_inactive_list+0x4a4/0x500
[ 1714.971251]  [<ffffffff8113b2e1>] shrink_lruvec+0x641/0x730
[ 1714.972743]  [<ffffffff8114d850>] ? list_lru_count_one+0x20/0x30
[ 1714.974338]  [<ffffffff8113b8c5>] shrink_zone+0x75/0x1b0
[ 1714.975801]  [<ffffffff8113c7f3>] do_try_to_free_pages+0x193/0x340
[ 1714.977471]  [<ffffffff8113caf7>] try_to_free_pages+0xb7/0x140
[ 1714.979048]  [<ffffffff811305bf>] __alloc_pages_nodemask+0x58f/0x9a0
[ 1714.980734]  [<ffffffff81170d87>] alloc_pages_current+0xa7/0x170
[ 1714.982347]  [<ffffffffa00d2a05>] xfs_buf_allocate_memory+0x1a5/0x290 [xfs]
[ 1714.984239]  [<ffffffffa00d3fe0>] xfs_buf_get_map+0x130/0x180 [xfs]
[ 1714.985925]  [<ffffffffa00d4060>] xfs_buf_read_map+0x30/0x100 [xfs]
[ 1714.987611]  [<ffffffffa00fffb9>] xfs_trans_read_buf_map+0xd9/0x300 [xfs]
[ 1714.989421]  [<ffffffffa00aa029>] xfs_btree_read_buf_block+0x79/0xc0 [xfs]
[ 1714.991257]  [<ffffffffa00aa274>] xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
[ 1714.993342]  [<ffffffffa00aac8f>] xfs_btree_lookup+0xcf/0x4b0 [xfs]
[ 1714.995038]  [<ffffffff81179573>] ? kmem_cache_alloc+0x163/0x1d0
[ 1714.996655]  [<ffffffffa0092e39>] xfs_alloc_lookup_eq+0x19/0x20 [xfs]
[ 1714.998429]  [<ffffffffa0093155>] xfs_alloc_fixup_trees+0x2a5/0x340 [xfs]
[ 1715.000242]  [<ffffffffa0094b8d>] xfs_alloc_ag_vextent_near+0x9ad/0xb60 [xfs]
[ 1715.002137]  [<ffffffffa0095c1d>] ? xfs_alloc_fix_freelist+0x3dd/0x470 [xfs]
[ 1715.004003]  [<ffffffffa00950a5>] xfs_alloc_ag_vextent+0xd5/0x100 [xfs]
[ 1715.005761]  [<ffffffffa0095f64>] xfs_alloc_vextent+0x2b4/0x600 [xfs]
[ 1715.007482]  [<ffffffffa00a4c48>] xfs_bmap_btalloc+0x388/0x750 [xfs]
[ 1715.009177]  [<ffffffffa00a86e8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
[ 1715.010896]  [<ffffffffa00a5034>] xfs_bmap_alloc+0x24/0x40 [xfs]
[ 1715.012551]  [<ffffffffa00a7742>] xfs_bmapi_write+0x5a2/0xa20 [xfs]
[ 1715.014248]  [<ffffffffa009e8d0>] ? xfs_bmap_last_offset+0x50/0xc0 [xfs]
[ 1715.016071]  [<ffffffffa00df6fe>] xfs_iomap_write_allocate+0x14e/0x380 [xfs]
[ 1715.017965]  [<ffffffffa00cbb79>] xfs_map_blocks+0x139/0x220 [xfs]
[ 1715.019631]  [<ffffffffa00cc7f6>] xfs_vm_writepage+0x186/0x5a0 [xfs]
[ 1715.021345]  [<ffffffff81130b47>] __writepage+0x17/0x40
[ 1715.022762]  [<ffffffff81131d57>] write_cache_pages+0x247/0x510
[ 1715.024370]  [<ffffffff81130b30>] ? set_page_dirty+0x60/0x60
[ 1715.025885]  [<ffffffff81132071>] generic_writepages+0x51/0x80
[ 1715.027614]  [<ffffffffa00cba03>] xfs_vm_writepages+0x53/0x70 [xfs]
[ 1715.029179]  [<ffffffff811320c0>] do_writepages+0x20/0x40
[ 1715.030559]  [<ffffffff811b1569>] __writeback_single_inode+0x49/0x2e0
[ 1715.032154]  [<ffffffff8108c80f>] ? wake_up_bit+0x2f/0x40
[ 1715.033498]  [<ffffffff811b1c3a>] writeback_sb_inodes+0x28a/0x4e0
[ 1715.035025]  [<ffffffff811b1f2e>] __writeback_inodes_wb+0x9e/0xd0
[ 1715.036551]  [<ffffffff811b215b>] wb_writeback+0x1fb/0x2c0
[ 1715.037916]  [<ffffffff811b22a1>] wb_do_writeback+0x81/0x1f0
[ 1715.039317]  [<ffffffff81086f3b>] ? pick_next_task_fair+0x40b/0x550
[ 1715.040897]  [<ffffffff811b2480>] bdi_writeback_workfn+0x70/0x200
[ 1715.042415]  [<ffffffff81076961>] ? dequeue_task+0x61/0x90
[ 1715.043798]  [<ffffffff8106976a>] process_one_work+0x13a/0x420
[ 1715.045260]  [<ffffffff81069b73>] worker_thread+0x123/0x4f0
[ 1715.046665]  [<ffffffff81069a50>] ? process_one_work+0x420/0x420
[ 1715.048154]  [<ffffffff81069a50>] ? process_one_work+0x420/0x420
[ 1715.049659]  [<ffffffff8106f0ae>] kthread+0xce/0xf0
[ 1715.050880]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
[ 1715.052608]  [<ffffffff814d4c88>] ret_from_fork+0x58/0x90
[ 1715.053958]  [<ffffffff8106efe0>] ? kthread_freezable_should_stop+0x70/0x70
----------

I do want hints like http://www.spinics.net/lists/linux-mm/msg81409.html for
guessing whether forward progress is being made. I'm fine with enabling
such hints under CONFIG_DEBUG_something.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/2] Move away from non-failing small allocations
  2015-03-17 14:06       ` Tetsuo Handa
@ 2015-04-02 11:53         ` Tetsuo Handa
  -1 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-04-02 11:53 UTC (permalink / raw)
  To: mhocko, david
  Cc: akpm, hannes, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > We are seeing issues with the fs code now because the test cases which
> > led to the current discussion exercise FS code. The code which does
> > lock(); kmalloc(GFP_KERNEL) is not reduced there though. I am pretty sure
> > we can find other subsystems if we try hard enough.
> 
> I'm expecting patches which avoid deadlocks caused by lock(); kmalloc(GFP_KERNEL).
> 
> > > static inline void enter_fs_code(struct super_block *sb)
> > > {
> > > 	if (sb->my_small_allocations_can_fail)
> > > 		current->small_allocations_can_fail++;
> > > }
> > > 
> > > that way (or something similar) we can select the behaviour on a per-fs
> > > basis and the rest of the kernel remains unaffected.  Other subsystems
> > > can opt in as well.
> > 
> > This is basically leading to GFP_MAYFAIL which is completely backwards
> > (the hard requirement should be an exception not a default rule).
> > I really do not want to end up with stuffing random may_fail annotations
> > all over the kernel.
> > 
> 
> I wish that GFP_NOFS / GFP_NOIO regions are annotated with
> 
>   static inline void enter_fs_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_fs_code++;
>   #endif
>   }
> 
>   static inline void leave_fs_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_fs_code--;
>   #endif
>   }
> 
>   static inline void enter_io_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_io_code++;
>   #endif
>   }
> 
>   static inline void leave_io_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_io_code--;
>   #endif
>   }
> 
> so that inappropriate GFP_KERNEL usage inside GFP_NOFS region are catchable
> by doing
> 
>   struct page *
>   __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>                           struct zonelist *zonelist, nodemask_t *nodemask)
>   {
>   	struct zoneref *preferred_zoneref;
>   	struct page *page = NULL;
>   	unsigned int cpuset_mems_cookie;
>   	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
>   	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
>   	struct alloc_context ac = {
>   		.high_zoneidx = gfp_zone(gfp_mask),
>   		.nodemask = nodemask,
>   		.migratetype = gfpflags_to_migratetype(gfp_mask),
>   	};
>   	
>   	gfp_mask &= gfp_allowed_mask;
>  +#ifdef CONFIG_DEBUG_GFP_FLAGS
>  +	WARN_ON(current->in_fs_code && (gfp_mask & __GFP_FS));
>  +	WARN_ON(current->in_io_code && (gfp_mask & __GFP_IO));
>  +#endif
>   
>   	lockdep_trace_alloc(gfp_mask);
>   
> 
> . It is difficult for non-fs developers to determine whether they need to use
> GFP_NOFS rather than GFP_KERNEL in their code. An example is seen at
> http://marc.info/?l=linux-security-module&m=138556479607024&w=2 .

I didn't receive any responses about the idea of annotating GFP_NOFS/GFP_NOIO
sections. Therefore, I wrote the patch shown below.

Dave and Michal, can we assume that a thread which enters a
GFP_NOFS/GFP_NOIO section always terminates that section itself (as with
rcu_read_lock()/rcu_read_unlock())? If we can't assume that,
mask_gfp_flags() in the patch below cannot work.
----------------------------------------
>From 0b18f21c9c9aef2353d355afc83b2a2193bbced7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 2 Apr 2015 16:47:47 +0900
Subject: [PATCH] Add macros for annotating GFP_NOFS/GFP_NOIO allocations.

While there are rules for when to use GFP_NOFS or GFP_NOIO, there are
locations where GFP_KERNEL is used by mistake. Such incorrect usage may
cause deadlock under memory pressure. But it is difficult for non-fs
developers to determine whether they need to use GFP_NOFS rather than
GFP_KERNEL in their code. Also, it is difficult to hit the deadlock caused
by such incorrect usage. Therefore, this patch introduces 4 macros for
annotating the sections where GFP_NOFS or GFP_NOIO needs to be used.

  enter_fs_code() is for annotating that allocations with __GFP_FS are
  not permitted until corresponding leave_fs_code() is called.

  leave_fs_code() is for annotating that allocations with __GFP_FS are
  permitted from now on.

  enter_io_code() is for annotating that allocations with __GFP_IO are
  not permitted until corresponding leave_io_code() is called.

  leave_io_code() is for annotating that allocations with __GFP_IO are
  permitted from now on.

These macros are no-op unless CONFIG_DEBUG_GFP_FLAGS is defined.

Note that this patch does not insert these macros into GFP_NOFS/GFP_NOIO
users. Inserting these macros with appropriate comments is up to
GFP_NOFS/GFP_NOIO users who know the dependencies well.

This patch also introduces 1 function which makes use of these macros
when CONFIG_DEBUG_GFP_FLAGS is defined.

  mask_gfp_flags() is for checking and printing warnings when __GFP_FS is
  used between enter_fs_code() and leave_fs_code() or __GFP_IO is used
  between enter_io_code() and leave_io_code().

  If GFP_KERNEL allocation is requested between enter_fs_code() and
  leave_fs_code(), a warning message

    GFP_FS allocation ($gfp_mask) is unsafe for this context

  is reported with a backtrace.

  If GFP_KERNEL or GFP_NOFS allocation is requested between enter_io_code()
  and leave_io_code(), a warning message

    GFP_IO allocation ($gfp_mask) is unsafe for this context

  is reported with a backtrace.

Note that currently mask_gfp_flags() can only catch allocations which pass
through the gfp_allowed_mask filtering. That is, mask_gfp_flags() cannot
catch allocation requests which are satisfied without reaching the page
allocator, e.g. by returning partial memory from already allocated pages.
Handling such cases is deferred to future patches.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/gfp.h   | 17 +++++++++++++++++
 include/linux/sched.h |  4 ++++
 mm/Kconfig.debug      |  7 +++++++
 mm/page_alloc.c       | 17 ++++++++++++++++-
 mm/slab.c             |  4 ++--
 mm/slob.c             |  4 ++--
 mm/slub.c             |  6 +++---
 7 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51bd1e7..97654ff 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -421,4 +421,21 @@ extern void init_cma_reserved_pageblock(struct page *page);
 
 #endif
 
+#ifdef CONFIG_DEBUG_GFP_FLAGS
+#define enter_fs_code() { current->in_fs_code++; }
+#define leave_fs_code() { current->in_fs_code--; }
+#define enter_io_code() { current->in_io_code++; }
+#define leave_io_code() { current->in_io_code--; }
+extern gfp_t mask_gfp_flags(gfp_t gfp_mask);
+#else
+#define enter_fs_code() do { } while (0)
+#define leave_fs_code() do { } while (0)
+#define enter_io_code() do { } while (0)
+#define leave_io_code() do { } while (0)
+static inline gfp_t mask_gfp_flags(gfp_t gfp_mask)
+{
+	return gfp_mask & gfp_allowed_mask;
+}
+#endif
+
 #endif /* __LINUX_GFP_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65..ebaf817 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1710,6 +1710,10 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+#ifdef CONFIG_DEBUG_GFP_FLAGS
+	unsigned char in_fs_code;
+	unsigned char in_io_code;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 957d3da..0f9ad6f 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -28,3 +28,10 @@ config DEBUG_PAGEALLOC
 
 config PAGE_POISONING
 	bool
+
+config DEBUG_GFP_FLAGS
+	bool "Check GFP flags passed to memory allocators"
+	depends on DEBUG_KERNEL
+	---help---
+	  Track and warn about incorrect use of __GFP_FS or __GFP_IO
+	  memory allocations.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40e2942..239b068 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2794,6 +2794,21 @@ got_pg:
 	return page;
 }
 
+#ifdef CONFIG_DEBUG_GFP_FLAGS
+gfp_t mask_gfp_flags(gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_WAIT) {
+		WARN((gfp_mask & __GFP_FS) && current->in_fs_code,
+		     "GFP_FS allocation (0x%x) is unsafe for this context\n",
+		     gfp_mask);
+		WARN((gfp_mask & __GFP_IO) && current->in_io_code,
+		     "GFP_IO allocation (0x%x) is unsafe for this context\n",
+		     gfp_mask);
+	}
+	return gfp_mask & gfp_allowed_mask;
+}
+#endif
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -2812,7 +2827,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		.migratetype = gfpflags_to_migratetype(gfp_mask),
 	};
 
-	gfp_mask &= gfp_allowed_mask;
+	gfp_mask = mask_gfp_flags(gfp_mask);
 
 	lockdep_trace_alloc(gfp_mask);
 
diff --git a/mm/slab.c b/mm/slab.c
index c4b89ea..8a73788 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3136,7 +3136,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	void *ptr;
 	int slab_node = numa_mem_id();
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	lockdep_trace_alloc(flags);
 
@@ -3224,7 +3224,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	unsigned long save_flags;
 	void *objp;
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	lockdep_trace_alloc(flags);
 
diff --git a/mm/slob.c b/mm/slob.c
index 94a7fed..c731eb4 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -430,7 +430,7 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
 	int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
 	void *ret;
 
-	gfp &= gfp_allowed_mask;
+	gfp = mask_gfp_flags(gfp);
 
 	lockdep_trace_alloc(gfp);
 
@@ -536,7 +536,7 @@ void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
 {
 	void *b;
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	lockdep_trace_alloc(flags);
 
diff --git a/mm/slub.c b/mm/slub.c
index 82c4737..60fe0a9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1263,7 +1263,7 @@ static inline void kfree_hook(const void *x)
 static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 						     gfp_t flags)
 {
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 	lockdep_trace_alloc(flags);
 	might_sleep_if(flags & __GFP_WAIT);
 
@@ -1276,7 +1276,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 static inline void slab_post_alloc_hook(struct kmem_cache *s,
 					gfp_t flags, void *object)
 {
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 	kmemcheck_slab_alloc(s, flags, object, slab_ksize(s));
 	kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags);
 	memcg_kmem_put_cache(s);
@@ -1339,7 +1339,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	struct kmem_cache_order_objects oo = s->oo;
 	gfp_t alloc_gfp;
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	if (flags & __GFP_WAIT)
 		local_irq_enable();
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/2] Move away from non-failing small allocations
@ 2015-04-02 11:53         ` Tetsuo Handa
  0 siblings, 0 replies; 63+ messages in thread
From: Tetsuo Handa @ 2015-04-02 11:53 UTC (permalink / raw)
  To: mhocko, david
  Cc: akpm, hannes, mgorman, riel, fengguang.wu, linux-mm, linux-kernel

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > We are seeing issues with the fs code now because the test cases which
> > led to the current discussion exercise FS code. The code which does
> > lock(); kmalloc(GFP_KERNEL) is not reduced there though. I am pretty sure
> > we can find other subsystems if we try hard enough.
> 
> I'm expecting for patches which avoids deadlock by lock(); kmalloc(GFP_KERNEL).
> 
> > > static inline void enter_fs_code(struct super_block *sb)
> > > {
> > > 	if (sb->my_small_allocations_can_fail)
> > > 		current->small_allocations_can_fail++;
> > > }
> > > 
> > > that way (or something similar) we can select the behaviour on a per-fs
> > > basis and the rest of the kernel remains unaffected.  Other subsystems
> > > can opt in as well.
> > 
> > This is basically leading to GFP_MAYFAIL which is completely backwards
> > (the hard requirement should be an exception not a default rule).
> > I really do not want to end up with stuffing random may_fail annotations
> > all over the kernel.
> > 
> 
> I wish that GFP_NOFS / GFP_NOIO regions are annotated with
> 
>   static inline void enter_fs_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_fs_code++;
>   #endif
>   }
> 
>   static inline void leave_fs_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_fs_code--;
>   #endif
>   }
> 
>   static inline void enter_io_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_io_code++;
>   #endif
>   }
> 
>   static inline void leave_io_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_io_code--;
>   #endif
>   }
> 
> so that inappropriate GFP_KERNEL usage inside GFP_NOFS region are catchable
> by doing
> 
>   struct page *
>   __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>                           struct zonelist *zonelist, nodemask_t *nodemask)
>   {
>   	struct zoneref *preferred_zoneref;
>   	struct page *page = NULL;
>   	unsigned int cpuset_mems_cookie;
>   	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
>   	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
>   	struct alloc_context ac = {
>   		.high_zoneidx = gfp_zone(gfp_mask),
>   		.nodemask = nodemask,
>   		.migratetype = gfpflags_to_migratetype(gfp_mask),
>   	};
>   	
>   	gfp_mask &= gfp_allowed_mask;
>  +#ifdef CONFIG_DEBUG_GFP_FLAGS
>  +	WARN_ON(current->in_fs_code & (gfp_mask & __GFP_FS));
>  +	WARN_ON(current->in_io_code & (gfp_mask & __GFP_IO));
>  +#endif
>   
>   	lockdep_trace_alloc(gfp_mask);
>   
> 
> . It is difficult for non-fs developers to determine whether they need to use
> GFP_NOFS than GFP_KERNEL in their code. An example is seen at
> http://marc.info/?l=linux-security-module&m=138556479607024&w=2 .

I didn't receive any responses about the idea of annotating
GFP_NOFS/GFP_NOIO sections. Therefore, I wrote the patch shown below.

Dave and Michal, can we assume that a thread which entered a
GFP_NOFS/GFP_NOIO section always leaves that section itself (as with
rcu_read_lock()/rcu_read_unlock())? If we can't assume that,
mask_gfp_flags() in the patch below cannot work.
----------------------------------------
>From 0b18f21c9c9aef2353d355afc83b2a2193bbced7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 2 Apr 2015 16:47:47 +0900
Subject: [PATCH] Add macros for annotating GFP_NOFS/GFP_NOIO allocations.

While there are rules for when to use GFP_NOFS or GFP_NOIO, there are
locations where GFP_KERNEL is used by mistake. Such incorrect usage can
cause a deadlock under memory pressure, but it is difficult for non-fs
developers to determine whether they need to use GFP_NOFS rather than
GFP_KERNEL in their code, and it is also difficult to actually hit the
deadlock caused by such incorrect usage. Therefore, this patch
introduces 4 macros for annotating the durations where GFP_NOFS or
GFP_NOIO needs to be used.

  enter_fs_code() is for annotating that allocations with __GFP_FS are
  not permitted until corresponding leave_fs_code() is called.

  leave_fs_code() is for annotating that allocations with __GFP_FS are
  permitted from now on.

  enter_io_code() is for annotating that allocations with __GFP_IO are
  not permitted until corresponding leave_io_code() is called.

  leave_io_code() is for annotating that allocations with __GFP_IO are
  permitted from now on.

These macros are no-op unless CONFIG_DEBUG_GFP_FLAGS is defined.

Note that this patch does not insert these macros into GFP_NOFS/GFP_NOIO
users. Inserting these macros, with appropriate comments, is up to the
GFP_NOFS/GFP_NOIO users who know the dependencies well.

This patch also introduces one function which makes use of these
annotations when CONFIG_DEBUG_GFP_FLAGS is defined.

  mask_gfp_flags() checks and prints a warning when __GFP_FS is used
  between enter_fs_code() and leave_fs_code(), or when __GFP_IO is used
  between enter_io_code() and leave_io_code().

  If a GFP_KERNEL allocation is requested between enter_fs_code() and
  leave_fs_code(), the warning message

    GFP_FS allocation ($gfp_mask) is unsafe for this context

  is reported with a backtrace.

  If a GFP_KERNEL or GFP_NOFS allocation is requested between enter_io_code()
  and leave_io_code(), the warning message

    GFP_IO allocation ($gfp_mask) is unsafe for this context

  is reported with a backtrace.

Note that mask_gfp_flags() can currently catch only allocation requests
which reach the paths where gfp_allowed_mask is applied. That is, it
cannot catch requests that are satisfied without entering those paths,
e.g. by returning partial memory from already allocated pages. Handling
such cases is deferred to future patches.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/gfp.h   | 17 +++++++++++++++++
 include/linux/sched.h |  4 ++++
 mm/Kconfig.debug      |  7 +++++++
 mm/page_alloc.c       | 17 ++++++++++++++++-
 mm/slab.c             |  4 ++--
 mm/slob.c             |  4 ++--
 mm/slub.c             |  6 +++---
 7 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51bd1e7..97654ff 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -421,4 +421,21 @@ extern void init_cma_reserved_pageblock(struct page *page);
 
 #endif
 
+#ifdef CONFIG_DEBUG_GFP_FLAGS
+#define enter_fs_code() do { current->in_fs_code++; } while (0)
+#define leave_fs_code() do { current->in_fs_code--; } while (0)
+#define enter_io_code() do { current->in_io_code++; } while (0)
+#define leave_io_code() do { current->in_io_code--; } while (0)
+extern gfp_t mask_gfp_flags(gfp_t gfp_mask);
+#else
+#define enter_fs_code() do { } while (0)
+#define leave_fs_code() do { } while (0)
+#define enter_io_code() do { } while (0)
+#define leave_io_code() do { } while (0)
+static inline gfp_t mask_gfp_flags(gfp_t gfp_mask)
+{
+	return gfp_mask & gfp_allowed_mask;
+}
+#endif
+
 #endif /* __LINUX_GFP_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65..ebaf817 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1710,6 +1710,10 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+#ifdef CONFIG_DEBUG_GFP_FLAGS
+	unsigned char in_fs_code;
+	unsigned char in_io_code;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 957d3da..0f9ad6f 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -28,3 +28,10 @@ config DEBUG_PAGEALLOC
 
 config PAGE_POISONING
 	bool
+
+config DEBUG_GFP_FLAGS
+	bool "Check GFP flags passed to memory allocators"
+	depends on DEBUG_KERNEL
+	---help---
+	  Track GFP_NOFS/GFP_NOIO sections and warn about incorrect use
+	  of __GFP_FS or __GFP_IO in memory allocations.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40e2942..239b068 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2794,6 +2794,21 @@ got_pg:
 	return page;
 }
 
+#ifdef CONFIG_DEBUG_GFP_FLAGS
+gfp_t mask_gfp_flags(gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_WAIT) {
+		WARN((gfp_mask & __GFP_FS) && current->in_fs_code,
+		     "GFP_FS allocation (0x%x) is unsafe for this context\n",
+		     gfp_mask);
+		WARN((gfp_mask & __GFP_IO) && current->in_io_code,
+		     "GFP_IO allocation (0x%x) is unsafe for this context\n",
+		     gfp_mask);
+	}
+	return gfp_mask & gfp_allowed_mask;
+}
+#endif
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -2812,7 +2827,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		.migratetype = gfpflags_to_migratetype(gfp_mask),
 	};
 
-	gfp_mask &= gfp_allowed_mask;
+	gfp_mask = mask_gfp_flags(gfp_mask);
 
 	lockdep_trace_alloc(gfp_mask);
 
diff --git a/mm/slab.c b/mm/slab.c
index c4b89ea..8a73788 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3136,7 +3136,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	void *ptr;
 	int slab_node = numa_mem_id();
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	lockdep_trace_alloc(flags);
 
@@ -3224,7 +3224,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	unsigned long save_flags;
 	void *objp;
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	lockdep_trace_alloc(flags);
 
diff --git a/mm/slob.c b/mm/slob.c
index 94a7fed..c731eb4 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -430,7 +430,7 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
 	int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
 	void *ret;
 
-	gfp &= gfp_allowed_mask;
+	gfp = mask_gfp_flags(gfp);
 
 	lockdep_trace_alloc(gfp);
 
@@ -536,7 +536,7 @@ void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
 {
 	void *b;
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	lockdep_trace_alloc(flags);
 
diff --git a/mm/slub.c b/mm/slub.c
index 82c4737..60fe0a9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1263,7 +1263,7 @@ static inline void kfree_hook(const void *x)
 static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 						     gfp_t flags)
 {
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 	lockdep_trace_alloc(flags);
 	might_sleep_if(flags & __GFP_WAIT);
 
@@ -1276,7 +1276,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 static inline void slab_post_alloc_hook(struct kmem_cache *s,
 					gfp_t flags, void *object)
 {
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 	kmemcheck_slab_alloc(s, flags, object, slab_ksize(s));
 	kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags);
 	memcg_kmem_put_cache(s);
@@ -1339,7 +1339,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	struct kmem_cache_order_objects oo = s->oo;
 	gfp_t alloc_gfp;
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	if (flags & __GFP_WAIT)
 		local_irq_enable();
-- 
1.8.3.1



Thread overview: 63+ messages
2015-03-11 20:54 [PATCH 0/2] Move away from non-failing small allocations Michal Hocko
2015-03-11 20:54 ` [PATCH 1/2] mm: Allow small allocations to fail Michal Hocko
2015-03-12 12:54   ` Tetsuo Handa
2015-03-12 13:12     ` Michal Hocko
2015-03-15  5:43   ` Tetsuo Handa
2015-03-15 12:13     ` Michal Hocko
2015-03-15 13:06       ` Tetsuo Handa
2015-03-16  7:46         ` [PATCH 1/2 v2] " Michal Hocko
2015-03-16 21:11           ` Johannes Weiner
2015-03-17 10:25             ` Michal Hocko
2015-03-17 13:29               ` Johannes Weiner
2015-03-17 14:17                 ` Michal Hocko
2015-03-17 17:26                   ` Johannes Weiner
2015-03-17 19:41                     ` Michal Hocko
2015-03-18  9:10                       ` Vlastimil Babka
2015-03-18 12:04                         ` Michal Hocko
2015-03-18 12:36                         ` Tetsuo Handa
2015-03-18 11:35                       ` Tetsuo Handa
2015-03-17 11:13           ` Tetsuo Handa
2015-03-17 13:15             ` Michal Hocko
2015-03-18 11:33               ` Tetsuo Handa
2015-03-18 12:23                 ` Michal Hocko
2015-03-19 11:03                   ` Tetsuo Handa
2015-03-11 20:54 ` [PATCH 2/2] mmotm: Enable small allocation " Michal Hocko
2015-03-11 22:36 ` [PATCH 0/2] Move away from non-failing small allocations Sasha Levin
2015-03-16 22:38 ` Andrew Morton
2015-03-17  9:07   ` Michal Hocko
2015-03-17 14:06     ` Tetsuo Handa
2015-04-02 11:53       ` Tetsuo Handa
