linux-mm.kvack.org archive mirror
* [RFC PATCH 0/1] oom support for reclaiming of hugepages
@ 2017-07-27 18:02 Liam R. Howlett
  2017-07-27 18:02 ` [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events Liam R. Howlett
  0 siblings, 1 reply; 17+ messages in thread
From: Liam R. Howlett @ 2017-07-27 18:02 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mhocko, n-horiguchi, mike.kravetz, aneesh.kumar, khandual,
	punit.agrawal, arnd, gerald.schaefer, aarcange, oleg,
	penguin-kernel, mingo, kirill.shutemov, vdavydov.dev, willy

I'm looking for comments on how to avoid the failure scenario where a correctly
configured system fails to boot after taking corrective action when a memory
module goes bad.  Right now, if a memory event causes a system reboot and the
UEFI firmware removes the failed memory from the memory pool, Linux may not
have enough memory to boot due to the huge page reserve.

The patch in its current state will reclaim hugepages whenever they are free,
whether at boot or not - which may not be desirable, or maybe it is?
I've looked through how select_bad_process() works and do not see a clean way
to hook into this function when the victim is not a task.

I also could not find a good place to add CONFIG_HUGETLB_PAGE_OOM (it is
currently #defined directly in mm/hugetlb.c).  Obviously that would need to go
somewhere sane, such as a Kconfig entry.

Liam R. Howlett (1):
  mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM
    events.

 include/linux/hugetlb.h |  1 +
 mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
 mm/oom_kill.c           |  8 ++++++++
 3 files changed, 44 insertions(+)

-- 
2.13.0.90.g1eb437020

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-27 18:02 [RFC PATCH 0/1] oom support for reclaiming of hugepages Liam R. Howlett
@ 2017-07-27 18:02 ` Liam R. Howlett
  2017-07-28  6:46   ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Liam R. Howlett @ 2017-07-27 18:02 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mhocko, n-horiguchi, mike.kravetz, aneesh.kumar, khandual,
	punit.agrawal, arnd, gerald.schaefer, aarcange, oleg,
	penguin-kernel, mingo, kirill.shutemov, vdavydov.dev, willy

When a system runs out of memory it may be desirable to reclaim
unreserved hugepages.  This situation arises when a correctly configured
system has a memory failure, takes the corrective action of rebooting,
and removes the failed memory from the memory pool, leaving the system
unable to boot.  With this change, the out of memory handler is able to
reclaim any hugepages that are free and not reserved.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/hugetlb.h |  1 +
 mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
 mm/oom_kill.c           |  8 ++++++++
 3 files changed, 44 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8d9fe131a240..20e5729b9e9a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -470,6 +470,7 @@ static inline pgoff_t basepage_index(struct page *page)
 }
 
 extern int dissolve_free_huge_page(struct page *page);
+extern unsigned long decrease_free_hugepages(nodemask_t *nodes);
 extern int dissolve_free_huge_pages(unsigned long start_pfn,
 				    unsigned long end_pfn);
 static inline bool hugepage_migration_supported(struct hstate *h)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc48ee783dd9..00a0e08b96c5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1454,6 +1454,41 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 }
 
 /*
+ * Decrement free hugepages.  Used by the OOM killer to avoid killing a task
+ * if there are free huge pages that can be used instead.
+ * Returns the number of bytes reclaimed from hugepages.
+ */
+#define CONFIG_HUGETLB_PAGE_OOM
+unsigned long decrease_free_hugepages(nodemask_t *nodes)
+{
+#ifdef CONFIG_HUGETLB_PAGE_OOM
+	struct hstate *h;
+	unsigned long ret = 0;
+
+	spin_lock(&hugetlb_lock);
+	for_each_hstate(h) {
+		if (h->free_huge_pages > h->resv_huge_pages) {
+			char buf[32];
+
+			memfmt(buf, huge_page_size(h));
+			ret = free_pool_huge_page(h, nodes ?
+						  nodes : &node_online_map, 0);
+			pr_warn("HugeTLB: Reclaiming %lu hugepage(s) of page size %s\n",
+				ret, buf);
+			ret *= huge_page_size(h);
+			goto found;
+		}
+	}
+
+found:
+	spin_unlock(&hugetlb_lock);
+	return ret;
+#else
+	return 0;
+#endif /* CONFIG_HUGETLB_PAGE_OOM */
+}
+
+/*
  * Dissolve a given free hugepage into free buddy pages. This function does
  * nothing for in-use (including surplus) hugepages. Returns -EBUSY if the
  * number of free hugepages would be reduced below the number of reserved
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..0a42f6d7d253 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -40,6 +40,7 @@
 #include <linux/ratelimit.h>
 #include <linux/kthread.h>
 #include <linux/init.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -1044,6 +1045,13 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	/* Reclaim a free, unreserved hugepage. */
+	freed = decrease_free_hugepages(oc->nodemask);
+	if (freed != 0) {
+		pr_err("Out of memory: Reclaimed %lu from HugeTLB\n", freed);
+		return true;
+	}
+
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
-- 
2.13.0.90.g1eb437020


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-27 18:02 ` [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events Liam R. Howlett
@ 2017-07-28  6:46   ` Michal Hocko
  2017-07-28 11:33     ` Kirill A. Shutemov
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-07-28  6:46 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: linux-mm, akpm, n-horiguchi, mike.kravetz, aneesh.kumar,
	khandual, punit.agrawal, arnd, gerald.schaefer, aarcange, oleg,
	penguin-kernel, mingo, kirill.shutemov, vdavydov.dev, willy

On Thu 27-07-17 14:02:36, Liam R. Howlett wrote:
> When a system runs out of memory it may be desirable to reclaim
> unreserved hugepages.  This situation arises when a correctly configured
> system has a memory failure and takes corrective action of rebooting and
> removing the memory from the memory pool results in a system failing to
> boot.  With this change, the out of memory handler is able to reclaim
> any pages that are free and not reserved.

I am sorry but I have to Nack this. You are breaking the basic contract
of the hugetlb user API. The administrator configures the pool to suit a
workload; it is a deliberate and privileged action. We allow
overcommitting that pool should there be an immediate need for more
hugetlb pages, and we do remove those when they are freed. If we don't,
then that should be fixed.
Other than that, hugetlb pages are not reclaimable by design, and users
do rely on that. Otherwise they could consider using THP instead.

If somebody configures the initial pool too high, it is a configuration
bug. Just think about it: we do not want to reset lowmem reserves
configured by the admin just because we are hitting the OOM killer, and
yes, insanely large lowmem reserves might lead to an early OOM as well.

Nacked-by: Michal Hocko <mhocko@suse.com>

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-28  6:46   ` Michal Hocko
@ 2017-07-28 11:33     ` Kirill A. Shutemov
  2017-07-28 12:23       ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Kirill A. Shutemov @ 2017-07-28 11:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Liam R. Howlett, linux-mm, akpm, n-horiguchi, mike.kravetz,
	aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
	aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
	vdavydov.dev, willy

On Fri, Jul 28, 2017 at 08:46:02AM +0200, Michal Hocko wrote:
> On Thu 27-07-17 14:02:36, Liam R. Howlett wrote:
> > When a system runs out of memory it may be desirable to reclaim
> > unreserved hugepages.  This situation arises when a correctly configured
> > system has a memory failure and takes corrective action of rebooting and
> > removing the memory from the memory pool results in a system failing to
> > boot.  With this change, the out of memory handler is able to reclaim
> > any pages that are free and not reserved.
> 
> I am sorry but I have to Nack this. You are breaking the basic contract
> of hugetlb user API. Administrator configures the pool to suit a
> workload. It is a deliberate and privileged action. We allow to
> overcommit that pool should there be a immediate need for more hugetlb
> pages and we do remove those when they are freed. If we don't then this
> should be fixed.
> Other than that hugetlb pages are not reclaimable by design and users
> do rely on that. Otherwise they could consider using THP instead.
> 
> If somebody configures the initial pool too high it is a configuration
> bug. Just think about it, we do not want to reset lowmem reserves
> configured by admin just because we are hitting the oom killer and yes
> insanely large lowmem reserves might lead to early OOM as well.
> 
> Nacked-by: Michal Hocko <mhocko@suse.com>

Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
something to be considered as a last resort after all other measures
have been tried.

I think we could allow hugetlb reclaim just to keep the system alive,
taint the kernel and indicate that a reboot is needed.

The situation is somewhat similar to BUG() vs. WARN().

-- 
 Kirill A. Shutemov


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-28 11:33     ` Kirill A. Shutemov
@ 2017-07-28 12:23       ` Michal Hocko
  2017-07-28 12:44         ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-07-28 12:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Liam R. Howlett, linux-mm, akpm, n-horiguchi, mike.kravetz,
	aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
	aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
	vdavydov.dev, willy

On Fri 28-07-17 14:33:47, Kirill A. Shutemov wrote:
> On Fri, Jul 28, 2017 at 08:46:02AM +0200, Michal Hocko wrote:
> > On Thu 27-07-17 14:02:36, Liam R. Howlett wrote:
> > > When a system runs out of memory it may be desirable to reclaim
> > > unreserved hugepages.  This situation arises when a correctly configured
> > > system has a memory failure and takes corrective action of rebooting and
> > > removing the memory from the memory pool results in a system failing to
> > > boot.  With this change, the out of memory handler is able to reclaim
> > > any pages that are free and not reserved.
> > 
> > I am sorry but I have to Nack this. You are breaking the basic contract
> > of hugetlb user API. Administrator configures the pool to suit a
> > workload. It is a deliberate and privileged action. We allow to
> > overcommit that pool should there be a immediate need for more hugetlb
> > pages and we do remove those when they are freed. If we don't then this
> > should be fixed.
> > Other than that hugetlb pages are not reclaimable by design and users
> > do rely on that. Otherwise they could consider using THP instead.
> > 
> > If somebody configures the initial pool too high it is a configuration
> > bug. Just think about it, we do not want to reset lowmem reserves
> > configured by admin just because we are hitting the oom killer and yes
> > insanely large lowmem reserves might lead to early OOM as well.
> > 
> > Nacked-by: Michal Hocko <mhocko@suse.com>
> 
> Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
> something to be considered as last resort after all other measures have
> been tried.

The system can recover from the OOM killer in most cases, and there is
no real reason to break contracts which the administrator established.
On the other hand, you cannot assume correct operation of software which
depends on hugetlb pages in general; such software might see unexpected
crashes, data corruption and what not.

> I think we can allow hugetlb reclaim just to keep system alive, taint
> kernel and indicate that "reboot needed".

Let me repeat: this is an admin-only configuration and we cannot pretend
to fix those decisions. If there is a configuration issue, it should be
fixed, and the OOM killer along with the OOM report explaining the
situation is the best way forward IMO.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-28 12:23       ` Michal Hocko
@ 2017-07-28 12:44         ` Michal Hocko
  2017-07-29  1:56           ` Liam R. Howlett
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-07-28 12:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Liam R. Howlett, linux-mm, akpm, n-horiguchi, mike.kravetz,
	aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
	aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
	vdavydov.dev, willy

On Fri 28-07-17 14:23:50, Michal Hocko wrote:
> On Fri 28-07-17 14:33:47, Kirill A. Shutemov wrote:
> > On Fri, Jul 28, 2017 at 08:46:02AM +0200, Michal Hocko wrote:
> > > On Thu 27-07-17 14:02:36, Liam R. Howlett wrote:
> > > > When a system runs out of memory it may be desirable to reclaim
> > > > unreserved hugepages.  This situation arises when a correctly configured
> > > > system has a memory failure and takes corrective action of rebooting and
> > > > removing the memory from the memory pool results in a system failing to
> > > > boot.  With this change, the out of memory handler is able to reclaim
> > > > any pages that are free and not reserved.
> > > 
> > > I am sorry but I have to Nack this. You are breaking the basic contract
> > > of hugetlb user API. Administrator configures the pool to suit a
> > > workload. It is a deliberate and privileged action. We allow to
> > > overcommit that pool should there be a immediate need for more hugetlb
> > > pages and we do remove those when they are freed. If we don't then this
> > > should be fixed.
> > > Other than that hugetlb pages are not reclaimable by design and users
> > > do rely on that. Otherwise they could consider using THP instead.
> > > 
> > > If somebody configures the initial pool too high it is a configuration
> > > bug. Just think about it, we do not want to reset lowmem reserves
> > > configured by admin just because we are hitting the oom killer and yes
> > > insanely large lowmem reserves might lead to early OOM as well.
> > > 
> > > Nacked-by: Michal Hocko <mhocko@suse.com>
> > 
> > Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
> > something to be considered as last resort after all other measures have
> > been tried.
> 
> System can recover from the OOM killer in most cases and there is no
> real reason to break contracts which administrator established. On the
> other hand you cannot assume correct operation of the SW which depends
> on hugetlb pages in general. Such a SW might get unexpected crashes/data
> corruptions and what not.

And to be clear: memory hotplug currently does a similar thing via
dissolve_free_huge_pages, and I believe that is wrong as well, although
one could argue that memory offlining is an admin action too, so
reducing hugetlb pages there is a reasonable thing to do. That would be
a separate discussion, though.

But an OOM can happen for entirely different reasons while hugetlb is
configured properly, and this change would simply break that setup.
This is simply a no-go.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-28 12:44         ` Michal Hocko
@ 2017-07-29  1:56           ` Liam R. Howlett
  2017-07-31  9:10             ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Liam R. Howlett @ 2017-07-29  1:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz,
	aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
	aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
	vdavydov.dev, willy

* Michal Hocko <mhocko@kernel.org> [170728 08:44]:
> On Fri 28-07-17 14:23:50, Michal Hocko wrote:
> > On Fri 28-07-17 14:33:47, Kirill A. Shutemov wrote:
> > > On Fri, Jul 28, 2017 at 08:46:02AM +0200, Michal Hocko wrote:
> > > > On Thu 27-07-17 14:02:36, Liam R. Howlett wrote:
> > > > > When a system runs out of memory it may be desirable to reclaim
> > > > > unreserved hugepages.  This situation arises when a correctly configured
> > > > > system has a memory failure and takes corrective action of rebooting and
> > > > > removing the memory from the memory pool results in a system failing to
> > > > > boot.  With this change, the out of memory handler is able to reclaim
> > > > > any pages that are free and not reserved.
> > > > 
> > > > I am sorry but I have to Nack this. You are breaking the basic contract
> > > > of hugetlb user API. Administrator configures the pool to suit a
> > > > workload. It is a deliberate and privileged action. We allow to
> > > > overcommit that pool should there be a immediate need for more hugetlb
> > > > pages and we do remove those when they are freed. If we don't then this
> > > > should be fixed.

This is certainly a work in progress and I appreciate you taking the
time to point out the issues.  I didn't mean to suggest merging this as
it is today.

> > > > Other than that hugetlb pages are not reclaimable by design and users
> > > > do rely on that. Otherwise they could consider using THP instead.
> > > > 
> > > > If somebody configures the initial pool too high it is a configuration
> > > > bug. Just think about it, we do not want to reset lowmem reserves
> > > > configured by admin just because we are hitting the oom killer and yes
> > > > insanely large lowmem reserves might lead to early OOM as well.

The case I raise is a correctly configured system which has a memory
module failure.  Modern systems will reboot and remove the memory from
the memory pool.  Linux will start to load and run out of memory.  I get
that this code has the side effect of doing what you're saying.  Do you
see this as a worthwhile feature, and if so, do you know of a better way
for me to trigger the behaviour?

> > > > 
> > > > Nacked-by: Michal Hocko <mhocko@suse.com>
> > > 
> > > Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
> > > something to be considered as last resort after all other measures have
> > > been tried.
> > 
> > System can recover from the OOM killer in most cases and there is no
> > real reason to break contracts which administrator established. On the
> > other hand you cannot assume correct operation of the SW which depends
> > on hugetlb pages in general. Such a SW might get unexpected crashes/data
> > corruptions and what not.

My question about allowing the reclaim to happen all the time was, as
Kirill said: if there's memory that's not being used, then why panic (or
kill a task)?  I see that Michal has thought this through, though.  My
intent was to add this as a config option, but it sounds like that's
also a bad plan.

> 
> And to be clear. The memory hotpug currently does the similar thing via
> dissolve_free_huge_pages and I believe that is wrong as well although
> one could argue that the memory offline is an admin action as well so
> reducing hugetlb pages is a reasonable thing to do. This would be for a
> separate discussion though.
> 
> But OOM can happen for entirely different reasons and hugetlb might be
> configured properly while this change would simply break that setup.
> This is simply nogo.
> 

Yes, this patch is certainly not the final version, for that specific
reason.  I didn't see a good way to plug into the OOM path and was
looking for suggestions.  Sorry if that was not clear.

The root problem I'm trying to solve isn't a misconfiguration but
covering the case where the system recovers from a failure while Linux
does not.

Here are a few other ideas that may or may not be better (or sane):

Would specifying a percentage of memory instead of a specific number
perhaps be a better approach than reclaiming?  That would still leave
those who use hard values vulnerable, but at least provide a safer
alternative.  It's also a pretty brutal interface for someone to use.

We could figure out that there's a bad memory module and enable this
behaviour on boot only?  I am unclear on how to do either of those, but
in combination they would allow the issue to be detected and failures
avoided.  I have looked into detecting when we're booting and have not
had much luck there.  I believe dmidecode can pick up disabled modules,
so that part should be plausible?  Would enabling this code during boot,
and only when a disabled module exists, be acceptable?

Disable all configured hugepages when there's a disabled memory module
and throw a WARN?

Are there any other options?

Thank you both for your comments and time,
Liam


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-29  1:56           ` Liam R. Howlett
@ 2017-07-31  9:10             ` Michal Hocko
  2017-07-31 13:56               ` Liam R. Howlett
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-07-31  9:10 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz,
	aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
	aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
	vdavydov.dev, willy

On Fri 28-07-17 21:56:38, Liam R. Howlett wrote:
> * Michal Hocko <mhocko@kernel.org> [170728 08:44]:
> > On Fri 28-07-17 14:23:50, Michal Hocko wrote:
> > > > > Other than that hugetlb pages are not reclaimable by design and users
> > > > > do rely on that. Otherwise they could consider using THP instead.
> > > > > 
> > > > > If somebody configures the initial pool too high it is a configuration
> > > > > bug. Just think about it, we do not want to reset lowmem reserves
> > > > > configured by admin just because we are hitting the oom killer and yes
> > > > > insanely large lowmem reserves might lead to early OOM as well.
> 
> The case I raise is a correctly configured system which has a memory
> module failure.

So you are concerned about MCEs due to failing memory modules? If so,
why do you care about hugetlb in particular?

> Modern systems will reboot and remove the memory from
> the memory pool.  Linux will start to load and run out of memory.  I get
> that this code has the side effect of doing what you're saying.  Do you
> see this as a worth while feature and if so, do you know of a better way
> for me to trigger the behaviour?

I do not understand your question. Could you elaborate, please? Are
you talking about the system going into OOM because of too many MCEs?

> > > > > Nacked-by: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
> > > > something to be considered as last resort after all other measures have
> > > > been tried.
> > > 
> > > System can recover from the OOM killer in most cases and there is no
> > > real reason to break contracts which administrator established. On the
> > > other hand you cannot assume correct operation of the SW which depends
> > > on hugetlb pages in general. Such a SW might get unexpected crashes/data
> > > corruptions and what not.
> 
> My question about allowing the reclaim to happen all the time was like
> Kirill said, if there's memory that's not being used then why panic (or
> kill a task)?  I see that Michal has thought this through though.  My
> intent was to add this as a config option, but it sounds like that's
> also a bad plan.

You cannot reclaim something that the administrator has asked to be
available. Sure, we can reclaim the excess if there is any, but that is
not what your patch does.
 
> > And to be clear. The memory hotpug currently does the similar thing via
> > dissolve_free_huge_pages and I believe that is wrong as well although
> > one could argue that the memory offline is an admin action as well so
> > reducing hugetlb pages is a reasonable thing to do. This would be for a
> > separate discussion though.
> > 
> > But OOM can happen for entirely different reasons and hugetlb might be
> > configured properly while this change would simply break that setup.
> > This is simply nogo.
> > 
> 
> Yes, this patch is certainly not the final version for that specific
> reason.  I didn't see a good way to plug in to the OOM and was looking
> for suggestions.  Sorry if that was not clear.
> 
> The root problem I'm trying to solve isn't a misconfiguration but to
> cover off the case of the system recovering from a failure while Linux
> will not.

Please be more specific about what you mean by "failure". It is hard
to comment further without a clear definition of the problem you are
trying to address.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-31  9:10             ` Michal Hocko
@ 2017-07-31 13:56               ` Liam R. Howlett
  2017-07-31 14:08                 ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Liam R. Howlett @ 2017-07-31 13:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz,
	aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
	aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
	vdavydov.dev, willy

* Michal Hocko <mhocko@kernel.org> [170731 05:10]:
> On Fri 28-07-17 21:56:38, Liam R. Howlett wrote:
> > * Michal Hocko <mhocko@kernel.org> [170728 08:44]:
> > > On Fri 28-07-17 14:23:50, Michal Hocko wrote:
> > > > > > Other than that hugetlb pages are not reclaimable by design and users
> > > > > > do rely on that. Otherwise they could consider using THP instead.
> > > > > > 
> > > > > > If somebody configures the initial pool too high it is a configuration
> > > > > > bug. Just think about it, we do not want to reset lowmem reserves
> > > > > > configured by admin just because we are hitting the oom killer and yes
> > > > > > insanely large lowmem reserves might lead to early OOM as well.
> > 
> > The case I raise is a correctly configured system which has a memory
> > module failure.
> 
> So you are concerned about MCEs due to failing memory modules? If yes
> why do you care about hugetlb in particular?

No, I am concerned about a failed memory module.  The system will
detect certain failures, mark the memory as bad and automatically
reboot.  Upon rebooting, that module will not be used.

My focus on hugetlb is that it can stop the automatic recovery of the
system.  Are there other reservations that should also be considered?

> 
> > Modern systems will reboot and remove the memory from
> > the memory pool.  Linux will start to load and run out of memory.  I get
> > that this code has the side effect of doing what you're saying.  Do you
> > see this as a worth while feature and if so, do you know of a better way
> > for me to trigger the behaviour?
> 
> I do not understand your question. Could you elaborate more please? Are
> you talking about system going into OOM because of too many MCEs?

No, I'm talking about failed memory for whatever reason.  The system
reboots by hardware means (I believe the memory controller) and removes
the memory on the failed module from the pool.  Now you effectively have
a system with less memory than before, which invalidates your
configuration.  Is it worthwhile to have Linux successfully boot when
the system attempts to recover from a failure?

> 
> > > > > > Nacked-by: Michal Hocko <mhocko@suse.com>
> > > > > 
> > > > > Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
> > > > > something to be considered as last resort after all other measures have
> > > > > been tried.
> > > > 
> > > > System can recover from the OOM killer in most cases and there is no
> > > > real reason to break contracts which administrator established. On the
> > > > other hand you cannot assume correct operation of the SW which depends
> > > > on hugetlb pages in general. Such a SW might get unexpected crashes/data
> > > > corruptions and what not.
> > 
> > My question about allowing the reclaim to happen all the time was like
> > Kirill said, if there's memory that's not being used then why panic (or
> > kill a task)?  I see that Michal has thought this through though.  My
> > intent was to add this as a config option, but it sounds like that's
> > also a bad plan.
> 
> You cannot reclaim something that the administrator has asked for to be
> available. Sure we can reclaim the excess if there is any but that is
> not what your patch does

I'm looking at free_huge_pages vs. resv_huge_pages.  I thought
resv_huge_pages counted the free pages that are already requested, so
if there were more free than reserved, the difference would be excess?

>  
> > > And to be clear. The memory hotpug currently does the similar thing via
> > > dissolve_free_huge_pages and I believe that is wrong as well although
> > > one could argue that the memory offline is an admin action as well so
> > > reducing hugetlb pages is a reasonable thing to do. This would be for a
> > > separate discussion though.
> > > 
> > > But OOM can happen for entirely different reasons and hugetlb might be
> > > configured properly while this change would simply break that setup.
> > > This is simply nogo.
> > > 
> > 
> > Yes, this patch is certainly not the final version for that specific
> > reason.  I didn't see a good way to plug in to the OOM and was looking
> > for suggestions.  Sorry if that was not clear.
> > 
> > The root problem I'm trying to solve isn't a misconfiguration but to
> > cover off the case of the system recovering from a failure while Linux
> > will not.
> 
> Please be more specific what you mean by the "failure". It is hard to
> comment on further things without a clear definition what is the problem
> you are trying to address.

The failure is when a memory module is detected as bad by the hardware
for any reason, which causes that memory to be marked bad and removed
from the memory pool on an automatic reboot.  Sorry for not being clear
here; I thought this would have been evident from what I stated in
RFC PATCH 0/1.

I hope this helps clarify things.

Thanks,
Liam

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-31 13:56               ` Liam R. Howlett
@ 2017-07-31 14:08                 ` Michal Hocko
  2017-07-31 14:37                   ` Matthew Wilcox
  2017-08-01  1:11                   ` Liam R. Howlett
  0 siblings, 2 replies; 17+ messages in thread
From: Michal Hocko @ 2017-07-31 14:08 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz,
	aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
	aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
	vdavydov.dev, willy

On Mon 31-07-17 09:56:48, Liam R. Howlett wrote:
> * Michal Hocko <mhocko@kernel.org> [170731 05:10]:
> > On Fri 28-07-17 21:56:38, Liam R. Howlett wrote:
> > > * Michal Hocko <mhocko@kernel.org> [170728 08:44]:
> > > > On Fri 28-07-17 14:23:50, Michal Hocko wrote:
> > > > > > > Other than that hugetlb pages are not reclaimable by design and users
> > > > > > > do rely on that. Otherwise they could consider using THP instead.
> > > > > > > 
> > > > > > > If somebody configures the initial pool too high it is a configuration
> > > > > > > bug. Just think about it, we do not want to reset lowmem reserves
> > > > > > > configured by admin just because we are hitting the oom killer and yes
> > > > > > > insanely large lowmem reserves might lead to early OOM as well.
> > > 
> > > The case I raise is a correctly configured system which has a memory
> > > module failure.
> > 
> > So you are concerned about MCEs due to failing memory modules? If yes
> > why do you care about hugetlb in particular?
> 
> No,  I am concerned about a failed memory module.  The system will
> detect certain failures, mark the memory as bad and automatically
> reboot.  Upon rebooting, that module will not be used.

How do you detect/configure this? We do have HWPoison infrastructure

> My focus on hugetlb is that it can stop the automatic recovery of the
> system.

How?

> Are there other reservations that should also be considered?

What about any other memory reservations by memmap= kernel command line?
 
> > > Modern systems will reboot and remove the memory from
> > > the memory pool.  Linux will start to load and run out of memory.  I get
> > > that this code has the side effect of doing what you're saying.  Do you
> > see this as a worthwhile feature and if so, do you know of a better way
> > > for me to trigger the behaviour?
> > 
> > I do not understand your question. Could you elaborate more please? Are
> > you talking about system going into OOM because of too many MCEs?
> 
> No,  I'm talking about failed memory for whatever reason.  The system
> reboots by a hardware means (I believe the memory controller) and
> removes the memory on that failed module from the pool.  Now you
> effectively have a system with less memory than before which invalidates
> > your configuration.  Is it worthwhile to have Linux successfully boot
> when the system attempts to recover from a failure?

Certainly yes, but if you boot with much less memory and you want to use
hugetlb pages then you have to reconsider and maybe even reconfigure
your workload to reflect new conditions. So I am not really sure this
can be fully automated.

> > > > > > > Nacked-by: Michal Hocko <mhocko@suse.com>
> > > > > > 
> > > > > > Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
> > > > > > something to be considered as last resort after all other measures have
> > > > > > been tried.
> > > > > 
> > > > > System can recover from the OOM killer in most cases and there is no
> > > > > real reason to break contracts which administrator established. On the
> > > > > other hand you cannot assume correct operation of the SW which depends
> > > > > on hugetlb pages in general. Such a SW might get unexpected crashes/data
> > > > > corruptions and what not.
> > > 
> > > My question about allowing the reclaim to happen all the time was like
> > > Kirill said, if there's memory that's not being used then why panic (or
> > > kill a task)?  I see that Michal has thought this through though.  My
> > > intent was to add this as a config option, but it sounds like that's
> > > also a bad plan.
> > 
> > You cannot reclaim something that the administrator has asked for to be
> > available. Sure we can reclaim the excess if there is any but that is
> > not what your patch does
> 
> I'm looking at the free_huge_pages vs the resv_huge_pages.  I thought
> the resv_huge_pages were the free pages that are already requested, so
> if there were more free than reserved then they would be excess?

The terminology is a little bit confusing here. Hugetlb memory we have
committed to is reserved (e.g. by mmap), and we surely can have free
pages on top of resv_huge_pages, but that is not an excess yet. We can
have surplus pages, which would be an excess over what the admin
configured initially. See Documentation/vm/{hugetlbpage.txt,hugetlbfs_reserv.txt}
for more information.
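For concreteness, the distinction can be sketched in a few lines (plain Python with hypothetical numbers, mirroring the meaning of the counters described above; this is not kernel code):

```python
# A rough sketch of the pool accounting described above; names mirror
# the hugetlb counters exposed in /proc/meminfo and sysfs.

def pool_state(free, resv, surplus):
    """Classify a hugetlb pool's free pages.

    free - resv pages are unreserved, but they still belong to the pool
    the admin configured, so reclaiming them breaks the contract.  Only
    surplus pages -- allocated beyond the configured pool -- are a true
    excess.
    """
    return {
        "unreserved_free": free - resv,  # usable by future hugetlb mappings
        "reclaimable_excess": surplus,   # the only excess over the config
    }

# 12000 free pages, 8000 of them reserved, no surplus: 4000 pages are
# unreserved yet still not safely reclaimable.
print(pool_state(12000, 8000, 0))
```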
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-31 14:08                 ` Michal Hocko
@ 2017-07-31 14:37                   ` Matthew Wilcox
  2017-07-31 14:49                     ` Michal Hocko
  2017-08-01  1:11                   ` Liam R. Howlett
  1 sibling, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2017-07-31 14:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Liam R. Howlett, Kirill A. Shutemov, linux-mm, akpm, n-horiguchi,
	mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd,
	gerald.schaefer, aarcange, oleg, penguin-kernel, mingo,
	kirill.shutemov, vdavydov.dev

On Mon, Jul 31, 2017 at 04:08:10PM +0200, Michal Hocko wrote:
> On Mon 31-07-17 09:56:48, Liam R. Howlett wrote:
> > * Michal Hocko <mhocko@kernel.org> [170731 05:10]:
> > > On Fri 28-07-17 21:56:38, Liam R. Howlett wrote:
> > > > The case I raise is a correctly configured system which has a memory
> > > > module failure.
> > > 
> > > So you are concerned about MCEs due to failing memory modules? If yes
> > > why do you care about hugetlb in particular?
> > 
> > No,  I am concerned about a failed memory module.  The system will
> > detect certain failures, mark the memory as bad and automatically
> > reboot.  Upon rebooting, that module will not be used.
> 
> How do you detect/configure this? We do have HWPoison infrastructure
> 
> > My focus on hugetlb is that it can stop the automatic recovery of the
> > system.
> 
> How?

Let me try to explain the situation as I understand it.

The customer has purchased a 128TB machine in order to run a database.
They reserve 124TB of memory for use by the database cache.  Everything
works great.  Then a 4TB memory module goes bad.  The machine reboots
itself in order to return to operation, now having only 124TB of memory
and having 124TB of memory reserved.  It OOMs during boot.  The current
output from our OOM machinery doesn't point the sysadmin at the kernel
command line parameter as now being the problem.  So they file a priority
1 problem ticket ...


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-31 14:37                   ` Matthew Wilcox
@ 2017-07-31 14:49                     ` Michal Hocko
  2017-08-01  1:25                       ` Liam R. Howlett
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-07-31 14:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Liam R. Howlett, Kirill A. Shutemov, linux-mm, akpm, n-horiguchi,
	mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd,
	gerald.schaefer, aarcange, oleg, penguin-kernel, mingo,
	kirill.shutemov, vdavydov.dev

On Mon 31-07-17 07:37:35, Matthew Wilcox wrote:
> On Mon, Jul 31, 2017 at 04:08:10PM +0200, Michal Hocko wrote:
> > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote:
[...]
> > > My focus on hugetlb is that it can stop the automatic recovery of the
> > > system.
> > 
> > How?
> 
> Let me try to explain the situation as I understand it.
> 
> The customer has purchased a 128TB machine in order to run a database.
> They reserve 124TB of memory for use by the database cache.  Everything
> works great.  Then a 4TB memory module goes bad.  The machine reboots
> itself in order to return to operation, now having only 124TB of memory
> and having 124TB of memory reserved.  It OOMs during boot.  The current
> output from our OOM machinery doesn't point the sysadmin at the kernel
> command line parameter as now being the problem.  So they file a priority
> 1 problem ticket ...

Well, I would argue that the oom report is quite clear that the hugetlb
memory has consumed a large part, if not the whole, of usable memory,
and that should give a clue...

Nevertheless, I can see some merit here, but I am arguing that there
is simply no good way to handle this without admin involvement,
unless we want to risk other and much more subtle breakage where the
application really expects it can consume the preallocated hugetlb pool
completely. And I would even argue that the latter is more probable than
an unintended memory-failure reboot cycle. If somebody can tune the
hugetlb pool dynamically I would recommend doing so from an init script.
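The init-script approach suggested here can be sketched roughly as follows (the 90% target and the 2 MB huge page size are hypothetical; a real script would read MemTotal from /proc/meminfo and write the result to /proc/sys/vm/nr_hugepages):

```python
# Sketch: size the hugetlb pool from the memory actually present at
# boot rather than hard-coding a page count on the kernel command line,
# so losing a DIMM shrinks the pool instead of making boot impossible.

def hugepages_for(mem_total_kb, hugepage_kb=2048, percent=90):
    """Number of huge pages covering `percent` of the memory present."""
    return (mem_total_kb * percent // 100) // hugepage_kb

# On a 128 GB machine, ~90% of memory as 2 MB pages:
print(hugepages_for(128 * 1024 * 1024))  # -> 58982
```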
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-31 14:08                 ` Michal Hocko
  2017-07-31 14:37                   ` Matthew Wilcox
@ 2017-08-01  1:11                   ` Liam R. Howlett
  2017-08-01  8:29                     ` Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Liam R. Howlett @ 2017-08-01  1:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz,
	aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
	aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
	vdavydov.dev, willy

* Michal Hocko <mhocko@kernel.org> [170731 10:08]:
> On Mon 31-07-17 09:56:48, Liam R. Howlett wrote:
> > * Michal Hocko <mhocko@kernel.org> [170731 05:10]:
> > > On Fri 28-07-17 21:56:38, Liam R. Howlett wrote:
> > > > * Michal Hocko <mhocko@kernel.org> [170728 08:44]:
> > > > > On Fri 28-07-17 14:23:50, Michal Hocko wrote:
> > > > > > > > Other than that hugetlb pages are not reclaimable by design and users
> > > > > > > > do rely on that. Otherwise they could consider using THP instead.
> > > > > > > > 
> > > > > > > > If somebody configures the initial pool too high it is a configuration
> > > > > > > > bug. Just think about it, we do not want to reset lowmem reserves
> > > > > > > > configured by admin just because we are hitting the oom killer and yes
> > > > > > > > insanely large lowmem reserves might lead to early OOM as well.
> > > > 
> > > > The case I raise is a correctly configured system which has a memory
> > > > module failure.
> > > 
> > > So you are concerned about MCEs due to failing memory modules? If yes
> > > why do you care about hugetlb in particular?
> > 
> > No,  I am concerned about a failed memory module.  The system will
> > detect certain failures, mark the memory as bad and automatically
> > reboot.  Upon rebooting, that module will not be used.
> 
> How do you detect/configure this? We do have HWPoison infrastructure

I don't detect this right now, but I felt I was at a stage where I would
like to send an RFC to try to have this go smoother.  I've not
researched this, but offhand: dmidecode is able to detect that a memory
module is disabled.  This alone would not indicate a failure, but if one
were to see a disabled DIMM and an invalid configuration, it might be
worth pointing out on boot?

> 
> > My focus on hugetlb is that it can stop the automatic recovery of the
> > system.
> 
> How?

Clarified in the thread fork - Thanks Matthew!

> 
> > Are there other reservations that should also be considered?
> 
> What about any other memory reservations by memmap= kernel command line?

I've not seen any other reservation so large that a single failure
causes a failed boot due to OOM, but that doesn't mean they should be
ignored.

>  
> > > > Modern systems will reboot and remove the memory from
> > > > the memory pool.  Linux will start to load and run out of memory.  I get
> > > > that this code has the side effect of doing what you're saying.  Do you
> > > > see this as a worthwhile feature and if so, do you know of a better way
> > > > for me to trigger the behaviour?
> > > 
> > > I do not understand your question. Could you elaborate more please? Are
> > > you talking about system going into OOM because of too many MCEs?
> > 
> > No,  I'm talking about failed memory for whatever reason.  The system
> > reboots by a hardware means (I believe the memory controller) and
> > removes the memory on that failed module from the pool.  Now you
> > effectively have a system with less memory than before which invalidates
> > your configuration.  Is it worth while to have Linux successfully boot
> > when the system attempts to recover from a failure?
> 
> Certainly yes, but if you boot with much less memory and you want to use
> hugetlb pages then you have to reconsider and maybe even reconfigure
> your workload to reflect new conditions. So I am not really sure this
> can be fully automated.
> 

I agree.  A reconfiguration or repair is required for optimum
performance.  Would you agree that having a functioning system is better
than a reboot loop or a hang on panic?  It's also easier to reconfigure
a system that's booting.

> > > > > > > > Nacked-by: Michal Hocko <mhocko@suse.com>
> > > > > > > 
> > > > > > > Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
> > > > > > > something to be considered as last resort after all other measures have
> > > > > > > been tried.
> > > > > > 
> > > > > > System can recover from the OOM killer in most cases and there is no
> > > > > > real reason to break contracts which administrator established. On the
> > > > > > other hand you cannot assume correct operation of the SW which depends
> > > > > > on hugetlb pages in general. Such a SW might get unexpected crashes/data
> > > > > > corruptions and what not.
> > > > 
> > > > My question about allowing the reclaim to happen all the time was like
> > > > Kirill said, if there's memory that's not being used then why panic (or
> > > > kill a task)?  I see that Michal has thought this through though.  My
> > > > intent was to add this as a config option, but it sounds like that's
> > > > also a bad plan.
> > > 
> > > You cannot reclaim something that the administrator has asked for to be
> > > available. Sure we can reclaim the excess if there is any but that is
> > > not what your patch does
> > 
> > I'm looking at the free_huge_pages vs the resv_huge_pages.  I thought
> > the resv_huge_pages were the free pages that are already requested, so
> > if there were more free than reserved then they would be excess?
> 
> The terminology is a little bit confusing here. Hugetlb memory we have
> committed to is reserved (e.g. by mmap), and we surely can have free
> pages on top of resv_huge_pages, but that is not an excess yet. We can
> have surplus pages, which would be an excess over what the admin
> configured initially. See Documentation/vm/{hugetlbpage.txt,hugetlbfs_reserv.txt}
> for more information.

Thank you.  I will revisit this if the patch is considered useful
at the end of the RFC conversation.

Cheers,
Liam


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-07-31 14:49                     ` Michal Hocko
@ 2017-08-01  1:25                       ` Liam R. Howlett
  2017-08-01  8:28                         ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Liam R. Howlett @ 2017-08-01  1:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Kirill A. Shutemov, linux-mm, akpm, n-horiguchi,
	mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd,
	gerald.schaefer, aarcange, oleg, penguin-kernel, mingo,
	kirill.shutemov, vdavydov.dev

* Michal Hocko <mhocko@kernel.org> [170731 10:49]:
> On Mon 31-07-17 07:37:35, Matthew Wilcox wrote:
> > On Mon, Jul 31, 2017 at 04:08:10PM +0200, Michal Hocko wrote:
> > > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote:
> [...]
> > > > My focus on hugetlb is that it can stop the automatic recovery of the
> > > > system.
> > > 
> > > How?
> > 
> > Let me try to explain the situation as I understand it.
> > 
> > The customer has purchased a 128TB machine in order to run a database.
> > They reserve 124TB of memory for use by the database cache.  Everything
> > works great.  Then a 4TB memory module goes bad.  The machine reboots
> > itself in order to return to operation, now having only 124TB of memory
> > and having 124TB of memory reserved.  It OOMs during boot.  The current
> > output from our OOM machinery doesn't point the sysadmin at the kernel
> > command line parameter as now being the problem.  So they file a priority
> > 1 problem ticket ...
> 
> Well, I would argue that the oom report is quite clear that the hugetlb
> memory has consumed the large part if not whole usable memory and that
> should give a clue...

Can you please show me where it's clear?  Are you referring to these
messages?

Node 0 hugepages_total=15999 hugepages_free=15999 hugepages_surp=0
hugepages_size=8192kB
Node 1 hugepages_total=16157 hugepages_free=16157 hugepages_surp=0
hugepages_size=8192kB

I'm not trying to be obtuse; I'm just not sure which message you are
referring to.

> 
> Nevertheless, I can see some merit here, but I am arguing that there
> is simply no good way to handle this without admin involvement
> unless we want to risk other and much more subtle breakage where the
> application really expects it can consume the preallocated hugetlb pool
> completely. And I would even argue that the latter is more probable than
> unintended memory failure reboot cycle.  If somebody can tune hugetlb
> pool dynamically I would recommend doing so from an init script.

I agree that admin involvement is necessary for a full recovery, but
I'm trying to make the best of a bad situation.

Why can't it consume the preallocated hugetlb pool completely?  I'm just
trying to make the pool a little smaller.  I thought that when an
application fails to allocate a hugetlb page it would receive a failure
and need to cope with it?

Thanks,
Liam


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-08-01  1:25                       ` Liam R. Howlett
@ 2017-08-01  8:28                         ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2017-08-01  8:28 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Matthew Wilcox, Kirill A. Shutemov, linux-mm, akpm, n-horiguchi,
	mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd,
	gerald.schaefer, aarcange, oleg, penguin-kernel, mingo,
	kirill.shutemov, vdavydov.dev

On Mon 31-07-17 21:25:42, Liam R. Howlett wrote:
> * Michal Hocko <mhocko@kernel.org> [170731 10:49]:
> > On Mon 31-07-17 07:37:35, Matthew Wilcox wrote:
> > > On Mon, Jul 31, 2017 at 04:08:10PM +0200, Michal Hocko wrote:
> > > > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote:
> > [...]
> > > > > My focus on hugetlb is that it can stop the automatic recovery of the
> > > > > system.
> > > > 
> > > > How?
> > > 
> > > Let me try to explain the situation as I understand it.
> > > 
> > > The customer has purchased a 128TB machine in order to run a database.
> > > They reserve 124TB of memory for use by the database cache.  Everything
> > > works great.  Then a 4TB memory module goes bad.  The machine reboots
> > > itself in order to return to operation, now having only 124TB of memory
> > > and having 124TB of memory reserved.  It OOMs during boot.  The current
> > > output from our OOM machinery doesn't point the sysadmin at the kernel
> > > command line parameter as now being the problem.  So they file a priority
> > > 1 problem ticket ...
> > 
> > Well, I would argue that the oom report is quite clear that the hugetlb
> > memory has consumed the large part if not whole usable memory and that
> > should give a clue...
> 
> Can you please show me where it's clear?  Are you referring to these
> messages?
> 
> Node 0 hugepages_total=15999 hugepages_free=15999 hugepages_surp=0
> hugepages_size=8192kB
> Node 1 hugepages_total=16157 hugepages_free=16157 hugepages_surp=0
> hugepages_size=8192kB
> 
> I'm not trying to be obtuse; I'm just not sure which message you are
> referring to.

Yes, the above is the part of the oom report I had in mind.
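For reference, the per-node lines quoted above can be decoded mechanically to show how much memory the pools hold; a rough sketch (the line format is taken from the example above, not from any stable kernel ABI):

```python
import re

# Decode the per-node hugepages lines from an OOM report.
LINE = re.compile(
    r"Node (?P<node>\d+) hugepages_total=(?P<total>\d+) "
    r"hugepages_free=(?P<free>\d+) hugepages_surp=(?P<surp>\d+) "
    r"hugepages_size=(?P<size>\d+)kB"
)

def hugetlb_kb(report):
    """Sum the memory held in hugetlb pools across all reported nodes."""
    return sum(int(m["total"]) * int(m["size"]) for m in LINE.finditer(report))

report = (
    "Node 0 hugepages_total=15999 hugepages_free=15999 hugepages_surp=0 "
    "hugepages_size=8192kB\n"
    "Node 1 hugepages_total=16157 hugepages_free=16157 hugepages_surp=0 "
    "hugepages_size=8192kB\n"
)
print(hugetlb_kb(report) // (1024 * 1024), "GB held by hugetlb")  # -> 251
```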

> > Nevertheless, I can see some merit here, but I am arguing that there
> > is simply no good way to handle this without admin involvement
> > unless we want to risk other and much more subtle breakage where the
> > application really expects it can consume the preallocated hugetlb pool
> > completely. And I would even argue that the latter is more probable than
> > unintended memory failure reboot cycle.  If somebody can tune hugetlb
> > pool dynamically I would recommend doing so from an init script.
> 
> I agree that an admin involvement is necessary for a full recovery but
> I'm trying to make the best of a bad situation.

Is this situation common enough to justify the risk of breaking existing
userspace which relies on having all hugetlb pages available once they
are configured?

> Why can't it consume the preallocated hugetlb pool completely?

Because it is not so unrealistic to assume that some userspace might
_rely_ on having the pool available at any time with the capacity
configured during initialization. The hugetlb API basically
guarantees that once the pool is preallocated it will never get
reclaimed unless the administrator intervenes.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-08-01  1:11                   ` Liam R. Howlett
@ 2017-08-01  8:29                     ` Michal Hocko
  2017-08-01 14:41                       ` Liam R. Howlett
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-08-01  8:29 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz,
	aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
	aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
	vdavydov.dev, willy

On Mon 31-07-17 21:11:25, Liam R. Howlett wrote:
> * Michal Hocko <mhocko@kernel.org> [170731 10:08]:
> > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote:
[...]
> > > No,  I'm talking about failed memory for whatever reason.  The system
> > > reboots by a hardware means (I believe the memory controller) and
> > > removes the memory on that failed module from the pool.  Now you
> > > effectively have a system with less memory than before which invalidates
> > > your configuration.  Is it worthwhile to have Linux successfully boot
> > > when the system attempts to recover from a failure?
> > 
> > Certainly yes, but if you boot with much less memory and you want to use
> > hugetlb pages then you have to reconsider and maybe even reconfigure
> > your workload to reflect new conditions. So I am not really sure this
> > can be fully automated.
> > 
> 
> > I agree.  A reconfiguration or repair is required for optimum
> > performance.  Would you agree that having a functioning system is better
> > than a reboot loop or a hang on panic?  It's also easier to reconfigure
> > a system that's booting.

Absolutely. The thing is that I am not even sure that the hugetlb
problem is real. Using hugetlb reservation from the boot command line
parameter is easily fixable (just update the boot command line from the
boot loader). From my experience the init time hugetlb initialization
is usually trying to be portable and as such configures a certain
percentage of the available memory for hugetlb (some of them even on per
NUMA node basis). Even if somebody uses hard coded values then this is
something that is fixable during recovery.

That being said, I am not sure you are focusing on a real problem, while
the solution you are proposing might break existing userspace. Please
try to play with your memory recovery feature some more with real
hugetlb usecases (Oracle DB is a heavy user AFAIR) and see what
real-life problems might happen, and we can revisit potential solutions
with more data in hand.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill:  Add support for reclaiming hugepages on OOM events.
  2017-08-01  8:29                     ` Michal Hocko
@ 2017-08-01 14:41                       ` Liam R. Howlett
  0 siblings, 0 replies; 17+ messages in thread
From: Liam R. Howlett @ 2017-08-01 14:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz,
	aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
	aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
	vdavydov.dev, willy

* Michal Hocko <mhocko@kernel.org> [170801 04:30]:
> On Mon 31-07-17 21:11:25, Liam R. Howlett wrote:
> > * Michal Hocko <mhocko@kernel.org> [170731 10:08]:
> > > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote:
> [...]
> > > > No,  I'm talking about failed memory for whatever reason.  The system
> > > > reboots by a hardware means (I believe the memory controller) and
> > > > removes the memory on that failed module from the pool.  Now you
> > > > effectively have a system with less memory than before which invalidates
> > > > your configuration.  Is it worthwhile to have Linux successfully boot
> > > > when the system attempts to recover from a failure?
> > > 
> > > Certainly yes, but if you boot with much less memory and you want to use
> > > hugetlb pages then you have to reconsider and maybe even reconfigure
> > > your workload to reflect new conditions. So I am not really sure this
> > > can be fully automated.
> > > 
> > 
> > I agree.  A reconfiguration or repair is required to have optimum
> > performance.  Would you agree that having functioning system better than
> > a reboot loop or hang on a panic?  It's also easier to reconfigure a
> > system that's booting.
> 
> Absolutely. The thing is that I am not even sure that the hugetlb
> problem is real. Using hugetlb reservation from the boot command line
> parameter is easily fixable (just update the boot command line from the
> boot loader). From my experience the init time hugetlb initialization
> is usually trying to be portable and as such configures a certain
> percentage of the available memory for hugetlb (some of them even on per
> NUMA node basis). Even if somebody uses hard coded values then this is
> something that is fixable during recovery.

This was my thought when I was first assigned the bug for my last patch,
which added the log message for the hugetlb allocation failure, but
during our discussion I was assigned two more near-identical bugs.  From
what I can tell, people following a setup guide do not know how to edit
the GRUB command line easily once in a boot loop, or don't have a decent
enough console setup to do so.  Worse yet, all three of the bugs were
filed as kernel bugs because people didn't even realise it was a setup
issue.  I think the sysctl way of setting the hugetlb pool is the
safest.  But since we provide a kernel command line way of setting it,
it seems reasonable to make the user error as transparent as possible.
This RFC was an extension of looking at how people arrive at an OOM
error on boot when using hugetlb.
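For contrast, the two configuration paths discussed here (the counts are hypothetical; both knobs are described in Documentation/vm/hugetlbpage.txt):

```text
# Kernel command line: fixed at boot, so it keeps requesting the same
# pool even after a failed DIMM shrinks the machine -- the boot-loop case:
#     hugepages=16384

# sysctl, applied by an init script after boot, which can be sized to
# the memory actually present:
#     vm.nr_hugepages = 16384      # e.g. in /etc/sysctl.conf
```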

> 
> That being said, I am not sure you are focusing on a real problem, while
> the solution you are proposing might break existing userspace. Please
> try to play with your memory recovery feature some more with real
> hugetlb usecases (Oracle DB is a heavy user AFAIR) and see what
> real-life problems might happen, and we can revisit potential solutions
> with more data in hand.


Okay, thank you.  I will re-examine the issue and see about a different
approach.  I appreciate the time you have taken to look at my RFC.

Thanks,
Liam


end of thread, other threads:[~2017-08-01 14:42 UTC | newest]

Thread overview: 17+ messages
2017-07-27 18:02 [RFC PATCH 0/1] oom support for reclaiming of hugepages Liam R. Howlett
2017-07-27 18:02 ` [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events Liam R. Howlett
2017-07-28  6:46   ` Michal Hocko
2017-07-28 11:33     ` Kirill A. Shutemov
2017-07-28 12:23       ` Michal Hocko
2017-07-28 12:44         ` Michal Hocko
2017-07-29  1:56           ` Liam R. Howlett
2017-07-31  9:10             ` Michal Hocko
2017-07-31 13:56               ` Liam R. Howlett
2017-07-31 14:08                 ` Michal Hocko
2017-07-31 14:37                   ` Matthew Wilcox
2017-07-31 14:49                     ` Michal Hocko
2017-08-01  1:25                       ` Liam R. Howlett
2017-08-01  8:28                         ` Michal Hocko
2017-08-01  1:11                   ` Liam R. Howlett
2017-08-01  8:29                     ` Michal Hocko
2017-08-01 14:41                       ` Liam R. Howlett
