[RFC PATCH 0/1] oom support for reclaiming of hugepages
From: Liam R. Howlett @ 2017-07-27 18:02 UTC
To: linux-mm
Cc: akpm, mhocko, n-horiguchi, mike.kravetz, aneesh.kumar, khandual,
    punit.agrawal, arnd, gerald.schaefer, aarcange, oleg, penguin-kernel,
    mingo, kirill.shutemov, vdavydov.dev, willy

I'm looking for comments on how to avoid the failure scenario where a
correctly configured system fails to boot after taking corrective action
when a memory module goes bad.

Right now, if a memory event causes a system reboot and causes the UEFI
firmware to remove the failed memory from the memory pool, Linux may be
left without enough memory to boot because of the hugepage reserve.

The patch in its current state reclaims hugepages whenever they are
free, whether the system is booting or not - which may not be desirable,
or maybe it is?

I've looked through how select_bad_process works and do not see a clean
way to hook into this function when the victim is not a task.  I also
could not find a good place to add the CONFIG_HUGETLB_PAGE_OOM option.
Obviously that would need to go somewhere sane.

Liam R. Howlett (1):
  mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM
    events.

 include/linux/hugetlb.h |  1 +
 mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
 mm/oom_kill.c           |  8 ++++++++
 3 files changed, 44 insertions(+)

--
2.13.0.90.g1eb437020

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/
[RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events.
From: Liam R. Howlett @ 2017-07-27 18:02 UTC
To: linux-mm
Cc: akpm, mhocko, n-horiguchi, mike.kravetz, aneesh.kumar, khandual,
    punit.agrawal, arnd, gerald.schaefer, aarcange, oleg, penguin-kernel,
    mingo, kirill.shutemov, vdavydov.dev, willy

When a system runs out of memory it may be desirable to reclaim
unreserved hugepages.  This situation arises when a correctly configured
system has a memory failure and takes the corrective action of rebooting
and removing the memory from the memory pool, which results in a system
failing to boot.  With this change, the out of memory handler is able to
reclaim any pages that are free and not reserved.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/hugetlb.h |  1 +
 mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
 mm/oom_kill.c           |  8 ++++++++
 3 files changed, 44 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8d9fe131a240..20e5729b9e9a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -470,6 +470,7 @@ static inline pgoff_t basepage_index(struct page *page)
 }
 
 extern int dissolve_free_huge_page(struct page *page);
+extern unsigned long decrease_free_hugepages(nodemask_t *nodes);
 extern int dissolve_free_huge_pages(unsigned long start_pfn,
 				    unsigned long end_pfn);
 static inline bool hugepage_migration_supported(struct hstate *h)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc48ee783dd9..00a0e08b96c5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1454,6 +1454,41 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 }
 
 /*
+ * Decrement free hugepages.  Used by oom kill to avoid killing a task
+ * if there are free hugepages that can be used instead.
+ * Returns the number of bytes reclaimed from hugepages.
+ */
+#define CONFIG_HUGETLB_PAGE_OOM
+unsigned long decrease_free_hugepages(nodemask_t *nodes)
+{
+#ifdef CONFIG_HUGETLB_PAGE_OOM
+	struct hstate *h;
+	unsigned long ret = 0;
+
+	spin_lock(&hugetlb_lock);
+	for_each_hstate(h) {
+		if (h->free_huge_pages > h->resv_huge_pages) {
+			char buf[32];
+
+			memfmt(buf, huge_page_size(h));
+			ret = free_pool_huge_page(h, nodes ?
+					nodes : &node_online_map, 0);
+			pr_warn("HugeTLB: Reclaiming %lu hugepage(s) of page size %s\n",
+				ret, buf);
+			ret *= huge_page_size(h);
+			goto found;
+		}
+	}
+
+found:
+	spin_unlock(&hugetlb_lock);
+	return ret;
+#else
+	return 0;
+#endif /* CONFIG_HUGETLB_PAGE_OOM */
+}
+
+/*
  * Dissolve a given free hugepage into free buddy pages. This function does
  * nothing for in-use (including surplus) hugepages. Returns -EBUSY if the
  * number of free hugepages would be reduced below the number of reserved
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..0a42f6d7d253 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -40,6 +40,7 @@
 #include <linux/ratelimit.h>
 #include <linux/kthread.h>
 #include <linux/init.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -1044,6 +1045,13 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	/* Reclaim a free, unreserved hugepage. */
+	freed = decrease_free_hugepages(oc->nodemask);
+	if (freed != 0) {
+		pr_err("Out of memory: Reclaimed %lu from HugeTLB\n", freed);
+		return true;
+	}
+
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
--
2.13.0.90.g1eb437020
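[Editorial note: the accounting in the patch above can be mirrored in user space. The following is an illustrative sketch only - the struct and function names are invented for the example and are not kernel API. It walks a list of hugepage pools and, for the first pool whose free pages exceed its reservations, gives one page back and returns the bytes freed, which is the same check decrease_free_hugepages() performs under hugetlb_lock.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hstate counters the patch consults. */
struct pool {
	unsigned long free_pages;	/* free_huge_pages  */
	unsigned long resv_pages;	/* resv_huge_pages  */
	unsigned long page_size;	/* huge_page_size() */
};

/*
 * User-space mirror of the patch's loop: reclaim one page from the
 * first pool whose free pages exceed its reservations and return the
 * number of bytes that frees, or 0 if every free page is backing a
 * reservation.
 */
unsigned long decrease_free_pool_pages(struct pool *pools, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (pools[i].free_pages > pools[i].resv_pages) {
			pools[i].free_pages--;
			return pools[i].page_size;
		}
	}
	return 0;
}
```

Note that a fully reserved pool is skipped even when it has free pages, which is exactly why Michal's objection below centers on what "free" means here.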
Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events.
From: Michal Hocko @ 2017-07-28 6:46 UTC
To: Liam R. Howlett
Cc: linux-mm, akpm, n-horiguchi, mike.kravetz, aneesh.kumar, khandual,
    punit.agrawal, arnd, gerald.schaefer, aarcange, oleg, penguin-kernel,
    mingo, kirill.shutemov, vdavydov.dev, willy

On Thu 27-07-17 14:02:36, Liam R. Howlett wrote:
> When a system runs out of memory it may be desirable to reclaim
> unreserved hugepages. This situation arises when a correctly
> configured system has a memory failure and takes the corrective action
> of rebooting and removing the memory from the memory pool, which
> results in a system failing to boot. With this change, the out of
> memory handler is able to reclaim any pages that are free and not
> reserved.

I am sorry but I have to Nack this. You are breaking the basic contract
of the hugetlb user API. The administrator configures the pool to suit a
workload. It is a deliberate and privileged action. We allow
overcommitting that pool should there be an immediate need for more
hugetlb pages, and we do remove those when they are freed. If we don't,
then this should be fixed.

Other than that, hugetlb pages are not reclaimable by design and users
do rely on that. Otherwise they could consider using THP instead.

If somebody configures the initial pool too high it is a configuration
bug. Just think about it: we do not want to reset lowmem reserves
configured by the admin just because we are hitting the OOM killer, and
yes, insanely large lowmem reserves might lead to early OOM as well.

Nacked-by: Michal Hocko <mhocko@suse.com>
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events.
From: Kirill A. Shutemov @ 2017-07-28 11:33 UTC
To: Michal Hocko
Cc: Liam R. Howlett, linux-mm, akpm, n-horiguchi, mike.kravetz,
    aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
    aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
    vdavydov.dev, willy

On Fri, Jul 28, 2017 at 08:46:02AM +0200, Michal Hocko wrote:
> On Thu 27-07-17 14:02:36, Liam R. Howlett wrote:
> > When a system runs out of memory it may be desirable to reclaim
> > unreserved hugepages. This situation arises when a correctly
> > configured system has a memory failure and takes the corrective
> > action of rebooting and removing the memory from the memory pool,
> > which results in a system failing to boot. With this change, the out
> > of memory handler is able to reclaim any pages that are free and not
> > reserved.
>
> I am sorry but I have to Nack this. You are breaking the basic
> contract of the hugetlb user API. The administrator configures the
> pool to suit a workload. It is a deliberate and privileged action. We
> allow overcommitting that pool should there be an immediate need for
> more hugetlb pages, and we do remove those when they are freed. If we
> don't, then this should be fixed.
>
> Other than that, hugetlb pages are not reclaimable by design and users
> do rely on that. Otherwise they could consider using THP instead.
>
> If somebody configures the initial pool too high it is a configuration
> bug. Just think about it: we do not want to reset lowmem reserves
> configured by the admin just because we are hitting the OOM killer,
> and yes, insanely large lowmem reserves might lead to early OOM as
> well.
>
> Nacked-by: Michal Hocko <mhocko@suse.com>

Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
something to be considered a last resort after all other measures have
been tried.

I think we can allow hugetlb reclaim just to keep the system alive,
taint the kernel and indicate "reboot needed".

The situation is somewhat similar to BUG() vs. WARN().

--
Kirill A. Shutemov
Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events.
From: Michal Hocko @ 2017-07-28 12:23 UTC
To: Kirill A. Shutemov
Cc: Liam R. Howlett, linux-mm, akpm, n-horiguchi, mike.kravetz,
    aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
    aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
    vdavydov.dev, willy

On Fri 28-07-17 14:33:47, Kirill A. Shutemov wrote:
> On Fri, Jul 28, 2017 at 08:46:02AM +0200, Michal Hocko wrote:
> > On Thu 27-07-17 14:02:36, Liam R. Howlett wrote:
> > > When a system runs out of memory it may be desirable to reclaim
> > > unreserved hugepages. This situation arises when a correctly
> > > configured system has a memory failure and takes the corrective
> > > action of rebooting and removing the memory from the memory pool,
> > > which results in a system failing to boot. With this change, the
> > > out of memory handler is able to reclaim any pages that are free
> > > and not reserved.
> >
> > I am sorry but I have to Nack this. You are breaking the basic
> > contract of the hugetlb user API. The administrator configures the
> > pool to suit a workload. It is a deliberate and privileged action.
> > We allow overcommitting that pool should there be an immediate need
> > for more hugetlb pages, and we do remove those when they are freed.
> > If we don't, then this should be fixed.
> >
> > Other than that, hugetlb pages are not reclaimable by design and
> > users do rely on that. Otherwise they could consider using THP
> > instead.
> >
> > If somebody configures the initial pool too high it is a
> > configuration bug. Just think about it: we do not want to reset
> > lowmem reserves configured by the admin just because we are hitting
> > the OOM killer, and yes, insanely large lowmem reserves might lead
> > to early OOM as well.
> >
> > Nacked-by: Michal Hocko <mhocko@suse.com>
>
> Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
> something to be considered a last resort after all other measures have
> been tried.

The system can recover from the OOM killer in most cases and there is no
real reason to break contracts which the administrator established. On
the other hand, you cannot assume correct operation of the SW which
depends on hugetlb pages in general. Such SW might get unexpected
crashes/data corruptions and what not.

> I think we can allow hugetlb reclaim just to keep the system alive,
> taint the kernel and indicate "reboot needed".

Let me repeat, this is an admin-only configuration and we cannot pretend
to fix those decisions. If there is a configuration issue it should be
fixed, and the OOM killer along with the OOM report explaining the
situation is the best way forward IMO.
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events.
From: Michal Hocko @ 2017-07-28 12:44 UTC
To: Kirill A. Shutemov
Cc: Liam R. Howlett, linux-mm, akpm, n-horiguchi, mike.kravetz,
    aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
    aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
    vdavydov.dev, willy

On Fri 28-07-17 14:23:50, Michal Hocko wrote:
> On Fri 28-07-17 14:33:47, Kirill A. Shutemov wrote:
> > On Fri, Jul 28, 2017 at 08:46:02AM +0200, Michal Hocko wrote:
> > > On Thu 27-07-17 14:02:36, Liam R. Howlett wrote:
> > > > When a system runs out of memory it may be desirable to reclaim
> > > > unreserved hugepages. This situation arises when a correctly
> > > > configured system has a memory failure and takes the corrective
> > > > action of rebooting and removing the memory from the memory
> > > > pool, which results in a system failing to boot. With this
> > > > change, the out of memory handler is able to reclaim any pages
> > > > that are free and not reserved.
> > >
> > > I am sorry but I have to Nack this. You are breaking the basic
> > > contract of the hugetlb user API. The administrator configures the
> > > pool to suit a workload. It is a deliberate and privileged action.
> > > We allow overcommitting that pool should there be an immediate
> > > need for more hugetlb pages, and we do remove those when they are
> > > freed. If we don't, then this should be fixed.
> > > Other than that, hugetlb pages are not reclaimable by design and
> > > users do rely on that. Otherwise they could consider using THP
> > > instead.
> > >
> > > If somebody configures the initial pool too high it is a
> > > configuration bug. Just think about it: we do not want to reset
> > > lowmem reserves configured by the admin just because we are
> > > hitting the OOM killer, and yes, insanely large lowmem reserves
> > > might lead to early OOM as well.
> > >
> > > Nacked-by: Michal Hocko <mhocko@suse.com>
> >
> > Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
> > something to be considered a last resort after all other measures
> > have been tried.
>
> The system can recover from the OOM killer in most cases and there is
> no real reason to break contracts which the administrator established.
> On the other hand, you cannot assume correct operation of the SW which
> depends on hugetlb pages in general. Such SW might get unexpected
> crashes/data corruptions and what not.

And to be clear: the memory hotplug code currently does a similar thing
via dissolve_free_huge_pages, and I believe that is wrong as well,
although one could argue that memory offline is an admin action too, so
reducing hugetlb pages is a reasonable thing to do. That would be for a
separate discussion, though.

But OOM can happen for entirely different reasons while hugetlb is
configured properly, and this change would simply break that setup.
This is simply a no-go.
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events.
From: Liam R. Howlett @ 2017-07-29 1:56 UTC
To: Michal Hocko
Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz,
    aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
    aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
    vdavydov.dev, willy

* Michal Hocko <mhocko@kernel.org> [170728 08:44]:
> On Fri 28-07-17 14:23:50, Michal Hocko wrote:
> > On Fri 28-07-17 14:33:47, Kirill A. Shutemov wrote:
> > > On Fri, Jul 28, 2017 at 08:46:02AM +0200, Michal Hocko wrote:
> > > > On Thu 27-07-17 14:02:36, Liam R. Howlett wrote:
> > > > > When a system runs out of memory it may be desirable to
> > > > > reclaim unreserved hugepages. This situation arises when a
> > > > > correctly configured system has a memory failure and takes the
> > > > > corrective action of rebooting and removing the memory from
> > > > > the memory pool, which results in a system failing to boot.
> > > > > With this change, the out of memory handler is able to reclaim
> > > > > any pages that are free and not reserved.
> > > >
> > > > I am sorry but I have to Nack this. You are breaking the basic
> > > > contract of the hugetlb user API. The administrator configures
> > > > the pool to suit a workload. It is a deliberate and privileged
> > > > action. We allow overcommitting that pool should there be an
> > > > immediate need for more hugetlb pages, and we do remove those
> > > > when they are freed. If we don't, then this should be fixed.

This is certainly a work in progress and I appreciate you taking the
time to point out the issues. I didn't mean to suggest merging this as
it is today.

> > > > Other than that, hugetlb pages are not reclaimable by design and
> > > > users do rely on that. Otherwise they could consider using THP
> > > > instead.
> > > >
> > > > If somebody configures the initial pool too high it is a
> > > > configuration bug. Just think about it: we do not want to reset
> > > > lowmem reserves configured by the admin just because we are
> > > > hitting the OOM killer, and yes, insanely large lowmem reserves
> > > > might lead to early OOM as well.

The case I raise is a correctly configured system which has a memory
module failure. Modern systems will reboot and remove the memory from
the memory pool. Linux will start to load and run out of memory. I get
that this code has the side effect of doing what you're saying. Do you
see this as a worthwhile feature and if so, do you know of a better way
for me to trigger the behaviour?

> > > > Nacked-by: Michal Hocko <mhocko@suse.com>
> > >
> > > Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb
> > > is something to be considered a last resort after all other
> > > measures have been tried.
> >
> > The system can recover from the OOM killer in most cases and there
> > is no real reason to break contracts which the administrator
> > established. On the other hand, you cannot assume correct operation
> > of the SW which depends on hugetlb pages in general. Such SW might
> > get unexpected crashes/data corruptions and what not.

My question about allowing the reclaim to happen all the time was, like
Kirill said: if there's memory that's not being used, then why panic (or
kill a task)? I see that Michal has thought this through, though. My
intent was to add this as a config option, but it sounds like that's
also a bad plan.

> And to be clear: the memory hotplug code currently does a similar
> thing via dissolve_free_huge_pages, and I believe that is wrong as
> well, although one could argue that memory offline is an admin action
> too, so reducing hugetlb pages is a reasonable thing to do. That would
> be for a separate discussion, though.
>
> But OOM can happen for entirely different reasons while hugetlb is
> configured properly, and this change would simply break that setup.
> This is simply a no-go.

Yes, this patch is certainly not the final version for that specific
reason. I didn't see a good way to plug into the OOM path and was
looking for suggestions. Sorry if that was not clear.

The root problem I'm trying to solve isn't a misconfiguration but to
cover the case of the system recovering from a failure while Linux will
not. Here are a few other ideas that may or may not be better (or sane):

Would specifying a percentage of memory instead of a specific number be
a better approach than reclaiming? That would still leave those who use
hard values vulnerable, but at least provide an alternative that is
safer. It's also a pretty brutal interface for someone to use.

We could figure out that there's a bad memory module and enable this on
boot only? I am unclear on how to do either of those, but in combination
it would allow the issue to be detected and avoid failures. I have
looked into detecting when we're booting and have not had much luck
there. I believe dmidecode can pick up disabled modules, so that part
should be plausible? Would enabling this code only when a disabled
memory module exists, and only during boot, be acceptable?

Disable all hugepages passed when there's a disabled memory module and
throw a WARN?

Are there any other options?

Thank you both for your comments and time,
Liam
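[Editorial note: to make the "percentage of memory" idea above concrete, here is an illustrative sketch. No such kernel tunable exists; the function name and the sizing rule are invented for the example. The point is that the pool would be derived from however much memory actually survived the reboot, so losing a module shrinks the pool instead of starving the rest of the system.]

```c
/*
 * Hypothetical "percentage of memory" hugepage sizing: derive the boot
 * pool from the memory actually present, in whole hugepages.  Counting
 * in hugepages first avoids rounding away most of a page in the
 * integer division.
 */
unsigned long hugepages_from_percent(unsigned long total_bytes,
				     unsigned int percent,
				     unsigned long hugepage_bytes)
{
	if (percent > 100 || hugepage_bytes == 0)
		return 0;	/* reject nonsensical configuration */
	return total_bytes / hugepage_bytes * percent / 100;
}
```

With this rule, a system that boots with half its memory after a module failure would simply get half the hugepage pool rather than failing to boot.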
Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events.
From: Michal Hocko @ 2017-07-31 9:10 UTC
To: Liam R. Howlett
Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz,
    aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
    aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
    vdavydov.dev, willy

On Fri 28-07-17 21:56:38, Liam R. Howlett wrote:
> * Michal Hocko <mhocko@kernel.org> [170728 08:44]:
> > On Fri 28-07-17 14:23:50, Michal Hocko wrote:
> > > > > Other than that, hugetlb pages are not reclaimable by design
> > > > > and users do rely on that. Otherwise they could consider using
> > > > > THP instead.
> > > > >
> > > > > If somebody configures the initial pool too high it is a
> > > > > configuration bug. Just think about it: we do not want to
> > > > > reset lowmem reserves configured by the admin just because we
> > > > > are hitting the OOM killer, and yes, insanely large lowmem
> > > > > reserves might lead to early OOM as well.
>
> The case I raise is a correctly configured system which has a memory
> module failure.

So you are concerned about MCEs due to failing memory modules? If yes,
why do you care about hugetlb in particular?

> Modern systems will reboot and remove the memory from the memory pool.
> Linux will start to load and run out of memory. I get that this code
> has the side effect of doing what you're saying. Do you see this as a
> worthwhile feature and if so, do you know of a better way for me to
> trigger the behaviour?

I do not understand your question. Could you elaborate more please? Are
you talking about the system going into OOM because of too many MCEs?

> > > > > Nacked-by: Michal Hocko <mhocko@suse.com>
> > > >
> > > > Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb
> > > > is something to be considered a last resort after all other
> > > > measures have been tried.
> > >
> > > The system can recover from the OOM killer in most cases and there
> > > is no real reason to break contracts which the administrator
> > > established. On the other hand, you cannot assume correct
> > > operation of the SW which depends on hugetlb pages in general.
> > > Such SW might get unexpected crashes/data corruptions and what
> > > not.
>
> My question about allowing the reclaim to happen all the time was,
> like Kirill said: if there's memory that's not being used, then why
> panic (or kill a task)? I see that Michal has thought this through,
> though. My intent was to add this as a config option, but it sounds
> like that's also a bad plan.

You cannot reclaim something that the administrator has asked to be
available. Sure, we can reclaim the excess if there is any, but that is
not what your patch does.

> > > And to be clear: the memory hotplug code currently does a similar
> > > thing via dissolve_free_huge_pages, and I believe that is wrong as
> > > well, although one could argue that memory offline is an admin
> > > action too, so reducing hugetlb pages is a reasonable thing to do.
> > > That would be for a separate discussion, though.
> > >
> > > But OOM can happen for entirely different reasons while hugetlb is
> > > configured properly, and this change would simply break that
> > > setup. This is simply a no-go.
>
> Yes, this patch is certainly not the final version for that specific
> reason. I didn't see a good way to plug into the OOM path and was
> looking for suggestions. Sorry if that was not clear.
>
> The root problem I'm trying to solve isn't a misconfiguration but to
> cover the case of the system recovering from a failure while Linux
> will not.

Please be more specific about what you mean by "failure". It is hard to
comment on further things without a clear definition of the problem you
are trying to address.
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events.
From: Liam R. Howlett @ 2017-07-31 13:56 UTC
To: Michal Hocko
Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz,
    aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer,
    aarcange, oleg, penguin-kernel, mingo, kirill.shutemov,
    vdavydov.dev, willy

* Michal Hocko <mhocko@kernel.org> [170731 05:10]:
> On Fri 28-07-17 21:56:38, Liam R. Howlett wrote:
> > * Michal Hocko <mhocko@kernel.org> [170728 08:44]:
> > > On Fri 28-07-17 14:23:50, Michal Hocko wrote:
> > > > > > Other than that, hugetlb pages are not reclaimable by design
> > > > > > and users do rely on that. Otherwise they could consider
> > > > > > using THP instead.
> > > > > >
> > > > > > If somebody configures the initial pool too high it is a
> > > > > > configuration bug. Just think about it: we do not want to
> > > > > > reset lowmem reserves configured by the admin just because
> > > > > > we are hitting the OOM killer, and yes, insanely large
> > > > > > lowmem reserves might lead to early OOM as well.
> >
> > The case I raise is a correctly configured system which has a memory
> > module failure.
>
> So you are concerned about MCEs due to failing memory modules? If yes,
> why do you care about hugetlb in particular?

No, I am concerned about a failed memory module. The system will detect
certain failures, mark the memory as bad and automatically reboot. Upon
rebooting, that module will not be used. My focus on hugetlb is that it
can stop the automatic recovery of the system. Are there other
reservations that should also be considered?

> > Modern systems will reboot and remove the memory from the memory
> > pool. Linux will start to load and run out of memory. I get that
> > this code has the side effect of doing what you're saying. Do you
> > see this as a worthwhile feature and if so, do you know of a better
> > way for me to trigger the behaviour?
>
> I do not understand your question. Could you elaborate more please?
> Are you talking about the system going into OOM because of too many
> MCEs?

No, I'm talking about failed memory for whatever reason. The system
reboots by a hardware means (I believe the memory controller) and
removes the memory on that failed module from the pool. Now you
effectively have a system with less memory than before, which
invalidates your configuration. Is it worthwhile to have Linux
successfully boot when the system attempts to recover from a failure?

> > > > > > Nacked-by: Michal Hocko <mhocko@suse.com>
> > > > >
> > > > > Hm. I'm not sure it's fully justified. To me, reclaiming
> > > > > hugetlb is something to be considered a last resort after all
> > > > > other measures have been tried.
> > > >
> > > > The system can recover from the OOM killer in most cases and
> > > > there is no real reason to break contracts which the
> > > > administrator established. On the other hand, you cannot assume
> > > > correct operation of the SW which depends on hugetlb pages in
> > > > general. Such SW might get unexpected crashes/data corruptions
> > > > and what not.
> >
> > My question about allowing the reclaim to happen all the time was,
> > like Kirill said: if there's memory that's not being used, then why
> > panic (or kill a task)? I see that Michal has thought this through,
> > though. My intent was to add this as a config option, but it sounds
> > like that's also a bad plan.
>
> You cannot reclaim something that the administrator has asked to be
> available. Sure, we can reclaim the excess if there is any, but that
> is not what your patch does.

I'm looking at free_huge_pages vs resv_huge_pages. I thought
resv_huge_pages were the free pages that are already requested, so if
there were more free than reserved then they would be excess?

> > > And to be clear: the memory hotplug code currently does a similar
> > > thing via dissolve_free_huge_pages, and I believe that is wrong as
> > > well, although one could argue that memory offline is an admin
> > > action too, so reducing hugetlb pages is a reasonable thing to do.
> > > That would be for a separate discussion, though.
> > >
> > > But OOM can happen for entirely different reasons while hugetlb is
> > > configured properly, and this change would simply break that
> > > setup. This is simply a no-go.
> >
> > Yes, this patch is certainly not the final version for that specific
> > reason. I didn't see a good way to plug into the OOM path and was
> > looking for suggestions. Sorry if that was not clear.
> >
> > The root problem I'm trying to solve isn't a misconfiguration but to
> > cover the case of the system recovering from a failure while Linux
> > will not.
>
> Please be more specific about what you mean by "failure". It is hard
> to comment on further things without a clear definition of the problem
> you are trying to address.

The failure is when you have a memory module which is detected as bad
for any reason by the hardware, which causes the memory to be marked bad
and removed from the memory pool on an automatic reboot. Sorry for not
being clear here; I thought my comments would have been clear from what
I stated in RFC PATCH 0/1. I hope this helps clarify things.

Thanks,
Liam
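[Editorial note: the free_huge_pages vs resv_huge_pages question above is the crux of the disagreement, so here is an illustrative user-space sketch separating the two notions of "excess". The struct and field names are invented for the example, not kernel API: pages that merely have no reservation against them yet (free minus reserved, what the patch reclaims) versus surplus pages sitting above the admin-configured persistent pool (what may legitimately be removed without breaking the admin's contract).]

```c
/* Per-pool counters with invented field names for illustration. */
struct hugetlb_counters {
	unsigned long nr_pages;		/* pages currently in the pool */
	unsigned long free_pages;	/* pages not handed out yet    */
	unsigned long resv_pages;	/* free pages already promised */
	unsigned long persistent;	/* admin-configured pool size  */
};

/* What the patch treats as reclaimable: free pages with no
 * reservation against them (yet). */
unsigned long unreserved_free(const struct hugetlb_counters *c)
{
	return c->free_pages > c->resv_pages ?
	       c->free_pages - c->resv_pages : 0;
}

/* What can shrink without touching the admin's configured pool:
 * only overcommitted (surplus) pages above the persistent size. */
unsigned long surplus_pages(const struct hugetlb_counters *c)
{
	return c->nr_pages > c->persistent ?
	       c->nr_pages - c->persistent : 0;
}
```

A pool can have many unreserved free pages while having zero surplus pages; in that case the patch would shrink the pool the administrator deliberately configured, which is the behavior being objected to.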
* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events. 2017-07-31 13:56 ` Liam R. Howlett @ 2017-07-31 14:08 ` Michal Hocko 2017-07-31 14:37 ` Matthew Wilcox 2017-08-01 1:11 ` Liam R. Howlett 0 siblings, 2 replies; 17+ messages in thread From: Michal Hocko @ 2017-07-31 14:08 UTC (permalink / raw) To: Liam R. Howlett Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer, aarcange, oleg, penguin-kernel, mingo, kirill.shutemov, vdavydov.dev, willy On Mon 31-07-17 09:56:48, Liam R. Howlett wrote: > * Michal Hocko <mhocko@kernel.org> [170731 05:10]: > > On Fri 28-07-17 21:56:38, Liam R. Howlett wrote: > > > * Michal Hocko <mhocko@kernel.org> [170728 08:44]: > > > > On Fri 28-07-17 14:23:50, Michal Hocko wrote: > > > > > > > Other than that hugetlb pages are not reclaimable by design and users > > > > > > > do rely on that. Otherwise they could consider using THP instead. > > > > > > > > > > > > > > If somebody configures the initial pool too high it is a configuration > > > > > > > bug. Just think about it, we do not want to reset lowmem reserves > > > > > > > configured by admin just because we are hitting the oom killer and yes > > > > > > > insanely large lowmem reserves might lead to early OOM as well. > > > > > > The case I raise is a correctly configured system which has a memory > > > module failure. > > > > So you are concerned about MCEs due to failing memory modules? If yes > > why do you care about hugetlb in particular? > > No, I am concerned about a failed memory module. The system will > detect certain failures, mark the memory as bad and automatically > reboot. Up on rebooting, that module will not be used. How do you detect/configure this? We do have HWPoison infrastructure > My focus on hugetlb is that it can stop the automatic recovery of the > system. How? > Are there other reservations that should also be considered? 
What about any other memory reservations by the memmap= kernel command line?

> > > Modern systems will reboot and remove the memory from
> > > the memory pool. Linux will start to load and run out of memory. I get
> > > that this code has the side effect of doing what you're saying. Do you
> > > see this as a worthwhile feature and if so, do you know of a better way
> > > for me to trigger the behaviour?
> >
> > I do not understand your question. Could you elaborate more please? Are
> > you talking about the system going into OOM because of too many MCEs?
>
> No, I'm talking about failed memory for whatever reason. The system
> reboots by a hardware means (I believe the memory controller) and
> removes the memory on that failed module from the pool. Now you
> effectively have a system with less memory than before, which invalidates
> your configuration. Is it worthwhile to have Linux successfully boot
> when the system attempts to recover from a failure?

Certainly yes, but if you boot with much less memory and you want to use
hugetlb pages then you have to reconsider and maybe even reconfigure
your workload to reflect the new conditions. So I am not really sure this
can be fully automated.

> > > > > > > Nacked-by: Michal Hocko <mhocko@suse.com>
> > > > > >
> > > > > > Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is
> > > > > > something to be considered as a last resort after all other measures have
> > > > > > been tried.
> > > > >
> > > > > The system can recover from the OOM killer in most cases and there is no
> > > > > real reason to break contracts which the administrator established. On the
> > > > > other hand you cannot assume correct operation of the SW which depends
> > > > > on hugetlb pages in general. Such a SW might get unexpected crashes/data
> > > > > corruptions and what not.
> > > My question about allowing the reclaim to happen all the time was like
> > > Kirill said, if there's memory that's not being used then why panic (or
> > > kill a task)? I see that Michal has thought this through though. My
> > > intent was to add this as a config option, but it sounds like that's
> > > also a bad plan.
> >
> > You cannot reclaim something that the administrator has asked to be
> > available. Sure we can reclaim the excess if there is any, but that is
> > not what your patch does.
>
> I'm looking at the free_huge_pages vs the resv_huge_pages. I thought
> the resv_huge_pages were the free pages that are already requested, so
> if there were more free than reserved then they would be excess?

The terminology is a little confusing here. Hugetlb memory we have
committed to is reserved (e.g. by mmap) and we surely can have free
pages on top of resv_huge_pages, but that is not an excess yet. We can
have surplus pages which would be an excess over what the admin configured
initially. See Documentation/vm/{hugetlbpage.txt,hugetlbfs_reserv.txt}
for more information.

--
Michal Hocko
SUSE Labs
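Michal's distinction between free, reserved, and surplus pages maps onto counters the kernel already exports in /proc/meminfo. As a hedged illustration (the `HugePages_*` field names are the real meminfo fields, but the helper name and sample values below are invented), the "free but not yet committed" portion of the pool can be computed like this:

```shell
# Read meminfo-style text on stdin and print the number of hugepages that
# are free and not covered by a reservation. Note: Free minus Rsvd is still
# part of the pool the admin asked for -- only HugePages_Surp is true excess.
unreserved_hugepages() {
    awk '/^HugePages_Free:/ {free = $2}
         /^HugePages_Rsvd:/ {rsvd = $2}
         END {print free - rsvd}'
}

# On a live system: unreserved_hugepages < /proc/meminfo
# With invented sample values:
printf 'HugePages_Total:    1024\nHugePages_Free:      512\nHugePages_Rsvd:      128\nHugePages_Surp:        0\n' \
    | unreserved_hugepages    # prints 384
```

This is exactly the quantity Liam's patch would have reclaimed, and Michal's point is that those 384 pages are still promised to userspace even though nothing has reserved them yet.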
* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events. 2017-07-31 14:08 ` Michal Hocko @ 2017-07-31 14:37 ` Matthew Wilcox 2017-07-31 14:49 ` Michal Hocko 2017-08-01 1:11 ` Liam R. Howlett 1 sibling, 1 reply; 17+ messages in thread From: Matthew Wilcox @ 2017-07-31 14:37 UTC (permalink / raw) To: Michal Hocko Cc: Liam R. Howlett, Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer, aarcange, oleg, penguin-kernel, mingo, kirill.shutemov, vdavydov.dev On Mon, Jul 31, 2017 at 04:08:10PM +0200, Michal Hocko wrote: > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote: > > * Michal Hocko <mhocko@kernel.org> [170731 05:10]: > > > On Fri 28-07-17 21:56:38, Liam R. Howlett wrote: > > > > The case I raise is a correctly configured system which has a memory > > > > module failure. > > > > > > So you are concerned about MCEs due to failing memory modules? If yes > > > why do you care about hugetlb in particular? > > > > No, I am concerned about a failed memory module. The system will > > detect certain failures, mark the memory as bad and automatically > > reboot. Up on rebooting, that module will not be used. > > How do you detect/configure this? We do have HWPoison infrastructure > > > My focus on hugetlb is that it can stop the automatic recovery of the > > system. > > How? Let me try to explain the situation as I understand it. The customer has purchased a 128TB machine in order to run a database. They reserve 124TB of memory for use by the database cache. Everything works great. Then a 4TB memory module goes bad. The machine reboots itself in order to return to operation, now having only 124TB of memory and having 124TB of memory reserved. It OOMs during boot. The current output from our OOM machinery doesn't point the sysadmin at the kernel command line parameter as now being the problem. So they file a priority 1 problem ticket ... 
* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events. 2017-07-31 14:37 ` Matthew Wilcox @ 2017-07-31 14:49 ` Michal Hocko 2017-08-01 1:25 ` Liam R. Howlett 0 siblings, 1 reply; 17+ messages in thread From: Michal Hocko @ 2017-07-31 14:49 UTC (permalink / raw) To: Matthew Wilcox Cc: Liam R. Howlett, Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer, aarcange, oleg, penguin-kernel, mingo, kirill.shutemov, vdavydov.dev On Mon 31-07-17 07:37:35, Matthew Wilcox wrote: > On Mon, Jul 31, 2017 at 04:08:10PM +0200, Michal Hocko wrote: > > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote: [...] > > > My focus on hugetlb is that it can stop the automatic recovery of the > > > system. > > > > How? > > Let me try to explain the situation as I understand it. > > The customer has purchased a 128TB machine in order to run a database. > They reserve 124TB of memory for use by the database cache. Everything > works great. Then a 4TB memory module goes bad. The machine reboots > itself in order to return to operation, now having only 124TB of memory > and having 124TB of memory reserved. It OOMs during boot. The current > output from our OOM machinery doesn't point the sysadmin at the kernel > command line parameter as now being the problem. So they file a priority > 1 problem ticket ... Well, I would argue that the oom report is quite clear that the hugetlb memory has consumed the large part if not whole usable memory and that should give a clue... Nevertheless, I can see some merit here, but I am arguing that there is simply no good way to handle this without admin involvement unless we want to risk other and much more subtle breakage where the application really expects it can consume the preallocated hugetlb pool completely. And I would even argue that the later is more probable than unintended memory failure reboot cycle. 
If somebody can tune the hugetlb pool dynamically I would recommend doing
so from an init script.

--
Michal Hocko
SUSE Labs
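The init-script approach Michal recommends can be sketched as follows. This is an illustrative fragment, not anything from the thread: the 80% ratio, the `pool_pages` helper name, and the final sysctl invocation are all assumptions.

```shell
# Size the hugepage pool from the memory actually present at boot, so a
# machine that comes back up with less RAM after a module failure simply
# gets a proportionally smaller pool instead of OOMing during boot.
#   args: total memory in kB, hugepage size in kB, percent of RAM to use
pool_pages() {
    echo $(( $1 * $3 / 100 / $2 ))
}

# On a real system the inputs would come from /proc/meminfo and the result
# would be applied (as root) with something like:
#   sysctl -w vm.nr_hugepages="$(pool_pages "$memtotal_kb" "$hpage_kb" 80)"
pool_pages 134217728 2048 80    # 128 GiB of RAM, 2 MiB pages, 80% -> 52428
```

Because the ratio is recomputed on every boot, the failure scenario in this thread degrades to a smaller cache rather than a boot loop.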
* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events. 2017-07-31 14:49 ` Michal Hocko @ 2017-08-01 1:25 ` Liam R. Howlett 2017-08-01 8:28 ` Michal Hocko 0 siblings, 1 reply; 17+ messages in thread From: Liam R. Howlett @ 2017-08-01 1:25 UTC (permalink / raw) To: Michal Hocko Cc: Matthew Wilcox, Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer, aarcange, oleg, penguin-kernel, mingo, kirill.shutemov, vdavydov.dev * Michal Hocko <mhocko@kernel.org> [170731 10:49]: > On Mon 31-07-17 07:37:35, Matthew Wilcox wrote: > > On Mon, Jul 31, 2017 at 04:08:10PM +0200, Michal Hocko wrote: > > > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote: > [...] > > > > My focus on hugetlb is that it can stop the automatic recovery of the > > > > system. > > > > > > How? > > > > Let me try to explain the situation as I understand it. > > > > The customer has purchased a 128TB machine in order to run a database. > > They reserve 124TB of memory for use by the database cache. Everything > > works great. Then a 4TB memory module goes bad. The machine reboots > > itself in order to return to operation, now having only 124TB of memory > > and having 124TB of memory reserved. It OOMs during boot. The current > > output from our OOM machinery doesn't point the sysadmin at the kernel > > command line parameter as now being the problem. So they file a priority > > 1 problem ticket ... > > Well, I would argue that the oom report is quite clear that the hugetlb > memory has consumed the large part if not whole usable memory and that > should give a clue... Can you please show me where it's clear? Are you referring to these messages? 
Node 0 hugepages_total=15999 hugepages_free=15999 hugepages_surp=0 hugepages_size=8192kB
Node 1 hugepages_total=16157 hugepages_free=16157 hugepages_surp=0 hugepages_size=8192kB

I'm not trying to be obtuse, I'm just not sure which message you are
referring to.

> Nevertheless, I can see some merit here, but I am arguing that there
> is simply no good way to handle this without admin involvement
> unless we want to risk other and much more subtle breakage where the
> application really expects it can consume the preallocated hugetlb pool
> completely. And I would even argue that the latter is more probable than
> an unintended memory failure reboot cycle. If somebody can tune the hugetlb
> pool dynamically I would recommend doing so from an init script.

I agree that admin involvement is necessary for a full recovery but
I'm trying to make the best of a bad situation. Why can't it consume the
preallocated hugetlb pool completely? I'm just trying to make the pool a
little smaller. I thought that when the application fails to allocate a
hugetlb page it would receive a failure and need to cope with the
allocation failure?

Thanks,
Liam
* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events. 2017-08-01 1:25 ` Liam R. Howlett @ 2017-08-01 8:28 ` Michal Hocko 0 siblings, 0 replies; 17+ messages in thread From: Michal Hocko @ 2017-08-01 8:28 UTC (permalink / raw) To: Liam R. Howlett Cc: Matthew Wilcox, Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer, aarcange, oleg, penguin-kernel, mingo, kirill.shutemov, vdavydov.dev On Mon 31-07-17 21:25:42, Liam R. Howlett wrote: > * Michal Hocko <mhocko@kernel.org> [170731 10:49]: > > On Mon 31-07-17 07:37:35, Matthew Wilcox wrote: > > > On Mon, Jul 31, 2017 at 04:08:10PM +0200, Michal Hocko wrote: > > > > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote: > > [...] > > > > > My focus on hugetlb is that it can stop the automatic recovery of the > > > > > system. > > > > > > > > How? > > > > > > Let me try to explain the situation as I understand it. > > > > > > The customer has purchased a 128TB machine in order to run a database. > > > They reserve 124TB of memory for use by the database cache. Everything > > > works great. Then a 4TB memory module goes bad. The machine reboots > > > itself in order to return to operation, now having only 124TB of memory > > > and having 124TB of memory reserved. It OOMs during boot. The current > > > output from our OOM machinery doesn't point the sysadmin at the kernel > > > command line parameter as now being the problem. So they file a priority > > > 1 problem ticket ... > > > > Well, I would argue that the oom report is quite clear that the hugetlb > > memory has consumed the large part if not whole usable memory and that > > should give a clue... > > Can you please show me where it's clear? Are you referring to these > messages? 
> Node 0 hugepages_total=15999 hugepages_free=15999 hugepages_surp=0
> hugepages_size=8192kB
> Node 1 hugepages_total=16157 hugepages_free=16157 hugepages_surp=0
> hugepages_size=8192kB
>
> I'm not trying to be obtuse, I'm just not sure which message you are
> referring to.

Yes, the above is the part of the oom report I had in mind.

> > Nevertheless, I can see some merit here, but I am arguing that there
> > is simply no good way to handle this without admin involvement
> > unless we want to risk other and much more subtle breakage where the
> > application really expects it can consume the preallocated hugetlb pool
> > completely. And I would even argue that the latter is more probable than
> > an unintended memory failure reboot cycle. If somebody can tune the hugetlb
> > pool dynamically I would recommend doing so from an init script.
>
> I agree that admin involvement is necessary for a full recovery but
> I'm trying to make the best of a bad situation.

Is this situation so common as to risk breaking existing userspace which
relies on having all hugetlb pages available once they are configured?

> Why can't it consume the preallocated hugetlb pool completely?

Because it is not so unrealistic to assume that some userspace might
_rely_ on having the pool available at any time with the capacity
configured during initialization. The hugetlb API basically guarantees
that once the pool is preallocated it will never get reclaimed unless
the administrator intervenes.

--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events. 2017-07-31 14:08 ` Michal Hocko 2017-07-31 14:37 ` Matthew Wilcox @ 2017-08-01 1:11 ` Liam R. Howlett 2017-08-01 8:29 ` Michal Hocko 1 sibling, 1 reply; 17+ messages in thread From: Liam R. Howlett @ 2017-08-01 1:11 UTC (permalink / raw) To: Michal Hocko Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer, aarcange, oleg, penguin-kernel, mingo, kirill.shutemov, vdavydov.dev, willy * Michal Hocko <mhocko@kernel.org> [170731 10:08]: > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote: > > * Michal Hocko <mhocko@kernel.org> [170731 05:10]: > > > On Fri 28-07-17 21:56:38, Liam R. Howlett wrote: > > > > * Michal Hocko <mhocko@kernel.org> [170728 08:44]: > > > > > On Fri 28-07-17 14:23:50, Michal Hocko wrote: > > > > > > > > Other than that hugetlb pages are not reclaimable by design and users > > > > > > > > do rely on that. Otherwise they could consider using THP instead. > > > > > > > > > > > > > > > > If somebody configures the initial pool too high it is a configuration > > > > > > > > bug. Just think about it, we do not want to reset lowmem reserves > > > > > > > > configured by admin just because we are hitting the oom killer and yes > > > > > > > > insanely large lowmem reserves might lead to early OOM as well. > > > > > > > > The case I raise is a correctly configured system which has a memory > > > > module failure. > > > > > > So you are concerned about MCEs due to failing memory modules? If yes > > > why do you care about hugetlb in particular? > > > > No, I am concerned about a failed memory module. The system will > > detect certain failures, mark the memory as bad and automatically > > reboot. Up on rebooting, that module will not be used. > > How do you detect/configure this? 
We do have HWPoison infrastructure I don't right now but I felt I was at a stage where I would like to RFC to try and have this go smoother. I've not researched this but off hand: dmidecode is able to detect that there is a memory module disabled. This alone would not indicate a failure, but if one was to see a disabled DIMM and an invalid configuration it might be worth pointing out on boot? > > > My focus on hugetlb is that it can stop the automatic recovery of the > > system. > > How? Clarified in the thread fork - Thanks Matthew! > > > Are there other reservations that should also be considered? > > What about any other memory reservations by memmap= kernel command line? I've not seen any other reservation so large that a single failure causes a failed boot due to OOM, but that doesn't mean they should be ignored. > > > > > Modern systems will reboot and remove the memory from > > > > the memory pool. Linux will start to load and run out of memory. I get > > > > that this code has the side effect of doing what you're saying. Do you > > > > see this as a worth while feature and if so, do you know of a better way > > > > for me to trigger the behaviour? > > > > > > I do not understand your question. Could you elaborate more please? Are > > > you talking about system going into OOM because of too many MCEs? > > > > No, I'm talking about failed memory for whatever reason. The system > > reboots by a hardware means (I believe the memory controller) and > > removes the memory on that failed module from the pool. Now you > > effectively have a system with less memory than before which invalidates > > your configuration. Is it worth while to have Linux successfully boot > > when the system attempts to recover from a failure? > > Cetainly yes but if you boot with much less memory and you want to use > hugetlb pages then you have to reconsider and maybe even reconfigure > your workload to reflect new conditions. So I am not really sure this > can be fully automated. 
> I agree. A reconfiguration or repair is required to have optimum performance. Would you agree that having functioning system better than a reboot loop or hang on a panic? It's also easier to reconfigure a system that's booting. > > > > > > > > Nacked-by: Michal Hocko <mhocko@suse.com> > > > > > > > > > > > > > > Hm. I'm not sure it's fully justified. To me, reclaiming hugetlb is > > > > > > > something to be considered as last resort after all other measures have > > > > > > > been tried. > > > > > > > > > > > > System can recover from the OOM killer in most cases and there is no > > > > > > real reason to break contracts which administrator established. On the > > > > > > other hand you cannot assume correct operation of the SW which depends > > > > > > on hugetlb pages in general. Such a SW might get unexpected crashes/data > > > > > > corruptions and what not. > > > > > > > > My question about allowing the reclaim to happen all the time was like > > > > Kirill said, if there's memory that's not being used then why panic (or > > > > kill a task)? I see that Michal has thought this through though. My > > > > intent was to add this as a config option, but it sounds like that's > > > > also a bad plan. > > > > > > You cannot reclaim something that the administrator has asked for to be > > > available. Sure we can reclaim the excess if there is any but that is > > > not what your patch does > > > > I'm looking at the free_huge_pages vs the resv_huge_pages. I thought > > the resv_huge_pages were the free pages that are already requested, so > > if there were more free than reserved then they would be excess? > > The terminology is little be confusing here. Hugetlb memory we have > committed into is reserved (e.g. by mmap) and we surely can have free > pages on top of resv_huge_pages but that is not an excess yet. We can > have surplus pages which would be an excess over what admin configured > initially. 
See Documentation/vm/{hugetlbpage.txt,hugetlbfs_reserv.txt} > for more information. Thank you. I will revisit this error if the patch is considered useful at the end of the RFC conversation. Cheers, Liam -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events. 2017-08-01 1:11 ` Liam R. Howlett @ 2017-08-01 8:29 ` Michal Hocko 2017-08-01 14:41 ` Liam R. Howlett 0 siblings, 1 reply; 17+ messages in thread From: Michal Hocko @ 2017-08-01 8:29 UTC (permalink / raw) To: Liam R. Howlett Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer, aarcange, oleg, penguin-kernel, mingo, kirill.shutemov, vdavydov.dev, willy On Mon 31-07-17 21:11:25, Liam R. Howlett wrote: > * Michal Hocko <mhocko@kernel.org> [170731 10:08]: > > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote: [...] > > > No, I'm talking about failed memory for whatever reason. The system > > > reboots by a hardware means (I believe the memory controller) and > > > removes the memory on that failed module from the pool. Now you > > > effectively have a system with less memory than before which invalidates > > > your configuration. Is it worth while to have Linux successfully boot > > > when the system attempts to recover from a failure? > > > > Cetainly yes but if you boot with much less memory and you want to use > > hugetlb pages then you have to reconsider and maybe even reconfigure > > your workload to reflect new conditions. So I am not really sure this > > can be fully automated. > > > > I agree. A reconfiguration or repair is required to have optimum > performance. Would you agree that having functioning system better than > a reboot loop or hang on a panic? It's also easier to reconfigure a > system that's booting. Absolutely. The thing is that I am not even sure that the hugetlb problem is real. Using hugetlb reservation from the boot command line parameter is easily fixable (just update the boot comand line from the boot loader). 
From my experience the init time hugetlb initialization is usually trying to be portable and as such configures a certain percentage of the available memory for hugetlb (some of them even on per NUMA node basis). Even if somebody uses hard coded values then this is something that is fixable during recovery. That being said I am not sure you are focusing on a real problem while the solution you are proposing might break an existing userspace. Please try to play with your memory recovery feature some more with real hugetlb usecases (Oracle DB is a heavy user AFAIR) and see what the real life problems might happen and we can revisit potential solutions with more data in hands. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
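The per-NUMA-node configuration Michal describes corresponds to the per-node sysfs knobs. A hedged sketch (the sysfs layout shown is the standard one for 2 MiB pages; the helper name and the commented loop are illustrative, and writing the knob requires root):

```shell
# Build the sysfs path of the per-node hugepage pool knob for a given
# node directory. Writing a page count to this file resizes that node's
# pool at runtime, which is what a portable init script would do.
node_hugepage_knob() {
    echo "$1/hugepages/hugepages-2048kB/nr_hugepages"
}

# Walking the nodes in an init script might look like:
#   for node in /sys/devices/system/node/node[0-9]*; do
#       echo "$pages_per_node" > "$(node_hugepage_knob "$node")"
#   done
node_hugepage_knob /sys/devices/system/node/node0
```

Because the script only sees the nodes that actually came up, a node lost to a failed module is skipped automatically instead of being over-committed.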
* Re: [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events. 2017-08-01 8:29 ` Michal Hocko @ 2017-08-01 14:41 ` Liam R. Howlett 0 siblings, 0 replies; 17+ messages in thread From: Liam R. Howlett @ 2017-08-01 14:41 UTC (permalink / raw) To: Michal Hocko Cc: Kirill A. Shutemov, linux-mm, akpm, n-horiguchi, mike.kravetz, aneesh.kumar, khandual, punit.agrawal, arnd, gerald.schaefer, aarcange, oleg, penguin-kernel, mingo, kirill.shutemov, vdavydov.dev, willy * Michal Hocko <mhocko@kernel.org> [170801 04:30]: > On Mon 31-07-17 21:11:25, Liam R. Howlett wrote: > > * Michal Hocko <mhocko@kernel.org> [170731 10:08]: > > > On Mon 31-07-17 09:56:48, Liam R. Howlett wrote: > [...] > > > > No, I'm talking about failed memory for whatever reason. The system > > > > reboots by a hardware means (I believe the memory controller) and > > > > removes the memory on that failed module from the pool. Now you > > > > effectively have a system with less memory than before which invalidates > > > > your configuration. Is it worth while to have Linux successfully boot > > > > when the system attempts to recover from a failure? > > > > > > Cetainly yes but if you boot with much less memory and you want to use > > > hugetlb pages then you have to reconsider and maybe even reconfigure > > > your workload to reflect new conditions. So I am not really sure this > > > can be fully automated. > > > > > > > I agree. A reconfiguration or repair is required to have optimum > > performance. Would you agree that having functioning system better than > > a reboot loop or hang on a panic? It's also easier to reconfigure a > > system that's booting. > > Absolutely. The thing is that I am not even sure that the hugetlb > problem is real. Using hugetlb reservation from the boot command line > parameter is easily fixable (just update the boot comand line from the > boot loader). 
From my experience the init time hugetlb initialization > is usually trying to be portable and as such configures a certain > percentage of the available memory for hugetlb (some of them even on per > NUMA node basis). Even if somebody uses hard coded values then this is > something that is fixable during recovery. This was my thought when I was first assigned the bug for my last patch for adding the log message of the hugetlb allocation failure but during our discussion I was assigned two more near-identical bugs. From what I can tell the people following a setup guide do not know how to edit the grub command line easily once in a boot loop or don't have a decent enough console setup to do so. Worse yet, all three of the bugs were filed as kernel bugs because people didn't even realise it was a setup issue. I think the sysctl way of setting the hugetlb is the safest. But since we provide a kernel command line way of setting the hugetlb, it seems reasonable to make the user error as transparent as possible. This RFC was an extension of looking at how people arrive at an OOM error on boot when using hugetlb. > > That being said I am not sure you are focusing on a real problem while > the solution you are proposing might break an existing userspace. Please > try to play with your memory recovery feature some more with real > hugetlb usecases (Oracle DB is a heavy user AFAIR) and see what the real > life problems might happen and we can revisit potential solutions with > more data in hands. Okay, thank you. I will re-examine the issue and see about a different approach. I appreciate the time you have taken to look at my RFC. Thanks, Liam -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
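The "sysctl way" Liam calls the safest can be sketched as a persistent drop-in that is applied after early boot, so a shortfall leaves the pool smaller instead of OOMing the way `hugepages=` on the kernel command line can. The file name and value here are invented examples; this demo writes to /tmp because installing under /etc/sysctl.d requires root.

```shell
# Generate an example sysctl drop-in for the hugepage pool. On a real
# system this file would live at /etc/sysctl.d/90-hugepages.conf and be
# applied at boot (or manually with `sysctl --system`).
conf=/tmp/90-hugepages.conf
printf 'vm.nr_hugepages = 15872\n' > "$conf"
cat "$conf"
```

If the runtime allocation cannot be satisfied, nr_hugepages simply ends up lower than requested, which is the graceful degradation this thread is after.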
end of thread, other threads:[~2017-08-01 14:42 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-07-27 18:02 [RFC PATCH 0/1] oom support for reclaiming of hugepages Liam R. Howlett 2017-07-27 18:02 ` [RFC PATCH 1/1] mm/hugetlb mm/oom_kill: Add support for reclaiming hugepages on OOM events Liam R. Howlett 2017-07-28 6:46 ` Michal Hocko 2017-07-28 11:33 ` Kirill A. Shutemov 2017-07-28 12:23 ` Michal Hocko 2017-07-28 12:44 ` Michal Hocko 2017-07-29 1:56 ` Liam R. Howlett 2017-07-31 9:10 ` Michal Hocko 2017-07-31 13:56 ` Liam R. Howlett 2017-07-31 14:08 ` Michal Hocko 2017-07-31 14:37 ` Matthew Wilcox 2017-07-31 14:49 ` Michal Hocko 2017-08-01 1:25 ` Liam R. Howlett 2017-08-01 8:28 ` Michal Hocko 2017-08-01 1:11 ` Liam R. Howlett 2017-08-01 8:29 ` Michal Hocko 2017-08-01 14:41 ` Liam R. Howlett