* [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
@ 2020-09-10 20:47 Vijay Balakrishna
  2020-09-10 22:01 ` Kirill A. Shutemov
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Vijay Balakrishna @ 2020-09-10 20:47 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Vijay Balakrishna, Allen Pais
  Cc: linux-kernel, linux-mm

When memory is hot-added or hot-removed, min_free_kbytes must be
recalculated based on what khugepaged expects.  Currently, after
hotplug, min_free_kbytes is reset to the lower default, and the higher
value set when THP was enabled is lost.  This leaves the system with a
small min_free_kbytes, which is unsuitable especially for systems with
network-intensive loads.  Typical failure symptoms include HW WATCHDOG
resets, soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM
process kills.

Fixes: f000565adb77 ("thp: set recommended min free kbytes")

Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Cc: stable@vger.kernel.org
---
 include/linux/khugepaged.h |  5 +++++
 mm/khugepaged.c            | 13 +++++++++++--
 mm/memory_hotplug.c        |  3 +++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index bc45ea1efbf7..c941b7377321 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,6 +15,7 @@ extern int __khugepaged_enter(struct mm_struct *mm);
 extern void __khugepaged_exit(struct mm_struct *mm);
 extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 				      unsigned long vm_flags);
+extern void khugepaged_min_free_kbytes_update(void);
 #ifdef CONFIG_SHMEM
 extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
 #else
@@ -85,6 +86,10 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
 					   unsigned long addr)
 {
 }
+
+static inline void khugepaged_min_free_kbytes_update(void)
+{
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_KHUGEPAGED_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cfa0dba5fd3b..4f7107476a6f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -56,6 +56,9 @@ enum scan_result {
 #define CREATE_TRACE_POINTS
 #include <trace/events/huge_memory.h>
 
+static struct task_struct *khugepaged_thread __read_mostly;
+static DEFINE_MUTEX(khugepaged_mutex);
+
 /* default scan 8*512 pte (or vmas) every 30 second */
 static unsigned int khugepaged_pages_to_scan __read_mostly;
 static unsigned int khugepaged_pages_collapsed;
@@ -2292,8 +2295,6 @@ static void set_recommended_min_free_kbytes(void)
 
 int start_stop_khugepaged(void)
 {
-	static struct task_struct *khugepaged_thread __read_mostly;
-	static DEFINE_MUTEX(khugepaged_mutex);
 	int err = 0;
 
 	mutex_lock(&khugepaged_mutex);
@@ -2320,3 +2321,11 @@ int start_stop_khugepaged(void)
 	mutex_unlock(&khugepaged_mutex);
 	return err;
 }
+
+void khugepaged_min_free_kbytes_update(void)
+{
+	mutex_lock(&khugepaged_mutex);
+	if (khugepaged_enabled() && khugepaged_thread)
+		set_recommended_min_free_kbytes();
+	mutex_unlock(&khugepaged_mutex);
+}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e9d5ab5d3ca0..3e19272c1fad 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -36,6 +36,7 @@
 #include <linux/memblock.h>
 #include <linux/compaction.h>
 #include <linux/rmap.h>
+#include <linux/khugepaged.h>
 
 #include <asm/tlbflush.h>
 
@@ -857,6 +858,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	zone_pcp_update(zone);
 
 	init_per_zone_wmark_min();
+	khugepaged_min_free_kbytes_update();
 
 	kswapd_run(nid);
 	kcompactd_run(nid);
@@ -1600,6 +1602,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
 	init_per_zone_wmark_min();
+	khugepaged_min_free_kbytes_update();
 
 	if (!populated_zone(zone)) {
 		zone_pcp_reset(zone);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-10 20:47 [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged Vijay Balakrishna
@ 2020-09-10 22:01 ` Kirill A. Shutemov
  2020-09-10 22:28     ` Pavel Tatashin
  2020-09-15  5:04   ` Vijay Balakrishna
  2020-09-14 14:33 ` Michal Hocko
  2020-09-15 18:22   ` Pavel Tatashin
  2 siblings, 2 replies; 19+ messages in thread
From: Kirill A. Shutemov @ 2020-09-10 22:01 UTC (permalink / raw)
  To: Vijay Balakrishna
  Cc: Andrew Morton, Oleg Nesterov, Song Liu, Andrea Arcangeli,
	Pavel Tatashin, Allen Pais, linux-kernel, linux-mm

On Thu, Sep 10, 2020 at 01:47:39PM -0700, Vijay Balakrishna wrote:
> When memory is hotplug added or removed the min_free_kbytes must be
> recalculated based on what is expected by khugepaged.  Currently
> after hotplug, min_free_kbytes will be set to a lower default and higher
> default set when THP enabled is lost. This leaves the system with small
> min_free_kbytes which isn't suitable for systems especially with network
> intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> kills.
> 
> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
> 
> Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
> Cc: stable@vger.kernel.org

NAK. It would override min_free_kbytes set by user.
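
To make the concern concrete, here is a minimal userspace model, not kernel
code.  It assumes (based on mainline around v5.4) that a sysctl write records
the value as user_min_free_kbytes, that init_per_zone_wmark_min() keeps a user
value that is at least as large as the computed default, and that the proposed
update then raises min_free_kbytes to the THP recommendation.  The struct and
function names are hypothetical; 8703 and 22528 are the figures reported
later in this thread, and 10000 is an arbitrary user setting.

```c
/* Hypothetical model of the two globals involved; not the kernel's types. */
struct vm_model {
	long min_free_kbytes;
	long user_min_free_kbytes;	/* -1 means "never written by the user" */
};

/* Models a write to /proc/sys/vm/min_free_kbytes. */
static void model_user_write(struct vm_model *vm, long kbytes)
{
	vm->min_free_kbytes = kbytes;
	vm->user_min_free_kbytes = kbytes;
}

/*
 * Models one hotplug event.
 * computed_default: what init_per_zone_wmark_min() would compute.
 * thp_recommended:  what set_recommended_min_free_kbytes() would compute.
 * with_patch:       non-zero to include the proposed khugepaged update.
 */
static void model_hotplug(struct vm_model *vm, long computed_default,
			  long thp_recommended, int with_patch)
{
	/* The computed default only wins if it exceeds the user's value. */
	if (computed_default > vm->user_min_free_kbytes)
		vm->min_free_kbytes = computed_default;

	/* Assumed behavior of the proposed update: raise, never lower. */
	if (with_patch && thp_recommended > vm->min_free_kbytes)
		vm->min_free_kbytes = thp_recommended;
}
```

In this model, a user value of 10000 survives a hotplug event without the
patch, but ends up at 22528 with it, which is the override being objected to.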

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-10 22:01 ` Kirill A. Shutemov
@ 2020-09-10 22:28     ` Pavel Tatashin
  2020-09-15  5:04   ` Vijay Balakrishna
  1 sibling, 0 replies; 19+ messages in thread
From: Pavel Tatashin @ 2020-09-10 22:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Vijay Balakrishna, Andrew Morton, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Allen Pais, LKML, linux-mm

Hi Kirill,

On Thu, Sep 10, 2020 at 6:01 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Thu, Sep 10, 2020 at 01:47:39PM -0700, Vijay Balakrishna wrote:
> > When memory is hotplug added or removed the min_free_kbytes must be
> > recalculated based on what is expected by khugepaged.  Currently
> > after hotplug, min_free_kbytes will be set to a lower default and higher
> > default set when THP enabled is lost. This leaves the system with small
> > min_free_kbytes which isn't suitable for systems especially with network
> > intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
> > soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> > kills.
> >
> > Fixes: f000565adb77 ("thp: set recommended min free kbytes")
> >
> > Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
> > Cc: stable@vger.kernel.org
>
> NAK. It would override min_free_kbytes set by user.

Hi Kirill,

Thank you for looking into this. How is this different from when
khugepaged modifies it?

echo always >/sys/kernel/mm/transparent_hugepage/enabled

Which results in:

start_stop_khugepaged
  set_recommended_min_free_kbytes

Which will also adjust min_free_kbytes according to khugepaged's requirements?
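
As a rough sketch of what set_recommended_min_free_kbytes() computes, assuming
the mainline logic of that era: two free pageblocks per populated zone, plus
MIGRATE_PCPTYPES squared pageblocks per zone, capped at 5% of lowmem, and
converted from pages to kilobytes.  The function name, parameter names, and
the MODEL_ prefix are hypothetical; this is a userspace illustration, not the
kernel function.

```c
/* Assumed value of MIGRATE_PCPTYPES (unmovable, movable, reclaimable). */
#define MODEL_MIGRATE_PCPTYPES 3UL

/*
 * nr_zones:           number of populated zones
 * pageblock_nr_pages: pages per pageblock (512 with 4 KiB pages and 2 MiB THP)
 * lowmem_pages:       stand-in for nr_free_buffer_pages()
 * page_shift:         log2 of the page size (12 for 4 KiB pages)
 */
static unsigned long thp_recommended_min_kbytes(unsigned long nr_zones,
						unsigned long pageblock_nr_pages,
						unsigned long lowmem_pages,
						unsigned int page_shift)
{
	unsigned long recommended_min;
	unsigned long cap = lowmem_pages / 20;	/* never more than 5% of lowmem */

	/* Two free pageblocks per zone to help avoid fragmentation. */
	recommended_min = pageblock_nr_pages * nr_zones * 2;

	/* Extra pageblocks so migratetype fallbacks stay rare. */
	recommended_min += pageblock_nr_pages * nr_zones *
			   MODEL_MIGRATE_PCPTYPES * MODEL_MIGRATE_PCPTYPES;

	if (recommended_min > cap)
		recommended_min = cap;

	/* Convert pages to kilobytes. */
	return recommended_min << (page_shift - 10);
}
```

Under these assumptions, one populated zone with 512-page pageblocks and 4 KiB
pages gives (2 + 9) * 512 pages = 22528 kB.  That happens to match the value
Vijay reports elsewhere in this thread, although the actual zone layout of
that system is not stated.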

This bug that Vijay described is another hot-plug related issue that
we have found on our system where we perform memory hot add and hot
remove on every reboot.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-10 22:28     ` Pavel Tatashin
  (?)
@ 2020-09-10 22:56     ` Vijay Balakrishna
  -1 siblings, 0 replies; 19+ messages in thread
From: Vijay Balakrishna @ 2020-09-10 22:56 UTC (permalink / raw)
  To: Pavel Tatashin, Kirill A. Shutemov
  Cc: Andrew Morton, Oleg Nesterov, Song Liu, Andrea Arcangeli,
	Allen Pais, LKML, linux-mm



On 9/10/2020 3:28 PM, Pavel Tatashin wrote:
> Hi Kirill,
> 
> On Thu, Sep 10, 2020 at 6:01 PM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>>
>> On Thu, Sep 10, 2020 at 01:47:39PM -0700, Vijay Balakrishna wrote:
>>> When memory is hotplug added or removed the min_free_kbytes must be
>>> recalculated based on what is expected by khugepaged.  Currently
>>> after hotplug, min_free_kbytes will be set to a lower default and higher
>>> default set when THP enabled is lost. This leaves the system with small
>>> min_free_kbytes which isn't suitable for systems especially with network
>>> intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
>>> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
>>> kills.
>>>
>>> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>>>
>>> Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
>>> Cc: stable@vger.kernel.org
>>
>> NAK. It would override min_free_kbytes set by user.
> 
> Hi Kirill,
> 
> Thank you for looking into this. How is this different from when
> khugepaged modifies it?
> 
> echo always >/sys/kernel/mm/transparent_hugepage/enabled
> 
> Which results in:
> 
> start_stop_khugepaged
>    set_recommended_min_free_kbytes
> 
> Which will also adjust min_free_kbytes according to hugepaged requirement?
> 
> This bug that Vijay described is another hot-plug related issue that
> we have found on our system where we perform memory hot add and hot
> remove on every reboot.
> 
> Thank you,
> Pasha
> 

Thanks Kirill for taking a look and spotting it.

IIUC, it is an existing issue; we should fix it while we are at it.
Otherwise my proposed patch introduces a regression for hotplug
memory consumers with THP enabled and min_free_kbytes set by the user
higher than the value calculated by THP.

Vijay

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-10 20:47 [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged Vijay Balakrishna
  2020-09-10 22:01 ` Kirill A. Shutemov
@ 2020-09-14 14:33 ` Michal Hocko
  2020-09-14 16:57   ` Vijay Balakrishna
  2020-09-15 18:22   ` Pavel Tatashin
  2 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2020-09-14 14:33 UTC (permalink / raw)
  To: Vijay Balakrishna
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais, linux-kernel,
	linux-mm

On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
> When memory is hotplug added or removed the min_free_kbytes must be
> recalculated based on what is expected by khugepaged.  Currently
> after hotplug, min_free_kbytes will be set to a lower default and higher
> default set when THP enabled is lost. This leaves the system with small
> min_free_kbytes which isn't suitable for systems especially with network
> intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> kills.

Care to explain some more please? The whole point of increasing
min_free_kbytes for THP is to get a larger free memory with a hope that
huge pages will be more likely to appear. While this might help for
other users that need a high order pages it is definitely not the
primary reason behind it. Could you provide an example with some more
data?

I do see how the inconsistency between boot and hotplug watermarks
setting is not ideal but I do worry about interaction with the user
specific values as a potential problem. set_recommended_min_free_kbytes
happens early enough that user space cannot really interfere but the
hotplug happens at any time.
 
> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
> 
> Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
> Cc: stable@vger.kernel.org
> ---
>  include/linux/khugepaged.h |  5 +++++
>  mm/khugepaged.c            | 13 +++++++++++--
>  mm/memory_hotplug.c        |  3 +++
>  3 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index bc45ea1efbf7..c941b7377321 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -15,6 +15,7 @@ extern int __khugepaged_enter(struct mm_struct *mm);
>  extern void __khugepaged_exit(struct mm_struct *mm);
>  extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
>  				      unsigned long vm_flags);
> +extern void khugepaged_min_free_kbytes_update(void);
>  #ifdef CONFIG_SHMEM
>  extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
>  #else
> @@ -85,6 +86,10 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
>  					   unsigned long addr)
>  {
>  }
> +
> +static inline void khugepaged_min_free_kbytes_update(void)
> +{
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  #endif /* _LINUX_KHUGEPAGED_H */
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index cfa0dba5fd3b..4f7107476a6f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -56,6 +56,9 @@ enum scan_result {
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/huge_memory.h>
>  
> +static struct task_struct *khugepaged_thread __read_mostly;
> +static DEFINE_MUTEX(khugepaged_mutex);
> +
>  /* default scan 8*512 pte (or vmas) every 30 second */
>  static unsigned int khugepaged_pages_to_scan __read_mostly;
>  static unsigned int khugepaged_pages_collapsed;
> @@ -2292,8 +2295,6 @@ static void set_recommended_min_free_kbytes(void)
>  
>  int start_stop_khugepaged(void)
>  {
> -	static struct task_struct *khugepaged_thread __read_mostly;
> -	static DEFINE_MUTEX(khugepaged_mutex);
>  	int err = 0;
>  
>  	mutex_lock(&khugepaged_mutex);
> @@ -2320,3 +2321,11 @@ int start_stop_khugepaged(void)
>  	mutex_unlock(&khugepaged_mutex);
>  	return err;
>  }
> +
> +void khugepaged_min_free_kbytes_update(void)
> +{
> +	mutex_lock(&khugepaged_mutex);
> +	if (khugepaged_enabled() && khugepaged_thread)
> +		set_recommended_min_free_kbytes();
> +	mutex_unlock(&khugepaged_mutex);
> +}
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9d5ab5d3ca0..3e19272c1fad 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -36,6 +36,7 @@
>  #include <linux/memblock.h>
>  #include <linux/compaction.h>
>  #include <linux/rmap.h>
> +#include <linux/khugepaged.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -857,6 +858,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  	zone_pcp_update(zone);
>  
>  	init_per_zone_wmark_min();
> +	khugepaged_min_free_kbytes_update();
>  
>  	kswapd_run(nid);
>  	kcompactd_run(nid);
> @@ -1600,6 +1602,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>  
>  	init_per_zone_wmark_min();
> +	khugepaged_min_free_kbytes_update();
>  
>  	if (!populated_zone(zone)) {
>  		zone_pcp_reset(zone);
> -- 
> 2.28.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-14 14:33 ` Michal Hocko
@ 2020-09-14 16:57   ` Vijay Balakrishna
  2020-09-15  8:18     ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Vijay Balakrishna @ 2020-09-14 16:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais, linux-kernel,
	linux-mm, Vijay Balakrishna



On 9/14/2020 7:33 AM, Michal Hocko wrote:
> On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
>> When memory is hotplug added or removed the min_free_kbytes must be
>> recalculated based on what is expected by khugepaged.  Currently
>> after hotplug, min_free_kbytes will be set to a lower default and higher
>> default set when THP enabled is lost. This leaves the system with small
>> min_free_kbytes which isn't suitable for systems especially with network
>> intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
>> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
>> kills.
> 
> Care to explain some more please? The whole point of increasing
> min_free_kbytes for THP is to get a larger free memory with a hope that
> huge pages will be more likely to appear. While this might help for
> other users that need a high order pages it is definitely not the
> primary reason behind it. Could you provide an example with some more
> data?

Thanks Michal.  I haven't looked into THP as part of my investigation, 
so I cannot comment.

In our use case we hot-remove ~2GB of the 8GB total (on our SoC)
during normal reboot/shutdown.  This memory is hot-added as movable
type by a late systemd service during start-of-day.

In our stress test we first ran into HW WATCHDOG recovery.  On enabling
the kernel watchdog we started seeing soft lockup and hung task notices.
Failure symptoms varied: stack traces of hung tasks sometimes showed
them trying to allocate GFP_ATOMIC memory or looping in
do_notify_resume, and we also saw NETDEVICE WATCHDOG timeouts, OOM
process kills, etc.  During the investigation we reran the stress test
without the hotplug use case.  Surprisingly, this run didn't encounter
the said problems.  This led to comparing what differs between the two
runs.  While looking at various globals and studying the hotplug code,
I uncovered the issue of failing to restore min_free_kbytes.  In
particular, on our 8GB SoC min_free_kbytes went down to 8703 from 22528
after hotplug add.
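
The drop can be illustrated with a simplified userspace model of the default
calculation.  It assumes the mainline formula of that era,
int_sqrt(lowmem_kbytes * 16) clamped to [128, 65536], and treats the THP
recommendation as a given input.  All names are hypothetical, and the 4 GiB
lowmem figure is only a round illustrative number, not a measurement from
this system.

```c
/* Floor integer square root; stands in for the kernel's int_sqrt(). */
static unsigned long long isqrt_ull(unsigned long long x)
{
	unsigned long long r = 0;
	unsigned long long bit = 1ULL << 62;

	while (bit > x)
		bit >>= 2;
	while (bit) {
		if (x >= r + bit) {
			x -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/* Assumed default: sqrt(lowmem_kbytes * 16), clamped to [128, 65536]. */
static long default_min_free_kbytes(unsigned long long lowmem_kbytes)
{
	long v = (long)isqrt_ull(lowmem_kbytes * 16);

	if (v < 128)
		v = 128;
	if (v > 65536)
		v = 65536;
	return v;
}

struct vm_model {
	long min_free_kbytes;
	long user_min_free_kbytes;	/* -1 means "never written by the user" */
};

/* Models init_per_zone_wmark_min(), which runs at boot and on hotplug. */
static void model_boot_or_hotplug(struct vm_model *vm,
				  unsigned long long lowmem_kbytes)
{
	long new_min = default_min_free_kbytes(lowmem_kbytes);

	if (new_min > vm->user_min_free_kbytes)
		vm->min_free_kbytes = new_min;
}

/* Models the THP adjustment: raise to the recommendation, never lower. */
static void model_thp_enable(struct vm_model *vm, long thp_recommended)
{
	if (thp_recommended > vm->min_free_kbytes)
		vm->min_free_kbytes = thp_recommended;
}
```

With 4 GiB of lowmem the model yields a default of 8192; enabling THP raises
it to 22528, and a later hotplug event drops it back to 8192 because nothing
reapplies the THP recommendation.  This has the same shape as the
22528 to 8703 drop described above.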

I'm going to send a new patch to fix the issue Kirill A. Shutemov raised:
NAK. It would override min_free_kbytes set by user.

Vijay


> 
> I do see how the inconsistency between boot and hotplug watermarks
> setting is not ideal but I do worry about interaction with the user
> specific values as a potential problem. set_recommended_min_free_kbytes
> happens early enough that user space cannot really interfere but the
> hotplug happens at any time.
>   
>> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>>
>> Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
>> Cc: stable@vger.kernel.org
>> ---
>>   include/linux/khugepaged.h |  5 +++++
>>   mm/khugepaged.c            | 13 +++++++++++--
>>   mm/memory_hotplug.c        |  3 +++
>>   3 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
>> index bc45ea1efbf7..c941b7377321 100644
>> --- a/include/linux/khugepaged.h
>> +++ b/include/linux/khugepaged.h
>> @@ -15,6 +15,7 @@ extern int __khugepaged_enter(struct mm_struct *mm);
>>   extern void __khugepaged_exit(struct mm_struct *mm);
>>   extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
>>   				      unsigned long vm_flags);
>> +extern void khugepaged_min_free_kbytes_update(void);
>>   #ifdef CONFIG_SHMEM
>>   extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
>>   #else
>> @@ -85,6 +86,10 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
>>   					   unsigned long addr)
>>   {
>>   }
>> +
>> +static inline void khugepaged_min_free_kbytes_update(void)
>> +{
>> +}
>>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>   
>>   #endif /* _LINUX_KHUGEPAGED_H */
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index cfa0dba5fd3b..4f7107476a6f 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -56,6 +56,9 @@ enum scan_result {
>>   #define CREATE_TRACE_POINTS
>>   #include <trace/events/huge_memory.h>
>>   
>> +static struct task_struct *khugepaged_thread __read_mostly;
>> +static DEFINE_MUTEX(khugepaged_mutex);
>> +
>>   /* default scan 8*512 pte (or vmas) every 30 second */
>>   static unsigned int khugepaged_pages_to_scan __read_mostly;
>>   static unsigned int khugepaged_pages_collapsed;
>> @@ -2292,8 +2295,6 @@ static void set_recommended_min_free_kbytes(void)
>>   
>>   int start_stop_khugepaged(void)
>>   {
>> -	static struct task_struct *khugepaged_thread __read_mostly;
>> -	static DEFINE_MUTEX(khugepaged_mutex);
>>   	int err = 0;
>>   
>>   	mutex_lock(&khugepaged_mutex);
>> @@ -2320,3 +2321,11 @@ int start_stop_khugepaged(void)
>>   	mutex_unlock(&khugepaged_mutex);
>>   	return err;
>>   }
>> +
>> +void khugepaged_min_free_kbytes_update(void)
>> +{
>> +	mutex_lock(&khugepaged_mutex);
>> +	if (khugepaged_enabled() && khugepaged_thread)
>> +		set_recommended_min_free_kbytes();
>> +	mutex_unlock(&khugepaged_mutex);
>> +}
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index e9d5ab5d3ca0..3e19272c1fad 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -36,6 +36,7 @@
>>   #include <linux/memblock.h>
>>   #include <linux/compaction.h>
>>   #include <linux/rmap.h>
>> +#include <linux/khugepaged.h>
>>   
>>   #include <asm/tlbflush.h>
>>   
>> @@ -857,6 +858,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>>   	zone_pcp_update(zone);
>>   
>>   	init_per_zone_wmark_min();
>> +	khugepaged_min_free_kbytes_update();
>>   
>>   	kswapd_run(nid);
>>   	kcompactd_run(nid);
>> @@ -1600,6 +1602,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>>   	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>>   
>>   	init_per_zone_wmark_min();
>> +	khugepaged_min_free_kbytes_update();
>>   
>>   	if (!populated_zone(zone)) {
>>   		zone_pcp_reset(zone);
>> -- 
>> 2.28.0
>>
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-10 22:01 ` Kirill A. Shutemov
  2020-09-10 22:28     ` Pavel Tatashin
@ 2020-09-15  5:04   ` Vijay Balakrishna
  1 sibling, 0 replies; 19+ messages in thread
From: Vijay Balakrishna @ 2020-09-15  5:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Oleg Nesterov, Song Liu, Andrea Arcangeli,
	Pavel Tatashin, Allen Pais, linux-kernel, linux-mm



On 9/10/2020 3:01 PM, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2020 at 01:47:39PM -0700, Vijay Balakrishna wrote:
>> When memory is hotplug added or removed the min_free_kbytes must be
>> recalculated based on what is expected by khugepaged.  Currently
>> after hotplug, min_free_kbytes will be set to a lower default and higher
>> default set when THP enabled is lost. This leaves the system with small
>> min_free_kbytes which isn't suitable for systems especially with network
>> intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
>> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
>> kills.
>>
>> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>>
>> Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
>> Cc: stable@vger.kernel.org
> 
> NAK. It would override min_free_kbytes set by user.

Hi Kirill,

To fix the issue you raised I just submitted
https://lore.kernel.org/lkml/1600145748-26518-1-git-send-email-vijayb@linux.microsoft.com/

Thanks,
Vijay

> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-14 16:57   ` Vijay Balakrishna
@ 2020-09-15  8:18     ` Michal Hocko
  2020-09-15 15:48       ` Vijay Balakrishna
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2020-09-15  8:18 UTC (permalink / raw)
  To: Vijay Balakrishna
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais, linux-kernel,
	linux-mm

On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:
> 
> 
> On 9/14/2020 7:33 AM, Michal Hocko wrote:
> > On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
> > > When memory is hotplug added or removed the min_free_kbytes must be
> > > recalculated based on what is expected by khugepaged.  Currently
> > > after hotplug, min_free_kbytes will be set to a lower default and higher
> > > default set when THP enabled is lost. This leaves the system with small
> > > min_free_kbytes which isn't suitable for systems especially with network
> > > intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
> > > soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> > > kills.
> > 
> > Care to explain some more please? The whole point of increasing
> > min_free_kbytes for THP is to get a larger free memory with a hope that
> > huge pages will be more likely to appear. While this might help for
> > other users that need a high order pages it is definitely not the
> > primary reason behind it. Could you provide an example with some more
> > data?
> 
> Thanks Michal.  I haven't looked into THP as part of my investigation, so I
> cannot comment.
> 
> In our use case we are hotplug removing ~2GB of 8GB total (on our SoC)
> during normal reboot/shutdown.  This memory is hotplug hot-added as movable
> type via systemd late service during start-of-day.
> 
> In our stress test first we ran into HW WATCHDOG recovery, on enabling
> kernel watchdog we started seeing soft lockup hung task notices, failure
> > symptoms varied, where stack trace of hung tasks sometimes trying to
> allocate GFP_ATOMIC memory, looping in do_notify_resume, NETDEVICE WATCHDOG
> timeouts, OOM process kills etc.,  During investigation we reran stress test
> without hotplug use case.  Surprisingly this run didn't encounter the said
> problems.  This led to comparing what is different between the two runs,
> while looking at various globals, studying hotplug code I uncovered the
> issue of failing to restore min_free_kbytes.  In particular on our 8GB SoC
> min_free_kbytes went down to 8703 from 22528 after hotplug add.

Did you try to increase min_free_kbytes manually after hot remove? Btw.
I would consider oom killer invocation due to min_free_kbytes really
weird behavior. If anything the higher value would cause more memory
reclaim and potentially oom rather than smaller one.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-15  8:18     ` Michal Hocko
@ 2020-09-15 15:48       ` Vijay Balakrishna
  2020-09-16  6:53         ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Vijay Balakrishna @ 2020-09-15 15:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais, linux-kernel,
	linux-mm



On 9/15/2020 1:18 AM, Michal Hocko wrote:
> On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:
>>
>>
>> On 9/14/2020 7:33 AM, Michal Hocko wrote:
>>> On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
>>>> When memory is hotplug added or removed the min_free_kbytes must be
>>>> recalculated based on what is expected by khugepaged.  Currently
>>>> after hotplug, min_free_kbytes will be set to a lower default and higher
>>>> default set when THP enabled is lost. This leaves the system with small
>>>> min_free_kbytes which isn't suitable for systems especially with network
>>>> intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
>>>> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
>>>> kills.
>>>
>>> Care to explain some more please? The whole point of increasing
>>> min_free_kbytes for THP is to get a larger free memory with a hope that
>>> huge pages will be more likely to appear. While this might help for
>>> other users that need a high order pages it is definitely not the
>>> primary reason behind it. Could you provide an example with some more
>>> data?
>>
>> Thanks Michal.  I haven't looked into THP as part of my investigation, so I
>> cannot comment.
>>
>> In our use case we are hotplug removing ~2GB of 8GB total (on our SoC)
>> during normal reboot/shutdown.  This memory is hotplug hot-added as movable
>> type via systemd late service during start-of-day.
>>
>> In our stress test first we ran into HW WATCHDOG recovery, on enabling
>> kernel watchdog we started seeing soft lockup hung task notices, failure
>> symptoms varied, where stack trace of hung tasks sometimes trying to
>> allocate GFP_ATOMIC memory, looping in do_notify_resume, NETDEVICE WATCHDOG
>> timeouts, OOM process kills etc.,  During investigation we reran stress test
>> without hotplug use case.  Surprisingly this run didn't encounter the said
>> problems.  This led to comparing what is different between the two runs,
>> while looking at various globals, studying hotplug code I uncovered the
>> issue of failing to restore min_free_kbytes.  In particular on our 8GB SoC
>> min_free_kbytes went down to 8703 from 22528 after hotplug add.
> 
> Did you try to increase min_free_kbytes manually after hot remove? Btw.

No, in our use case memory hot remove is done during shutdown.

> I would consider oom killer invocation due to min_free_kbytes really
> weird behavior. If anything the higher value would cause more memory
> reclaim and potentially oom rather than smaller one.

Yes, we wondered about it too.  Here is one panic stack trace (after many
OOM kills):

[330321.174240] Out of memory and no killable processes...
[330321.179658] Kernel panic - not syncing: System is deadlocked on memory
[330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G        O      5.4.51-xxx #1
[330321.196900] Hardware name: Overlake (DT)
[330321.201038] Call trace:
[330321.203660]  dump_backtrace+0x0/0x1d0
[330321.207533]  show_stack+0x20/0x2c
[330321.211048]  dump_stack+0xe8/0x150
[330321.214656]  panic+0x18c/0x3b4
[330321.217901]  out_of_memory+0x4c0/0x6e4
[330321.221863]  __alloc_pages_nodemask+0xbdc/0x1c90
[330321.226722]  alloc_pages_current+0x21c/0x2b0
[330321.231220]  alloc_slab_page+0x1e0/0x7d8
[330321.235361]  new_slab+0x2e8/0x2f8
[330321.238874]  ___slab_alloc+0x45c/0x59c
[330321.242835]  kmem_cache_alloc+0x2d4/0x360
[330321.247065]  getname_flags+0x6c/0x2a8
[330321.250938]  user_path_at_empty+0x3c/0x68
[330321.255168]  do_readlinkat+0x7c/0x17c
[330321.259039]  __arm64_sys_readlinkat+0x5c/0x70
[330321.263627]  el0_svc_handler+0x1b8/0x32c
[330321.267767]  el0_svc+0x10/0x14
[330321.271026] SMP: stopping secondary CPUs
[330321.275382] Starting crashdump kernel...
[330321.279526] Bye!

Then while searching I came across the documented warning below.  In the above
instance, the panic after OOM kills happened after 3+ days of stress run (a
mixture of ttcp, cpuloadgen and fio).

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-configuration_tools-configuring_system_memory_capacity

Warning

Extreme values can damage your system. Setting min_free_kbytes to an 
extremely low value prevents the system from reclaiming memory, which 
can result in system hangs and OOM-killing processes. However, setting 
min_free_kbytes too high (for example, to 5–10% of total system memory) 
causes the system to enter an out-of-memory state immediately, resulting 
in the system spending too much time reclaiming memory.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-10 20:47 [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged Vijay Balakrishna
@ 2020-09-15 18:22   ` Pavel Tatashin
  2020-09-14 14:33 ` Michal Hocko
  2020-09-15 18:22   ` Pavel Tatashin
  2 siblings, 0 replies; 19+ messages in thread
From: Pavel Tatashin @ 2020-09-15 18:22 UTC (permalink / raw)
  To: Vijay Balakrishna
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Allen Pais, LKML, linux-mm

On Thu, Sep 10, 2020 at 4:47 PM Vijay Balakrishna
<vijayb@linux.microsoft.com> wrote:
>
> When memory is hotplug added or removed the min_free_kbytes must be
> recalculated based on what is expected by khugepaged.  Currently
> after hotplug, min_free_kbytes will be set to a lower default and higher
> default set when THP enabled is lost. This leaves the system with small
> min_free_kbytes which isn't suitable for systems especially with network
> intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> kills.
>
> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>
> Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
> Cc: stable@vger.kernel.org
> ---
>  include/linux/khugepaged.h |  5 +++++
>  mm/khugepaged.c            | 13 +++++++++++--
>  mm/memory_hotplug.c        |  3 +++
>  3 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index bc45ea1efbf7..c941b7377321 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -15,6 +15,7 @@ extern int __khugepaged_enter(struct mm_struct *mm);
>  extern void __khugepaged_exit(struct mm_struct *mm);
>  extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
>                                       unsigned long vm_flags);
> +extern void khugepaged_min_free_kbytes_update(void);
>  #ifdef CONFIG_SHMEM
>  extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
>  #else
> @@ -85,6 +86,10 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
>                                            unsigned long addr)
>  {
>  }
> +
> +static inline void khugepaged_min_free_kbytes_update(void)
> +{
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
>  #endif /* _LINUX_KHUGEPAGED_H */
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index cfa0dba5fd3b..4f7107476a6f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -56,6 +56,9 @@ enum scan_result {
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/huge_memory.h>
>
> +static struct task_struct *khugepaged_thread __read_mostly;
> +static DEFINE_MUTEX(khugepaged_mutex);
> +
>  /* default scan 8*512 pte (or vmas) every 30 second */
>  static unsigned int khugepaged_pages_to_scan __read_mostly;
>  static unsigned int khugepaged_pages_collapsed;
> @@ -2292,8 +2295,6 @@ static void set_recommended_min_free_kbytes(void)
>
>  int start_stop_khugepaged(void)
>  {
> -       static struct task_struct *khugepaged_thread __read_mostly;
> -       static DEFINE_MUTEX(khugepaged_mutex);
>         int err = 0;
>
>         mutex_lock(&khugepaged_mutex);
> @@ -2320,3 +2321,11 @@ int start_stop_khugepaged(void)
>         mutex_unlock(&khugepaged_mutex);
>         return err;
>  }
> +
> +void khugepaged_min_free_kbytes_update(void)
> +{
> +       mutex_lock(&khugepaged_mutex);
> +       if (khugepaged_enabled() && khugepaged_thread)
> +               set_recommended_min_free_kbytes();
> +       mutex_unlock(&khugepaged_mutex);
> +}
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9d5ab5d3ca0..3e19272c1fad 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -36,6 +36,7 @@
>  #include <linux/memblock.h>
>  #include <linux/compaction.h>
>  #include <linux/rmap.h>
> +#include <linux/khugepaged.h>
>
>  #include <asm/tlbflush.h>
>
> @@ -857,6 +858,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>         zone_pcp_update(zone);
>
>         init_per_zone_wmark_min();
> +       khugepaged_min_free_kbytes_update();
>
>         kswapd_run(nid);
>         kcompactd_run(nid);
> @@ -1600,6 +1602,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>         pgdat_resize_unlock(zone->zone_pgdat, &flags);
>
>         init_per_zone_wmark_min();
> +       khugepaged_min_free_kbytes_update();

The bug makes sense: min_free_kbytes should be consistent after
reboot or after hot-add, hence
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-15 15:48       ` Vijay Balakrishna
@ 2020-09-16  6:53         ` Michal Hocko
  2020-09-16 18:28           ` Vijay Balakrishna
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2020-09-16  6:53 UTC (permalink / raw)
  To: Vijay Balakrishna
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais, linux-kernel,
	linux-mm

On Tue 15-09-20 08:48:08, Vijay Balakrishna wrote:
> 
> 
> On 9/15/2020 1:18 AM, Michal Hocko wrote:
> > On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:
> > > 
> > > 
> > > On 9/14/2020 7:33 AM, Michal Hocko wrote:
> > > > On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
> > > > > When memory is hotplug added or removed the min_free_kbytes must be
> > > > > recalculated based on what is expected by khugepaged.  Currently
> > > > > after hotplug, min_free_kbytes will be set to a lower default and higher
> > > > > default set when THP enabled is lost. This leaves the system with small
> > > > > min_free_kbytes which isn't suitable for systems especially with network
> > > > > intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
> > > > > soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> > > > > kills.
> > > > 
> > > > Care to explain some more please? The whole point of increasing
> > > > min_free_kbytes for THP is to get a larger free memory with a hope that
> > > > huge pages will be more likely to appear. While this might help for
> > > > > other users that need high-order pages, it is definitely not the
> > > > primary reason behind it. Could you provide an example with some more
> > > > data?
> > > 
> > > Thanks Michal.  I haven't looked into THP as part of my investigation, so I
> > > cannot comment.
> > > 
> > > In our use case we are hotplug removing ~2GB of 8GB total (on our SoC)
> > > during normal reboot/shutdown.  This memory is hotplug hot-added as movable
> > > type via systemd late service during start-of-day.
> > > 
> > > In our stress test first we ran into HW WATCHDOG recovery, on enabling
> > > kernel watchdog we started seeing soft lockup hung task notices, failure
> > > > symptoms varied, where stack trace of hung tasks sometimes trying to
> > > > allocate GFP_ATOMIC memory, looping in do_notify_resume, NETDEVICE WATCHDOG
> > > > timeouts, OOM process kills etc.  During investigation we reran stress test
> > > without hotplug use case.  Surprisingly this run didn't encounter the said
> > > problems.  This led to comparing what is different between the two runs,
> > > while looking at various globals, studying hotplug code I uncovered the
> > > issue of failing to restore min_free_kbytes.  In particular on our 8GB SoC
> > > min_free_kbytes went down to 8703 from 22528 after hotplug add.
> > 
> > Did you try to increase min_free_kbytes manually after hot remove? Btw.
> 
> No, in our use case memory hot remove is done during shutdown.

I do not follow. If you are hotremoving during shutdown then how come
the value of min_free_kbytes matters at all?

> > I would consider oom killer invocation due to min_free_kbytes really
> > weird behavior. If anything the higher value would cause more memory
> > reclaim and potentially oom rather than smaller one.
> 
> Yes, we wondered about it too.  One panic stack trace (after many OOM kills)
> 
> [330321.174240] Out of memory and no killable processes...
> [330321.179658] Kernel panic - not syncing: System is deadlocked on memory
> [330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G        O      5.4.51-xxx #1
> [330321.196900] Hardware name: Overlake (DT)
> [330321.201038] Call trace:
> [330321.203660]  dump_backtrace+0x0/0x1d0
> [330321.207533]  show_stack+0x20/0x2c
> [330321.211048]  dump_stack+0xe8/0x150
> [330321.214656]  panic+0x18c/0x3b4
> [330321.217901]  out_of_memory+0x4c0/0x6e4
> [330321.221863]  __alloc_pages_nodemask+0xbdc/0x1c90
> [330321.226722]  alloc_pages_current+0x21c/0x2b0
> [330321.231220]  alloc_slab_page+0x1e0/0x7d8
> [330321.235361]  new_slab+0x2e8/0x2f8
> [330321.238874]  ___slab_alloc+0x45c/0x59c
> [330321.242835]  kmem_cache_alloc+0x2d4/0x360
> [330321.247065]  getname_flags+0x6c/0x2a8
> [330321.250938]  user_path_at_empty+0x3c/0x68
> [330321.255168]  do_readlinkat+0x7c/0x17c
> [330321.259039]  __arm64_sys_readlinkat+0x5c/0x70
> [330321.263627]  el0_svc_handler+0x1b8/0x32c
> [330321.267767]  el0_svc+0x10/0x14
> [330321.271026] SMP: stopping secondary CPUs
> [330321.275382] Starting crashdump kernel...
> [330321.279526] Bye!

Do you have the full oom splat? The fact that previous oom killer
invocations haven't helped and all the eligible tasks have been killed
and you still hit the oom would suggest there is a lot of memory
allocated without a direct relation to tasks. I fail to see how
min_free_kbytes would be related.

> Then while searching I came across the documented warning below.  In the above
> instance, the panic after OOM kills happened after 3+ days of stress run (a
> mixture of ttcp, cpuloadgen and fio).
> 
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-configuration_tools-configuring_system_memory_capacity
> 
> Warning
> 
> Extreme values can damage your system. Setting min_free_kbytes to an
> extremely low value prevents the system from reclaiming memory, which can
> result in system hangs and OOM-killing processes. However, setting
> min_free_kbytes too high (for example, to 5–10% of total system memory)
> causes the system to enter an out-of-memory state immediately, resulting in
> the system spending too much time reclaiming memory.

The auto tuned value should never reach such a low value to cause
problems.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread
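The two figures quoted in this thread (22528 before hotplug, 8703 after) can be related to the two places that set min_free_kbytes. The sketch below is a userspace approximation, not kernel code. It assumes the 5.4-era logic of init_per_zone_wmark_min() (square root of 16 times lowmem in kB, clamped to 128..65536) and of set_recommended_min_free_kbytes() (2 + 3*3 = 11 pageblocks per populated zone, capped at 5% of lowmem), with 4 kB pages and 512-page pageblocks. The helper names, parameters, and those constants are assumptions made for illustration.

```c
#include <assert.h>

/* Floor integer square root; stands in for the kernel's int_sqrt(). */
static unsigned long long isqrt_ull(unsigned long long x)
{
	unsigned long long r = 0;
	unsigned long long bit = 1ULL << 62;

	while (bit > x)
		bit >>= 2;
	while (bit) {
		if (x >= r + bit) {
			x -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/* Assumed default: sqrt(lowmem_kbytes * 16), clamped to [128, 65536] kB. */
unsigned long long default_min_free_kbytes(unsigned long long lowmem_kbytes)
{
	unsigned long long v = isqrt_ull(lowmem_kbytes * 16);

	if (v < 128)
		v = 128;
	if (v > 65536)
		v = 65536;
	return v;
}

/*
 * Assumed THP recommendation: two free pageblocks per populated zone plus
 * pcptypes * pcptypes more, never above 5% of lowmem, converted to kB.
 */
unsigned long long thp_recommended_min_free_kbytes(unsigned int nr_zones,
		unsigned long long pageblock_nr_pages,
		unsigned long long lowmem_pages,
		unsigned int page_size_kb)
{
	const unsigned int pcptypes = 3; /* assumed value of MIGRATE_PCPTYPES */
	unsigned long long pages, cap;

	assert(nr_zones > 0 && page_size_kb > 0);

	pages = pageblock_nr_pages * nr_zones * 2;
	pages += pageblock_nr_pages * nr_zones * pcptypes * pcptypes;

	cap = lowmem_pages / 20; /* 5% of lowmem */
	if (pages > cap)
		pages = cap;

	return pages * page_size_kb;
}
```

Under these assumptions, thp_recommended_min_free_kbytes(1, 512, lowmem_pages, 4) gives 11 * 512 * 4 = 22528 kB for any lowmem large enough not to hit the 5% cap, which matches the pre-hotplug value reported in the thread. That only one zone was populated when khugepaged started is an inference, not something stated here. Likewise, roughly 4.7 million kB of lowmem gives a default of about 8700 kB, the same range as the 8703 reported after hot-add.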

* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-16  6:53         ` Michal Hocko
@ 2020-09-16 18:28           ` Vijay Balakrishna
  2020-09-17 12:12             ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Vijay Balakrishna @ 2020-09-16 18:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais, linux-kernel,
	linux-mm



On 9/15/2020 11:53 PM, Michal Hocko wrote:
> On Tue 15-09-20 08:48:08, Vijay Balakrishna wrote:
>>
>>
>> On 9/15/2020 1:18 AM, Michal Hocko wrote:
>>> On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:
>>>>
>>>>
>>>> On 9/14/2020 7:33 AM, Michal Hocko wrote:
>>>>> On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
>>>>>> When memory is hotplug added or removed the min_free_kbytes must be
>>>>>> recalculated based on what is expected by khugepaged.  Currently
>>>>>> after hotplug, min_free_kbytes will be set to a lower default and higher
>>>>>> default set when THP enabled is lost. This leaves the system with small
>>>>>> min_free_kbytes which isn't suitable for systems especially with network
>>>>>> intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
>>>>>> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
>>>>>> kills.
>>>>>
>>>>> Care to explain some more please? The whole point of increasing
>>>>> min_free_kbytes for THP is to get a larger free memory with a hope that
>>>>> huge pages will be more likely to appear. While this might help for
>>>>> other users that need high-order pages, it is definitely not the
>>>>> primary reason behind it. Could you provide an example with some more
>>>>> data?
>>>>
>>>> Thanks Michal.  I haven't looked into THP as part of my investigation, so I
>>>> cannot comment.
>>>>
>>>> In our use case we are hotplug removing ~2GB of 8GB total (on our SoC)
>>>> during normal reboot/shutdown.  This memory is hotplug hot-added as movable
>>>> type via systemd late service during start-of-day.
>>>>
>>>> In our stress test first we ran into HW WATCHDOG recovery, on enabling
>>>> kernel watchdog we started seeing soft lockup hung task notices, failure
>>>> symptoms varied, where stack trace of hung tasks sometimes trying to
>>>> allocate GFP_ATOMIC memory, looping in do_notify_resume, NETDEVICE WATCHDOG
>>>> timeouts, OOM process kills etc.  During investigation we reran stress test
>>>> without hotplug use case.  Surprisingly this run didn't encounter the said
>>>> problems.  This led to comparing what is different between the two runs,
>>>> while looking at various globals, studying hotplug code I uncovered the
>>>> issue of failing to restore min_free_kbytes.  In particular on our 8GB SoC
>>>> min_free_kbytes went down to 8703 from 22528 after hotplug add.
>>>
>>> Did you try to increase min_free_kbytes manually after hot remove? Btw.
>>
>> No, in our use case memory hot remove is done during shutdown.
> 
> I do not follow. If you are hotremoving during shutdown then how come
> the value of min_free_kbytes matters at all?

We hot-add the memory during boot (the same memory that is hot-removed
during shutdown); the removed memory is treated as persistent.

> 
>>> I would consider oom killer invocation due to min_free_kbytes really
>>> weird behavior. If anything the higher value would cause more memory
>>> reclaim and potentially oom rather than smaller one.
>>
>> Yes, we wondered about it too.  One panic stack trace (after many OOM kills)
>>
>> [330321.174240] Out of memory and no killable processes...
>> [330321.179658] Kernel panic - not syncing: System is deadlocked on memory
>> [330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G        O      5.4.51-xxx #1
>> [330321.196900] Hardware name: Overlake (DT)
>> [330321.201038] Call trace:
>> [330321.203660]  dump_backtrace+0x0/0x1d0
>> [330321.207533]  show_stack+0x20/0x2c
>> [330321.211048]  dump_stack+0xe8/0x150
>> [330321.214656]  panic+0x18c/0x3b4
>> [330321.217901]  out_of_memory+0x4c0/0x6e4
>> [330321.221863]  __alloc_pages_nodemask+0xbdc/0x1c90
>> [330321.226722]  alloc_pages_current+0x21c/0x2b0
>> [330321.231220]  alloc_slab_page+0x1e0/0x7d8
>> [330321.235361]  new_slab+0x2e8/0x2f8
>> [330321.238874]  ___slab_alloc+0x45c/0x59c
>> [330321.242835]  kmem_cache_alloc+0x2d4/0x360
>> [330321.247065]  getname_flags+0x6c/0x2a8
>> [330321.250938]  user_path_at_empty+0x3c/0x68
>> [330321.255168]  do_readlinkat+0x7c/0x17c
>> [330321.259039]  __arm64_sys_readlinkat+0x5c/0x70
>> [330321.263627]  el0_svc_handler+0x1b8/0x32c
>> [330321.267767]  el0_svc+0x10/0x14
>> [330321.271026] SMP: stopping secondary CPUs
>> [330321.275382] Starting crashdump kernel...
>> [330321.279526] Bye!
> 
> Do you have the full oom splat? The fact that previous oom killer
> invocations haven't helped and all the eligible tasks have been killed
> and you still hit the oom would suggest there is a lot of memory
> allocated without a direct relation to tasks. I fail to see how
> min_free_kbytes would be related.

OOM splat below.  I see we had kmem leak detection turned on here.  We
haven't run stress with kmem leak detection since uncovering the low
min_free_kbytes.  During investigation we wanted to make sure there are
no kmem leaks; we didn't find any significant leaks.
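One way to sanity-check the per-order free lists in a report like the one that follows is to sum the count*size terms and compare the result with the total printed at the end of the line. The helper below is a hypothetical userspace sketch (the name buddy_free_kb is made up here) and is not part of the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the "count*sizekB" terms of a buddy free-list line from an OOM
 * report, e.g. "3138*4kB (UME) 38*8kB (UM) 0*16kB".
 * Returns the total in kB. Numbers not followed by "*<size>kB" (a node
 * id, the trailing "= NNNkB" total) are skipped.
 */
unsigned long buddy_free_kb(const char *line)
{
	unsigned long total = 0;
	const char *p = line;

	assert(line != NULL);

	while (*p) {
		char *end;
		unsigned long count, size;

		/* Advance to the next decimal digit. */
		if (*p < '0' || *p > '9') {
			p++;
			continue;
		}

		count = strtoul(p, &end, 10);
		if (*end != '*') {
			p = end;
			continue;
		}

		size = strtoul(end + 1, &end, 10);
		if (strncmp(end, "kB", 2) == 0)
			total += count * size;
		p = end;
	}
	return total;
}
```

Fed the "Node 0 Normal:" line of the report, the terms add up to the printed 12856 kB, nearly all of it in 4 kB blocks. The "free:" figure in the zone summary line differs slightly, presumably because the two lines are sampled at different moments.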

[330319.234959] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/dbus-broker.service,task=dbus-broker,pid=541,uid=999
[330319.251380] Out of memory: Killed process 541 (dbus-broker) total-vm:4400kB, anon-rss:892kB, file-rss:380kB, shmem-rss:0kB, UID:999 pgtables:44kB oom_score_adj:-900
[330319.267587] oom_reaper: reaped process 541 (dbus-broker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[330319.766059] systemd invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=1, oom_score_adj=0
[330319.776060] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G        O      5.4.51-xxx #1
[330319.790612] Call trace:
[330319.793240]  dump_backtrace+0x0/0x1d0
[330319.797112]  show_stack+0x20/0x2c
[330319.800628]  dump_stack+0xe8/0x150
[330319.804234]  dump_header+0x80/0x494
[330319.807925]  out_of_memory+0x480/0x6e4
[330319.811887]  __alloc_pages_nodemask+0xbdc/0x1c90
[330319.816745]  alloc_pages_current+0x21c/0x2b0
[330319.821244]  alloc_slab_page+0x1e0/0x7d8
[330319.825383]  new_slab+0x2e8/0x2f8
[330319.828895]  ___slab_alloc+0x45c/0x59c
[330319.832854]  kmem_cache_alloc+0x2d4/0x360
[330319.837086]  getname_flags+0x6c/0x2a8
[330319.840958]  user_path_at_empty+0x3c/0x68
[330319.845188]  do_readlinkat+0x7c/0x17c
[330319.849059]  __arm64_sys_readlinkat+0x5c/0x70
[330319.853648]  el0_svc_handler+0x1b8/0x32c
[330319.857788]  el0_svc+0x10/0x14
[330319.861064] Mem-Info:
[330319.863519] active_anon:60744 inactive_anon:109226 isolated_anon:0
                  active_file:6418 inactive_file:3869 isolated_file:2
                  unevictable:0 dirty:8 writeback:1 unstable:0
                  slab_reclaimable:34660 slab_unreclaimable:795718
                  mapped:1256 shmem:165765 pagetables:689 bounce:0
                  free:340962 free_pcp:4672 free_cma:0
[330319.898873] Node 0 active_anon:242976kB inactive_anon:436904kB active_file:25672kB inactive_file:15476kB unevictable:0kB isolated(anon):0kB isolated(file):8kB mapped:5024kB dirty:32kB writeback:4kB shmem:663060kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 73728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[330319.928124] Node 0 Normal free:12652kB min:14344kB low:19092kB high:23840kB active_anon:55340kB inactive_anon:60276kB active_file:60kB inactive_file:128kB unevictable:0kB writepending:4kB present:6220656kB managed:4750196kB mlocked:0kB kernel_stack:9568kB pagetables:2756kB bounce:0kB free_pcp:10056kB local_pcp:1376kB free_cma:0kB
[330319.958376] lowmem_reserve[]: 0 15360 15360
[330319.962814] Node 0 Movable free:1351196kB min:2544kB low:4508kB high:6472kB active_anon:188352kB inactive_anon:376856kB active_file:26120kB inactive_file:15308kB unevictable:0kB writepending:32kB present:1966080kB managed:1966080kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:8632kB local_pcp:1336kB free_cma:0kB
[330319.993157] lowmem_reserve[]: 0 0 0
[330319.996879] Node 0 Normal: 3138*4kB (UME) 38*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12856kB
[330320.009592] Node 0 Movable: 16382*4kB (M) 2980*8kB (M) 311*16kB (M) 77*32kB (M) 28*64kB (M) 5*128kB (M) 6*256kB (M) 1*512kB (M) 1*1024kB (M) 120*2048kB (M) 245*4096kB (M) = 1351592kB
[330320.026541] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[330320.035631] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
[330320.044543] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[330320.053363] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
[330320.062004] 176215 total pagecache pages
[330320.066165] 2046684 pages RAM
[330320.069339] 0 pages HighMem/MovableOnly
[330320.073410] 367615 pages reserved
[330320.076943] 0 pages hwpoisoned
[330320.080190] Unreclaimable slab info:
[330320.083991] Name                      Used          Total
[330320.090244] bio-3                    560KB        672KB
[330320.095747] bio-2                    757KB        832KB
[330320.101254] nf-frags                   7KB         15KB
[330320.106763] fib6_nodes                16KB         16KB
[330320.112270] ip6_dst_cache             70KB         70KB
[330320.117779] RAWv6                    154KB        154KB
[330320.123296] UDPv6                    246KB        246KB
[330320.128806] TCPv6                    146KB        146KB
[330320.134315] nf_conntrack              84KB         94KB
[330320.139823] io                        66KB         80KB
[330320.145332] sd_ext_cdb                 3KB          3KB
[330320.150838] virtio_scsi_cmd           16KB         16KB
[330320.156345] sgpool-128                29KB         29KB
[330320.161852] sgpool-64                 31KB         31KB
[330320.167359] sgpool-32                 31KB         31KB
[330320.172867] sgpool-16                 15KB         15KB
[330320.178374] sgpool-8                   7KB          7KB
[330320.183880] mqueue_inode_cache         31KB         31KB
[330320.189479] jbd2_inode                35KB         43KB
[330320.194989] ext4_system_zone          21KB         55KB
[330320.200497] ext4_bio_post_read_ctx         15KB         15KB
[330320.206453] kioctx                   255KB        255KB
[330320.211960] aio_kiocb                 54KB         77KB
[330320.217477] dio                      196KB        323KB
[330320.222985] bio-1                      7KB          7KB
[330320.228499] UNIX                     308KB        369KB
[330320.234009] tcp_bind_bucket           20KB         20KB
[330320.239518] ip_fib_trie               16KB         16KB
[330320.245024] ip_fib_alias              15KB         15KB
[330320.250537] ip_dst_cache              64KB         64KB
[330320.256047] RAW                      158KB        158KB
[330320.261553] UDP                      247KB        247KB
[330320.267060] tw_sock_TCP               15KB         15KB
[330320.272566] request_sock_TCP          30KB         30KB
[330320.278073] TCP                      278KB        278KB
[330320.283580] hugetlbfs_inode_cache         62KB         62KB
[330320.289446] eventpoll_pwq             31KB         31KB
[330320.294953] eventpoll_epi             31KB         31KB
[330320.300460] inotify_inode_mark         31KB         31KB
[330320.306061] request_queue            467KB        499KB
[330320.311568] blkdev_ioc                61KB         61KB
[330320.317087] bio-0                    259KB        600KB
[330320.322603] biovec-max              3060KB       3718KB
[330320.328112] biovec-64                252KB        315KB
[330320.333619] biovec-16                 62KB         78KB
[330320.339127] khugepaged_mm_slot         31KB         35KB
[330320.344723] user_namespace           122KB        122KB
[330320.350230] uid_cache                 64KB         64KB
[330320.355738] iommu_iova              1076KB       1076KB
[330320.361245] dmaengine-unmap-2          4KB          4KB
[330320.366754] skbuff_fclone_cache        128KB        160KB
[330320.374394] skbuff_head_cache      79402KB     106265KB
[330320.379908] file_lock_cache           62KB         92KB
[330320.385416] file_lock_ctx             42KB         47KB
[330320.390924] fsnotify_mark_connector         46KB         51KB
[330320.396969] net_namespace             64KB         64KB
[330320.402477] task_delay_info           66KB         78KB
[330320.407984] taskstats                 63KB         63KB
[330320.413491] proc_dir_entry           279KB        289KB
[330320.418998] pde_opener                31KB         31KB
[330320.424507] seq_file                  63KB         94KB
[330320.430014] sigqueue                  55KB         55KB
[330320.435525] shmem_inode_cache       1086KB       1221KB
[330320.441036] kernfs_iattrs_cache        782KB        826KB
[330320.446746] kernfs_node_cache       7943KB       8221KB
[330320.452306] mnt_cache               1579KB       2756KB
[330320.457813] filp                     265KB        265KB
[330320.463321] names_cache            21543KB      21543KB
[330320.468833] hashtab_node             115KB        131KB
[330320.474341] ebitmap_node             626KB        641KB
[330320.479849] avtab_node              1047KB       1063KB
[330320.485361] avc_node                 118KB        118KB
[330320.490868] iint_cache                23KB         23KB
[330320.496376] lsm_inode_cache         8578KB       8578KB
[330320.501902] lsm_file_cache           147KB        552KB
[330320.507416] key_jar                  157KB        157KB
[330320.512927] nsproxy                   31KB         31KB
[330320.518436] vm_area_struct            89KB        127KB
[330320.523944] mm_struct                252KB        252KB
[330320.529451] fs_cache                  64KB         64KB
[330320.534959] files_cache              255KB        255KB
[330320.540468] signal_cache             507KB        569KB
[330320.545979] sighand_cache            633KB        841KB
[330320.551488] task_struct             1721KB       1940KB
[330320.556997] cred_jar                 119KB        136KB
[330320.562504] anon_vma_chain            35KB         47KB
[330320.568011] anon_vma                  73KB         95KB
[330320.573520] pid                      101KB        120KB
[330320.579029] numa_policy                3KB          3KB
[330320.584536] trace_event_file         262KB        262KB
[330320.590041] ftrace_event_field        184KB        184KB
[330320.595638] pool_workqueue           128KB        128KB
[330320.601146] task_group                64KB         64KB
[330320.606652] vmap_area                 77KB         78KB
[330320.612159] page->ptl                517KB        517KB
[330320.617665] kmemleak_scan_area         47KB         47KB
[330320.623262] kmemleak_object      2449340KB    2449340KB
[330320.628773] kmalloc-8k              4848KB       4928KB
[330320.634423] kmalloc-4k             48944KB      61856KB
[330320.639946] kmalloc-2k             11768KB      12480KB
[330320.645453] kmalloc-1k             10752KB      10752KB
[330320.651049] kmalloc-512            87024KB      94124KB
[330320.656561] kmalloc-256             2433KB       2528KB
[330320.662359] kmalloc-128            24071KB      29104KB
[330320.667869] kmem_cache_node          867KB        896KB
[330320.673377] kmem_cache              2162KB       2171KB
[330320.678881] Tasks state (memory values in pages):
[330320.683848] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes 
swapents oom_score_adj name
[330320.692867] [    238]     0   238     3446     1394    61440 
0         -1000 systemd-udevd
[330320.702174] [  20024] 62412 20008   313895        0   176128 
0             0 test_ntttcp
[330320.711281] [  20891] 64250 20885    67810        0    49152 
0             0 test_cpuloadgen
[330320.720743] [  20704] 62689 20704   157371       23    90112 
0             0 fio
[330320.729119] [  20708] 62689 20708   157372        5    90112 
0             0 fio
[330320.737496] [  20709] 62689 20709   157373        6    90112 
0             0 fio
[330320.745874] [  20710] 62689 20710   157374        6    90112 
0             0 fio
[330320.754267] [   4698]     0  4698     2231        0    57344 
0             0 umount
[330320.762919] [   4699]     0  4699     2231        0    57344 
0             0 umount
[330320.771565] [   4700]     0  4700     2231        0    53248 
0             0 umount
[330320.780212] [   4701]     0  4701     2231        0    57344 
0             0 umount
[330320.788858] [   4705]     0  4705     2231        0    53248 
0             0 umount
[330320.797505] [   4706]     0  4706     2231        0    57344 
0             0 umount
[330320.806152] [   4707]     0  4707     2231        0    61440 
0             0 umount
[330320.814798] [   4709]     0  4709     2265        0    57344 
0             0 veritysetup
[330320.823891] [   4715]     0  4715     2231        0    61440 
0             0 umount
[330320.832537] [   4716]     0  4716     2231        0    57344 
0             0 umount
[330320.841185] [   4717]     0  4717     2231        0    57344 
0             0 umount
[330320.849832] [   4721]     0  4721     2231        0    57344 
0             0 umount
[330320.858478] [   4722]     0  4722     2231        0    57344 
0             0 umount
[330320.867124] [   4723]     0  4723     2231        0    57344 
0             0 umount
[330320.875770] [   4728]     0  4728     2231        0    61440 
0             0 umount
[330320.884423] [   4729]     0  4729     2231        0    57344 
0             0 umount
[330320.893075] [   4730]     0  4730     2231        0    57344 
0             0 umount
[330320.901722] [   4731]     0  4731     2231        0    57344 
0             0 umount
[330320.910369] [   4732]     0  4732     2231        0    61440 
0             0 umount
[330320.919016] [   4733]     0  4733     2231        0    57344 
0             0 umount
[330320.927662] [   4735]     0  4735     2231        0    61440 
0             0 umount
[330320.936307] [   4736]     0  4736     2231        0    61440 
0             0 umount
[330320.944953] [   4737]     0  4737     2231        0    57344 
0             0 umount
[330320.953599] [   4738]     0  4738     2231        0    61440 
0             0 umount
[330320.962245] [   4739]     0  4739     2231        0    57344 
0             0 umount
[330320.970891] [   4740]     0  4740     2231        0    53248 
0             0 umount
[330320.979536] [   4744]     0  4744     2231        0    61440 
0             0 umount
[330320.988187] [   4746]     0  4746     2231        0    57344 
0             0 umount
[330320.996832] [   4747]     0  4747     2231        0    61440 
0             0 umount
[330321.005479] [   4757]     0  4757     2265        0    53248 
0             0 veritysetup
[330321.014573] [   4758]     0  4758     2231        0    57344 
0             0 umount
[330321.023225] [   4759]     0  4759     2231        0    57344 
0             0 umount
[330321.031872] [   4760]     0  4760     2231        0    61440 
0             0 umount
[330321.040519] [   5922]     0  5922     3012        0    61440 
0             0 systemd-user-ru
[330321.049972] [   6557]     0  6557     2231        0    61440 
0             0 umount
[330321.058618] [   6558]     0  6558     2231        0    61440 
0             0 umount
[330321.067264] [   6563]     0  6563     2231        0    57344 
0             0 umount
[330321.075910] [   6567]     0  6567     2231        0    57344 
0             0 umount
[330321.084556] [   6569]     0  6569     2231        0    53248 
0             0 umount
[330321.093194] [   6570]     0  6570     2231        0    65536 
0             0 umount
[330321.101840] [   6575]     0  6575     2231        0    57344 
0             0 umount
[330321.110485] [   6578]     0  6578     2231        0    61440 
0             0 umount
[330321.119132] [   6579]     0  6579     2231        0    57344 
0             0 umount
[330321.127778] [   6580]     0  6580     2231        0    61440 
0             0 umount
[330321.136425] [   7215]     0  7215     5087        0    69632 
0             0 systemd-journal
[330321.145879] [   8410]     0  8410     5087        0    65536 
0             0 systemd-journal
[330321.155336] [   9603]     0  9603     5087        0    73728 
0             0 systemd-journal
[330321.164790] [  10366]     0 10366     3012        0    61440 
0             0 systemd-user-ru
[330321.174240] Out of memory and no killable processes...
[330321.179658] Kernel panic - not syncing: System is deadlocked on memory
[330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G         O      5.4.51-xxx #1
[330321.201038] Call trace:
[330321.203660]  dump_backtrace+0x0/0x1d0
[330321.207533]  show_stack+0x20/0x2c
[330321.211048]  dump_stack+0xe8/0x150
[330321.214656]  panic+0x18c/0x3b4
[330321.217901]  out_of_memory+0x4c0/0x6e4
[330321.221863]  __alloc_pages_nodemask+0xbdc/0x1c90
[330321.226722]  alloc_pages_current+0x21c/0x2b0
[330321.231220]  alloc_slab_page+0x1e0/0x7d8
[330321.235361]  new_slab+0x2e8/0x2f8
[330321.238874]  ___slab_alloc+0x45c/0x59c
[330321.242835]  kmem_cache_alloc+0x2d4/0x360
[330321.247065]  getname_flags+0x6c/0x2a8
[330321.250938]  user_path_at_empty+0x3c/0x68
[330321.255168]  do_readlinkat+0x7c/0x17c
[330321.259039]  __arm64_sys_readlinkat+0x5c/0x70
[330321.263627]  el0_svc_handler+0x1b8/0x32c
[330321.267767]  el0_svc+0x10/0x14
[330321.271026] SMP: stopping secondary CPUs
[330321.275382] Starting crashdump kernel...
[330321.279526] Bye!

> 
>> Then while searching I came across documented warning below.  In above
>> instance panic after OOM kills happened after 3+ days of stress run (a
>> mixture of ttcp, cpuloadgen and fio).
>>
>> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-configuration_tools-configuring_system_memory_capacity
>>
>> Warning
>>
>> Extreme values can damage your system. Setting min_free_kbytes to an
>> extremely low value prevents the system from reclaiming memory, which can
>> result in system hangs and OOM-killing processes. However, setting
>> min_free_kbytes too high (for example, to 5–10% of total system memory)
>> causes the system to enter an out-of-memory state immediately, resulting in
>> the system spending too much time reclaiming memory.
> 
> The auto tuned value should never reach such a low value to cause
> problems.

The auto tuned value is incorrect post hotplug memory operation, in our 
use case memory hot add occurs very early during boot.

Thanks,
Vijay


* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-16 18:28           ` Vijay Balakrishna
@ 2020-09-17 12:12             ` Michal Hocko
  2020-09-17 18:03               ` Vijay Balakrishna
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2020-09-17 12:12 UTC (permalink / raw)
  To: Vijay Balakrishna
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais, linux-kernel,
	linux-mm

On Wed 16-09-20 11:28:40, Vijay Balakrishna wrote:
[...]
> OOM splat below.  I see we had kmem leak detection turned on here.  We
> haven't run stress with kmem leak detection since uncovering low
> min_free_kbytes.  During investigation we wanted to make sure there is no
> kmem leaks, we didn't find significant leaks detected.
> 
> [330319.766059] systemd invoked oom-killer:
> gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=1, oom_score_adj=0

[...]
> [330319.861064] Mem-Info:
> [330319.863519] active_anon:60744 inactive_anon:109226 isolated_anon:0
>                  active_file:6418 inactive_file:3869 isolated_file:2
>                  unevictable:0 dirty:8 writeback:1 unstable:0
>                  slab_reclaimable:34660 slab_unreclaimable:795718
>                  mapped:1256 shmem:165765 pagetables:689 bounce:0
>                  free:340962 free_pcp:4672 free_cma:0

The memory consumption is predominantly in slab (unreclaimable). Only
~8% of the memory is on LRUs (anonymous + file). Slab (both reclaimable
and unreclaimable) is ~40%. So there is still a lot of memory
unaccounted (direct users of the page allocator). This would partially
explain why the oom killer is not able to make progress and eventually
panics because it is the kernel which is blowing the memory consumption.
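
The percentages above can be re-derived from the quoted Mem-Info page
counts. The following is only a sketch: the 4 KiB page size and the 8 GiB
total (the figure given elsewhere in this thread for the SoC) are
assumptions, and the helper names are made up for illustration.

```c
#include <stdio.h>

/* Assumptions, not stated in the splat itself: 4 KiB pages, 8 GiB of RAM. */
#define ASSUMED_PAGE_KB   4UL
#define ASSUMED_TOTAL_KB  (8UL * 1024 * 1024)

/* Page counts copied from the quoted Mem-Info lines. */
static const unsigned long active_anon = 60744, inactive_anon = 109226;
static const unsigned long active_file = 6418, inactive_file = 3869;
static const unsigned long slab_reclaimable = 34660;
static const unsigned long slab_unreclaimable = 795718;
static const unsigned long free_pages = 340962;

/* Pages on the anonymous and file LRU lists. */
static unsigned long lru_pages(void)
{
    return active_anon + inactive_anon + active_file + inactive_file;
}

/* Reclaimable plus unreclaimable slab pages. */
static unsigned long slab_pages(void)
{
    return slab_reclaimable + slab_unreclaimable;
}

/* Share of the assumed total memory, in percent. */
static double share_percent(unsigned long pages)
{
    return 100.0 * (double)(pages * ASSUMED_PAGE_KB) / (double)ASSUMED_TOTAL_KB;
}

/* Prints the three figures discussed in the reply. */
static void print_breakdown(void)
{
    printf("LRU:  %.1f%%\n", share_percent(lru_pages()));
    printf("slab: %.1f%%\n", share_percent(slab_pages()));
    printf("free: %lu kB\n", free_pages * ASSUMED_PAGE_KB);
}
```

Under these assumptions the LRU share lands between 8% and 9%, slab just
under 40%, and free memory a little over 1.3 GiB, which is consistent with
the rounded figures used in the text.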

There is still ~1G free memory but the problem is that this is a
GFP_KERNEL request which is not allowed to consume Movable memory.
Zone normal is depleted and therefore it cannot satisfy this request
even when there are some order-1 pages available.

> [330319.928124] Node 0 Normal free:12652kB min:14344kB low:19092kB
> high:23840kB active_anon:55340kB inactive_anon:60276kB active_file:60kB
> inactive_file:128kB unevictable:0kB writepending:4kB present:6220656kB
> managed:4750196kB mlocked:0kB kernel_stack:9568kB pagetables:2756kB
> bounce:0kB free_pcp:10056kB local_pcp:1376kB free_cma:0kB
[...]
> [330319.996879] Node 0 Normal: 3138*4kB (UME) 38*8kB (UM) 0*16kB 0*32kB
> 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12856kB
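
As a rough cross-check of the two quoted zone lines, the per-order buddy
counts can be summed and compared with the min watermark. This is a
simplified sketch: the 4 KiB page size is an assumption, above_watermark()
is a hypothetical helper, and the kernel's real watermark check has
additional terms (lowmem reserves, per-flag adjustments).

```c
#include <stddef.h>

/* Assumption: 4 KiB base pages, as the "*4kB" column suggests. */
#define ASSUMED_PAGE_KB 4UL

/* Free block counts for order 0..10 from the "Node 0 Normal:" line. */
static const unsigned long normal_free_blocks[] = {
    3138, 38, 0, 0, 0, 0, 0, 0, 0, 0, 0
};
#define NR_ORDERS (sizeof(normal_free_blocks) / sizeof(normal_free_blocks[0]))

/* Total free memory on the buddy lists, in kB. */
static unsigned long buddy_free_kb(const unsigned long *blocks, size_t n)
{
    unsigned long kb = 0;

    for (size_t order = 0; order < n; order++)
        kb += blocks[order] * (ASSUMED_PAGE_KB << order);
    return kb;
}

/*
 * Simplified, hypothetical check: would free memory still be above the
 * watermark after handing out one block of the requested order?
 */
static int above_watermark(unsigned long free_kb, unsigned long mark_kb,
                           unsigned int order)
{
    unsigned long request_kb = ASSUMED_PAGE_KB << order;

    return free_kb > mark_kb + request_kb;
}
```

With the quoted numbers, buddy_free_kb(normal_free_blocks, NR_ORDERS)
matches the 12856kB total printed in the log, and both that figure and the
free:12652kB value sit below min:14344kB, so an order-1 request fails this
simplified check even though order-1 blocks exist.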

I do not see the state of swap in the oom splat so I assume you have
swap disabled. If that is the case then the memory reclaim cannot really
do much for this request. There is almost no page cache to reclaim.

That being said I do not see how an increased min_free_kbytes could help
for this particular OOM situation. If there is really any relation it is
more of an unintended side effect.

[...]
> > > Extreme values can damage your system. Setting min_free_kbytes to an
> > > extremely low value prevents the system from reclaiming memory, which can
> > > result in system hangs and OOM-killing processes. However, setting
> > > min_free_kbytes too high (for example, to 5–10% of total system memory)
> > > causes the system to enter an out-of-memory state immediately, resulting in
> > > the system spending too much time reclaiming memory.
> > 
> > The auto tuned value should never reach such a low value to cause
> > problems.
> 
> The auto tuned value is incorrect post hotplug memory operation, in our use
> case memory hot add occurs very early during boot.
 
Define incorrect. What are the actual values? Have you tried to increase
the value manually after the hotplug?

-- 
Michal Hocko
SUSE Labs


* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-17 12:12             ` Michal Hocko
@ 2020-09-17 18:03               ` Vijay Balakrishna
  2020-09-18  5:45                 ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Vijay Balakrishna @ 2020-09-17 18:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais, linux-kernel,
	linux-mm



On 9/17/2020 5:12 AM, Michal Hocko wrote:
> On Wed 16-09-20 11:28:40, Vijay Balakrishna wrote:
> [...]
>> OOM splat below.  I see we had kmem leak detection turned on here.  We
>> haven't run stress with kmem leak detection since uncovering low
>> min_free_kbytes.  During investigation we wanted to make sure there is no
>> kmem leaks, we didn't find significant leaks detected.
>>
>> [330319.766059] systemd invoked oom-killer:
>> gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=1, oom_score_adj=0
> 
> [...]
>> [330319.861064] Mem-Info:
>> [330319.863519] active_anon:60744 inactive_anon:109226 isolated_anon:0
>>                   active_file:6418 inactive_file:3869 isolated_file:2
>>                   unevictable:0 dirty:8 writeback:1 unstable:0
>>                   slab_reclaimable:34660 slab_unreclaimable:795718
>>                   mapped:1256 shmem:165765 pagetables:689 bounce:0
>>                   free:340962 free_pcp:4672 free_cma:0
> 
> The memory consumption is predominantly in slab (unreclaimable). Only
> ~8% of the memory is on LRUs (anonymous + file). Slab (both reclaimable
> and unreclaimable) is ~40%. So there is still a lot of memory
> unaccounted (direct users of the page allocator). This would partially
> explain why the oom killer is not able to make progress and eventually
> panics because it is the kernel which is blowing the memory consumption.
> 
> There is still ~1G free memory but the problem is that this is a
> GFP_KERNEL request which is not allowed to consume Movable memory.
> Zone normal is depleted and therefore it cannot satisfy this request
> even when there are some order-1 pages available.
> 
>> [330319.928124] Node 0 Normal free:12652kB min:14344kB low:19092kB
>> high:23840kB active_anon:55340kB inactive_anon:60276kB active_file:60kB
>> inactive_file:128kB unevictable:0kB writepending:4kB present:6220656kB
>> managed:4750196kB mlocked:0kB kernel_stack:9568kB pagetables:2756kB
>> bounce:0kB free_pcp:10056kB local_pcp:1376kB free_cma:0kB
> [...]
>> [330319.996879] Node 0 Normal: 3138*4kB (UME) 38*8kB (UM) 0*16kB 0*32kB
>> 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12856kB
> 
> I do not see the state of swap in the oom splat so I assume you have
> swap disabled. If that is the case then the memory reclaim cannot really
> do much for this request. There is almost no page cache to reclaim.

No swap configured in our system.
> 
> That being said I do not see how an increased min_free_kbytes could help
> for this particular OOM situation. If there is really any relation it is
> more of an unintended side effect.

I haven't had a chance to rerun stress with kmem leak detection to know 
if we still see OOM kills after min_free_kbytes restore.
> 
> [...]
>>>> Extreme values can damage your system. Setting min_free_kbytes to an
>>>> extremely low value prevents the system from reclaiming memory, which can
>>>> result in system hangs and OOM-killing processes. However, setting
>>>> min_free_kbytes too high (for example, to 5–10% of total system memory)
>>>> causes the system to enter an out-of-memory state immediately, resulting in
>>>> the system spending too much time reclaiming memory.
>>>
>>> The auto tuned value should never reach such a low value to cause
>>> problems.
>>
>> The auto tuned value is incorrect post hotplug memory operation, in our use
>> case memory hot add occurs very early during boot.
>   
> Define incorrect. What are the actual values? Have you tried to increase
> the value manually after the hotplug?

In our case SoC with 8GB memory, system tuned min_free_kbytes
- first to 22528
- we perform memory hot add very early in boot
- now min_free_kbytes is 8703
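
For reference, both values can be approximately reproduced from the two
tuning formulas as I recall them from mm/khugepaged.c and mm/page_alloc.c
in kernels of this era. Treat this as a sketch under stated assumptions, not
the kernel code itself: 4 KiB pages, 2 MiB pageblocks, a single populated
zone at the time khugepaged ran, a 128..65536 clamp, and the managed size of
the Normal zone from the OOM report (4750196kB) used as a stand-in for
lowmem. The function names are hypothetical.

```c
#define ASSUMED_PAGE_SHIFT       12     /* assumption: 4 KiB pages */
#define ASSUMED_PAGEBLOCK_PAGES  512UL  /* assumption: 2 MiB pageblocks */
#define MIGRATE_PCPTYPES         3UL    /* unmovable, movable, reclaimable */

/* Integer square root by binary search (floor of sqrt(x)). */
static unsigned long long isqrt_ull(unsigned long long x)
{
    unsigned long long lo = 0, hi = 1ULL << 32;

    while (lo < hi) {
        unsigned long long mid = lo + (hi - lo + 1) / 2;

        if (mid <= x / mid)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Default auto-tuning: sqrt(lowmem_kbytes * 16), clamped to 128..65536. */
static unsigned long default_min_free_kbytes(unsigned long long lowmem_kbytes)
{
    unsigned long long v = isqrt_ull(lowmem_kbytes * 16);

    if (v < 128)
        v = 128;
    if (v > 65536)
        v = 65536;
    return (unsigned long)v;
}

/* khugepaged recommendation: pageblock reserves, capped at 5% of lowmem. */
static unsigned long thp_min_free_kbytes(unsigned long nr_zones,
                                         unsigned long lowmem_pages)
{
    unsigned long pages = ASSUMED_PAGEBLOCK_PAGES * nr_zones * 2;

    pages += ASSUMED_PAGEBLOCK_PAGES * nr_zones *
             MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
    if (pages > lowmem_pages / 20)
        pages = lowmem_pages / 20;
    return pages << (ASSUMED_PAGE_SHIFT - 10);
}
```

With one populated zone the khugepaged formula gives exactly 22528, and the
default formula applied to 4750196kB gives 8717, close to the 8703 seen
after hot add (the kernel's actual lowmem input is somewhat smaller than
the managed size).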

Before looking at code, first I manually restored min_free_kbytes soon 
after boot, reran stress and didn't notice symptoms I mentioned in 
change log.

Thanks,
Vijay


* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-17 18:03               ` Vijay Balakrishna
@ 2020-09-18  5:45                 ` Michal Hocko
  2020-09-18 15:32                   ` Vijay Balakrishna
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2020-09-18  5:45 UTC (permalink / raw)
  To: Vijay Balakrishna
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais, linux-kernel,
	linux-mm

On Thu 17-09-20 11:03:56, Vijay Balakrishna wrote:
[...]
> > > The auto tuned value is incorrect post hotplug memory operation, in our use
> > > case memory hot add occurs very early during boot.
> > Define incorrect. What are the actual values? Have you tried to increase
> > the value manually after the hotplug?
> 
> In our case SoC with 8GB memory, system tuned min_free_kbytes
> - first to 22528
> - we perform memory hot add very early in boot

What was the original and after-the-hotplug size of memory and layout?
I suspect that all the hotplugged memory is in Movable zone, right?

> - now min_free_kbytes is 8703
> 
> Before looking at code, first I manually restored min_free_kbytes soon after
> boot, reran stress and didn't notice symptoms I mentioned in change log.

This is really surprising and I strongly suspect that an earlier reclaim
just changed the timing enough so that workload has spread the memory
pressure over a longer time and that might have been enough to recycle
some of the unreclaimable memory due to its natural life time. But this
is a pure speculation. Much more data would be needed to analyze this.

In any case your stress test is overprovisioning your Normal zone and
increased min_free_kbytes just papers over the sizing problem.
-- 
Michal Hocko
SUSE Labs


* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-18  5:45                 ` Michal Hocko
@ 2020-09-18 15:32                   ` Vijay Balakrishna
  2020-09-21  7:00                     ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Vijay Balakrishna @ 2020-09-18 15:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais, linux-kernel,
	linux-mm



On 9/17/2020 10:45 PM, Michal Hocko wrote:
> On Thu 17-09-20 11:03:56, Vijay Balakrishna wrote:
> [...]
>>>> The auto tuned value is incorrect post hotplug memory operation, in our use
>>>> case memory hot add occurs very early during boot.
>>> Define incorrect. What are the actual values? Have you tried to increase
>>> the value manually after the hotplug?
>>
>> In our case SoC with 8GB memory, system tuned min_free_kbytes
>> - first to 22528
>> - we perform memory hot add very early in boot
> 
> What was the original and after-the-hotplug size of memory and layout?
> I suspect that all the hotplugged memory is in Movable zone, right?

Yes, added ~1.92GB as Movable type, booting with 6GB at start.

> 
>> - now min_free_kbytes is 8703
>>
>> Before looking at code, first I manually restored min_free_kbytes soon after
>> boot, reran stress and didn't notice symptoms I mentioned in change log.
> 
> This is really surprising and I strongly suspect that an earlier reclaim
> just changed the timing enough so that workload has spread the memory
> pressure over a longer time and that might have been enough to recycle
> some of the unreclaimable memory due to its natural life time. But this
> is a pure speculation. Much more data would be needed to analyze this.
> 
> In any case your stress test is overprovisioning your Normal zone and
> increased min_free_kbytes just papers over the sizing problem.
> 

It is a synthetic workload, likely not sized I need to check.  I feel 
having higher min_free_kbytes made GFP_ATOMIC allocations not to fail. 
I have seen NETDEV WATCHDOG timeout with stacktrace trying to allocate 
memory, looping in net rx receive path.

Thanks,
Vijay


* Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
  2020-09-18 15:32                   ` Vijay Balakrishna
@ 2020-09-21  7:00                     ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2020-09-21  7:00 UTC (permalink / raw)
  To: Vijay Balakrishna
  Cc: Andrew Morton, Kirill A. Shutemov, Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais, linux-kernel,
	linux-mm

On Fri 18-09-20 08:32:13, Vijay Balakrishna wrote:
> 
> 
> On 9/17/2020 10:45 PM, Michal Hocko wrote:
> > On Thu 17-09-20 11:03:56, Vijay Balakrishna wrote:
> > [...]
> > > > > The auto tuned value is incorrect post hotplug memory operation, in our use
> > > > > case memory hot add occurs very early during boot.
> > > > Define incorrect. What are the actual values? Have you tried to increase
> > > > the value manually after the hotplug?
> > > 
> > > In our case SoC with 8GB memory, system tuned min_free_kbytes
> > > - first to 22528
> > > - we perform memory hot add very early in boot
> > 
> > What was the original and after-the-hotplug size of memory and layout?
> > I suspect that all the hotplugged memory is in Movable zone, right?
> 
> Yes, added ~1.92GB as Movable type, booting with 6GB at start.
> 
> > 
> > > - now min_free_kbytes is 8703
> > > 
> > > Before looking at code, first I manually restored min_free_kbytes soon after
> > > boot, reran stress and didn't notice symptoms I mentioned in change log.
> > 
> > This is really surprising and I strongly suspect that an earlier reclaim
> > just changed the timing enough so that workload has spread the memory
> > > pressure over a longer time and that might have been enough to recycle
> > some of the unreclaimable memory due to its natural life time. But this
> > is a pure speculation. Much more data would be needed to analyze this.
> > 
> > > In any case your stress test is overprovisioning your Normal zone and
> > increased min_free_kbytes just papers over the sizing problem.
> > 
> 
> It is a synthetic workload, likely not sized I need to check.  I feel having
> higher min_free_kbytes made GFP_ATOMIC allocations not to fail.

Yes a higher min_free_kbytes will help GFP_ATOMIC. But only to some
degree. But nobody should depend on an atomic allocation for
correctness. It is just way too easy to fail under a higher memory
pressure.
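
To illustrate the "only to some degree" part: as far as I recall the
watermark check in kernels of this era, an atomic request is tested against
a reduced min watermark (halved for __GFP_HIGH, then cut by a further
quarter for the "harder" case), so raising min_free_kbytes raises the
reserve atomic requests can dip into, but that reserve stays finite. The
sketch below uses hypothetical flag names and whole-system kB values taken
from this thread; the kernel does this per zone, in pages.

```c
/* Hypothetical stand-ins for the kernel's ALLOC_HIGH / ALLOC_HARDER. */
#define SKETCH_ALLOC_HIGH    0x1u
#define SKETCH_ALLOC_HARDER  0x2u

/* Min watermark as seen by a request with the given flags. */
static unsigned long effective_min(unsigned long min, unsigned int flags)
{
    if (flags & SKETCH_ALLOC_HIGH)
        min -= min / 2;
    if (flags & SKETCH_ALLOC_HARDER)
        min -= min / 4;
    return min;
}

/* Memory below min that only atomic-style requests may still use. */
static unsigned long atomic_headroom(unsigned long min)
{
    return min - effective_min(min, SKETCH_ALLOC_HIGH | SKETCH_ALLOC_HARDER);
}
```

Under these assumptions, atomic_headroom(22528) is 14080 while
atomic_headroom(8703) is 5439, so the lower post-hotplug value leaves
roughly 40% of the headroom that the khugepaged value did.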

> I have seen
> NETDEV WATCHDOG timeout with stacktrace trying to allocate memory, looping
> in net rx receive path.

You should talk to net folks.
-- 
Michal Hocko
SUSE Labs


end of thread, other threads:[~2020-09-21  7:00 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-10 20:47 [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged Vijay Balakrishna
2020-09-10 22:01 ` Kirill A. Shutemov
2020-09-10 22:28   ` Pavel Tatashin
2020-09-10 22:28     ` Pavel Tatashin
2020-09-10 22:56     ` Vijay Balakrishna
2020-09-15  5:04   ` Vijay Balakrishna
2020-09-14 14:33 ` Michal Hocko
2020-09-14 16:57   ` Vijay Balakrishna
2020-09-15  8:18     ` Michal Hocko
2020-09-15 15:48       ` Vijay Balakrishna
2020-09-16  6:53         ` Michal Hocko
2020-09-16 18:28           ` Vijay Balakrishna
2020-09-17 12:12             ` Michal Hocko
2020-09-17 18:03               ` Vijay Balakrishna
2020-09-18  5:45                 ` Michal Hocko
2020-09-18 15:32                   ` Vijay Balakrishna
2020-09-21  7:00                     ` Michal Hocko
2020-09-15 18:22 ` Pavel Tatashin
2020-09-15 18:22   ` Pavel Tatashin
