linux-kernel.vger.kernel.org archive mirror
* mm: pages are not freed from lru_add_pvecs after process termination
@ 2016-04-27 17:01 Odzioba, Lukasz
  2016-04-27 17:11 ` Dave Hansen
  0 siblings, 1 reply; 25+ messages in thread
From: Odzioba, Lukasz @ 2016-04-27 17:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Shutemov, Kirill, Hansen, Dave, Anaczkowski, Lukasz

Hi,
I've encountered a problem which I'd like to discuss here (tested on 3.10 and 4.5).
While running some workloads we noticed that in the case of an "improper"
application exit (like SIGTERM), quite a bit (a few GBs) of memory is not being
reclaimed after process termination.

Executing echo 1 > /proc/sys/vm/compact_memory makes the memory available again.

This memory is not reclaimed, so the OOM killer will kill a process trying to
allocate memory which technically should be available.
Such behavior is present only when THP is set to [always].
Disabling THP makes the issue invisible to the naked eye.

Importantly, the issue is visible mostly due to the large number of CPUs in the
system (>200), and the amount of missing memory varies with the number of CPUs
(a back-of-the-envelope estimate: with PAGEVEC_SIZE of 14 and 2MB THPs, 288
CPUs can pin up to roughly 288 * 14 * 2MB ~= 8GB in the lru_add pagevecs alone).

This memory does not seem to be accounted anywhere, but I was able to find it
on the per-CPU lru_add_pvec lists thanks to Dave Hansen's suggestion.

Knowing that, I am able to reproduce this problem with much simpler code:
//compile with: gcc repro.c -o repro -fopenmp
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include "omp.h"
int main() {
#pragma omp parallel
{
        size_t size = 55*1000*1000; // tweaked for 288cpus, "leaks" ~3.5GB
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED) // mmap() returns MAP_FAILED, not NULL, on failure
                memset(p, 0, size); // touch the memory so it is actually faulted in

        //munmap(p, size); // uncomment to make the problem go away
}
        return 0;
}


Example execution:
$ numactl -H | grep "node 1" | grep MB
node 1 size: 16122 MB
node 1 free: 16026 MB
$ ./repro
$ numactl -H | grep "node 1" | grep MB
node 1 size: 16122 MB
node 1 free: 13527 MB

After a couple of minutes on an idle system some of this memory is reclaimed,
but never all of it, unless I run tasks on every CPU:
node 1 size: 16122 MB
node 1 free: 14823 MB

Pieces of the puzzle:
A) after process termination memory is not getting freed nor accounted as free
B) memory cannot be allocated by other processes (unless it is allocated by all CPUs)

I am not sure whether this is expected behavior or a side effect of something
else not going as it should. Temporarily I added lru_add_drain_all() to
try_to_free_pages(), which sort of hammers case B, but A is still present.

I am not familiar with this code, but I feel like the lru_add draining work
should be split into smaller pieces and done by kswapd to fix A, and only as
many pages as needed should be drained in try_to_free_pages to fix B.

Any comments/ideas/patches for a proper fix are welcome.

Thanks,
Lukas

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-04-27 17:01 mm: pages are not freed from lru_add_pvecs after process termination Odzioba, Lukasz
@ 2016-04-27 17:11 ` Dave Hansen
  2016-04-28 14:37   ` Michal Hocko
  2016-05-02 14:39   ` Vlastimil Babka
  0 siblings, 2 replies; 25+ messages in thread
From: Dave Hansen @ 2016-04-27 17:11 UTC (permalink / raw)
  To: Odzioba, Lukasz, linux-kernel, linux-mm
  Cc: Shutemov, Kirill, Anaczkowski, Lukasz

On 04/27/2016 10:01 AM, Odzioba, Lukasz wrote:
> Pieces of the puzzle:
> A) after process termination memory is not getting freed nor accounted as free

I don't think this part is necessarily a bug.  As long as we have stats
*somewhere*, and we really do "reclaim" them, I don't think we need to
call these pages "free".

> I am not sure whether this is expected behavior or a side effect of something
> else not going as it should. Temporarily I added lru_add_drain_all() to
> try_to_free_pages(), which sort of hammers case B, but A is still present.

It's not expected behavior.  It's an unanticipated side effect of large
numbers of cpu threads, large pages on the LRU, and (relatively) small
zones.

> I am not familiar with this code, but I feel like the lru_add draining work
> should be split into smaller pieces and done by kswapd to fix A, and only as
> many pages as needed should be drained in try_to_free_pages to fix B.
> 
> Any comments/ideas/patches for a proper fix are welcome.

Here are my suggestions.  I've passed these along multiple times, but I
guess I'll repeat them again for good measure.

> 1. We need some statistics on the number and total *SIZES* of all pages
>    in the lru pagevecs.  It's too opaque now.
> 2. We need to make darn sure we drain the lru pagevecs before failing
>    any kind of allocation.
> 3. We need some way to drain the lru pagevecs directly.  Maybe the buddy
>    pcp lists too.
> 4. We need to make sure that a zone_reclaim_mode=0 system still drains
>    too.
> 5. The VM stats and their updates are now related to how often
>    drain_zone_pages() gets run.  That might be interacting here too.

6. Perhaps don't use the LRU pagevecs for large pages.  It limits the
   severity of the problem.
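
For illustration, a minimal (untested) sketch of the statistic from
suggestion 1 - this assumes the current pagevec layout, and since
lru_add_pvec is static to mm/swap.c the helper would have to live there;
the name is made up:

/*
 * Sketch only: count the base pages cached in one CPU's lru_add
 * pagevec, weighting compound pages by their size.
 */
static unsigned long lru_add_pvec_pages(int cpu)
{
	struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);
	unsigned long nr_pages = 0;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++)
		nr_pages += hpage_nr_pages(pvec->pages[i]);

	return nr_pages;
}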

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-04-27 17:11 ` Dave Hansen
@ 2016-04-28 14:37   ` Michal Hocko
  2016-05-02 13:00     ` Michal Hocko
  2016-05-02 14:39   ` Vlastimil Babka
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2016-04-28 14:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Odzioba, Lukasz, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On Wed 27-04-16 10:11:04, Dave Hansen wrote:
> On 04/27/2016 10:01 AM, Odzioba, Lukasz wrote:
[...]
> > 1. We need some statistics on the number and total *SIZES* of all pages
> >    in the lru pagevecs.  It's too opaque now.
> > 2. We need to make darn sure we drain the lru pagevecs before failing
> >    any kind of allocation.

lru_add_drain_all is unfortunately too costly (especially on large
machines). You are right that failing an allocation with a lot of cached
pages is less than optimal though. So maybe we can do it from the
slow path after the first round of direct reclaim failed to allocate
anything. Something like the following:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dd65d9fb76a..0743c58c2e9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3559,6 +3559,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
+	bool drained_lru = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3667,6 +3668,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	if (!drained_lru) {
+		drained_lru = true;
+		lru_add_drain_all();
+	}
+
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;

The downside would be that we really depend on the WQ to make any
progress here. If we are really out of memory then we are screwed, so
we would need a flush_work_timeout() or something else that would
guarantee a maximum timeout. That something else might be to stop using
the WQ and move the flushing into IRQ context. Not free either, but at
least not dependent on having some memory available to make progress.
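
For illustration, a rough and untested sketch of what such a bounded
flush could look like - flush_work_timeout() doesn't exist, so a
completion signalled from the work item stands in for it (all names
below are made up):

/*
 * Sketch only: drain one CPU's LRU pagevecs via the WQ, but give up
 * waiting after a timeout so we never block indefinitely on a starved
 * workqueue.
 */
struct lru_drain_work {
	struct work_struct work;
	struct completion done;
};

static void lru_drain_fn(struct work_struct *w)
{
	struct lru_drain_work *ldw =
		container_of(w, struct lru_drain_work, work);

	lru_add_drain();
	complete(&ldw->done);
}

static bool lru_add_drain_cpu_timeout(int cpu, unsigned long timeout)
{
	struct lru_drain_work ldw;
	bool flushed;

	INIT_WORK_ONSTACK(&ldw.work, lru_drain_fn);
	init_completion(&ldw.done);
	schedule_work_on(cpu, &ldw.work);

	flushed = wait_for_completion_timeout(&ldw.done, timeout) != 0;
	if (!flushed)
		/* never ran in time: pull the on-stack item back off
		 * the queue before it goes out of scope */
		cancel_work_sync(&ldw.work);
	destroy_work_on_stack(&ldw.work);
	return flushed;
}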

> > 3. We need some way to drain the lru pagevecs directly.  Maybe the buddy
> >    pcp lists too.
> > 4. We need to make sure that a zone_reclaim_mode=0 system still drains
> >    too.
> > 5. The VM stats and their updates are now related to how often
> >    drain_zone_pages() gets run.  That might be interacting here too.
> 
> 6. Perhaps don't use the LRU pagevecs for large pages.  It limits the
>    severity of the problem.

7. Hook into vmstat and flush from there? This would drain them
periodically, but it would also introduce nondeterministic interference.

-- 
Michal Hocko
SUSE Labs

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-04-28 14:37   ` Michal Hocko
@ 2016-05-02 13:00     ` Michal Hocko
  2016-05-04 19:41       ` Odzioba, Lukasz
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2016-05-02 13:00 UTC (permalink / raw)
  To: Odzioba, Lukasz
  Cc: Dave Hansen, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On Thu 28-04-16 16:37:10, Michal Hocko wrote:
[...]
> 7. Hook into vmstat and flush from there? This would drain them
> periodically, but it would also introduce nondeterministic interference.

So I have given this a try (not tested yet) and it doesn't look terribly
complicated. It is hijacking vmstat for a purpose it wasn't intended for
originally, but creating dedicated kernel threads/WQ sounds like
overkill to me. Does this help, or do we have to be more aggressive and
wake up the shepherd from the allocator slow path? Could you give it a try
please?
---
diff --git a/mm/internal.h b/mm/internal.h
index b6ead95a0184..876125bd11f4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -488,4 +488,5 @@ extern const struct trace_print_flags pageflag_names[];
 extern const struct trace_print_flags vmaflag_names[];
 extern const struct trace_print_flags gfpflag_names[];
 
+extern bool pcp_lru_add_need_drain(int cpu);
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/swap.c b/mm/swap.c
index 95916142fc46..3937e6caef96 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -667,6 +667,15 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
 
 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
 
+bool pcp_lru_add_need_drain(int cpu)
+{
+	return pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		    need_activate_page_drain(cpu);
+}
+
 void lru_add_drain_all(void)
 {
 	static DEFINE_MUTEX(lock);
@@ -680,11 +689,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (pcp_lru_add_need_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
 			cpumask_set_cpu(cpu, &has_work);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7397d9548f21..766f751e3467 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -479,6 +479,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
+	/*
+	 * Only drain the LRU pcp caches when refreshing pagesets and when
+	 * needed - the drain might be expensive, we take locks there etc.
+	 */
+	if (do_pagesets && pcp_lru_add_need_drain(smp_processor_id()))
+		lru_add_drain();
+
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset __percpu *p = zone->pageset;
 
@@ -1477,7 +1484,8 @@ static bool need_update(int cpu)
 			return true;
 
 	}
-	return false;
+
+	return pcp_lru_add_need_drain(cpu);
 }
 
 void quiet_vmstat(void)
-- 
Michal Hocko
SUSE Labs

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-04-27 17:11 ` Dave Hansen
  2016-04-28 14:37   ` Michal Hocko
@ 2016-05-02 14:39   ` Vlastimil Babka
  2016-05-02 15:01     ` Kirill A. Shutemov
  1 sibling, 1 reply; 25+ messages in thread
From: Vlastimil Babka @ 2016-05-02 14:39 UTC (permalink / raw)
  To: Dave Hansen, Odzioba, Lukasz, linux-kernel, linux-mm
  Cc: Shutemov, Kirill, Anaczkowski, Lukasz

On 04/27/2016 07:11 PM, Dave Hansen wrote:
> 6. Perhaps don't use the LRU pagevecs for large pages.  It limits the
>     severity of the problem.

I think that makes sense. Being large already amortizes the cost per 
base page much more than pagevecs do (512 vs ~22 pages?).

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-02 14:39   ` Vlastimil Babka
@ 2016-05-02 15:01     ` Kirill A. Shutemov
  2016-05-02 15:13       ` Vlastimil Babka
  2016-05-02 15:49       ` Dave Hansen
  0 siblings, 2 replies; 25+ messages in thread
From: Kirill A. Shutemov @ 2016-05-02 15:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Dave Hansen, Odzioba, Lukasz, linux-kernel, linux-mm, Shutemov,
	Kirill, Anaczkowski, Lukasz

On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> On 04/27/2016 07:11 PM, Dave Hansen wrote:
> >6. Perhaps don't use the LRU pagevecs for large pages.  It limits the
> >    severity of the problem.
> 
> I think that makes sense. Being large already amortizes the cost per base
> page much more than pagevecs do (512 vs ~22 pages?).

We try to do this already, don't we? Any specific case where we have THPs on
pagevecs?

-- 
 Kirill A. Shutemov

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-02 15:01     ` Kirill A. Shutemov
@ 2016-05-02 15:13       ` Vlastimil Babka
  2016-05-02 15:49       ` Dave Hansen
  1 sibling, 0 replies; 25+ messages in thread
From: Vlastimil Babka @ 2016-05-02 15:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Odzioba, Lukasz, linux-kernel, linux-mm, Shutemov,
	Kirill, Anaczkowski, Lukasz

On 05/02/2016 05:01 PM, Kirill A. Shutemov wrote:
> On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
>> On 04/27/2016 07:11 PM, Dave Hansen wrote:
>>> 6. Perhaps don't use the LRU pagevecs for large pages.  It limits the
>>>     severity of the problem.
>>
>> I think that makes sense. Being large already amortizes the cost per base
>> page much more than pagevecs do (512 vs ~22 pages?).
>
> We try to do this already, don't we? Any specific case where we have THPs on
> pagevecs?

For example like this?
__do_huge_pmd_anonymous_page
   lru_cache_add_active_or_unevictable
     lru_cache_add

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-02 15:01     ` Kirill A. Shutemov
  2016-05-02 15:13       ` Vlastimil Babka
@ 2016-05-02 15:49       ` Dave Hansen
  2016-05-02 16:02         ` Kirill A. Shutemov
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2016-05-02 15:49 UTC (permalink / raw)
  To: Kirill A. Shutemov, Vlastimil Babka
  Cc: Odzioba, Lukasz, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
>> On 04/27/2016 07:11 PM, Dave Hansen wrote:
>>> 6. Perhaps don't use the LRU pagevecs for large pages.  It limits the
>>>    severity of the problem.
>>
>> I think that makes sense. Being large already amortizes the cost per base
>> page much more than pagevecs do (512 vs ~22 pages?).
> 
> > We try to do this already, don't we? Any specific case where we have THPs on
> > pagevecs?

Lukas was hitting this on a RHEL 7 era kernel.  In his kernel at least,
I'm pretty sure THP's were ending up on pagevecs.  Are you saying you
don't think we're doing that any more?

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-02 15:49       ` Dave Hansen
@ 2016-05-02 16:02         ` Kirill A. Shutemov
  2016-05-03  7:37           ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Kirill A. Shutemov @ 2016-05-02 16:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Vlastimil Babka, Odzioba, Lukasz, linux-kernel, linux-mm,
	Shutemov, Kirill, Anaczkowski, Lukasz

On Mon, May 02, 2016 at 08:49:03AM -0700, Dave Hansen wrote:
> On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> > On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> >> On 04/27/2016 07:11 PM, Dave Hansen wrote:
> >>> 6. Perhaps don't use the LRU pagevecs for large pages.  It limits the
> >>>    severity of the problem.
> >>
> >> I think that makes sense. Being large already amortizes the cost per base
> >> page much more than pagevecs do (512 vs ~22 pages?).
> > 
> > We try to do this already, don't we? Any specific case where we have THPs on
> > pagevecs?
> 
> Lukas was hitting this on a RHEL 7 era kernel.  In his kernel at least,
> I'm pretty sure THP's were ending up on pagevecs.  Are you saying you
> don't think we're doing that any more?

As Vlastimil pointed out, we do. It needs to be fixed, I think.

Any volunteer? :-P

-- 
 Kirill A. Shutemov

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-02 16:02         ` Kirill A. Shutemov
@ 2016-05-03  7:37           ` Michal Hocko
  2016-05-03 10:07             ` Kirill A. Shutemov
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2016-05-03  7:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Vlastimil Babka, Odzioba, Lukasz, linux-kernel,
	linux-mm, Shutemov, Kirill, Anaczkowski, Lukasz

On Mon 02-05-16 19:02:50, Kirill A. Shutemov wrote:
> On Mon, May 02, 2016 at 08:49:03AM -0700, Dave Hansen wrote:
> > On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> > > On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> > >> On 04/27/2016 07:11 PM, Dave Hansen wrote:
> > >>> 6. Perhaps don't use the LRU pagevecs for large pages.  It limits the
> > >>>    severity of the problem.
> > >>
> > >> I think that makes sense. Being large already amortizes the cost per base
> > >> page much more than pagevecs do (512 vs ~22 pages?).
> > > 
> > > We try to do this already, don't we? Any specific case where we have THPs on
> > > pagevecs?
> > 
> > Lukas was hitting this on a RHEL 7 era kernel.  In his kernel at least,
> > I'm pretty sure THP's were ending up on pagevecs.  Are you saying you
> > don't think we're doing that any more?
> 
> As Vlastimil pointed out, we do. It needs to be fixed, I think.

It seems that offloading the draining to the vmstat context doesn't look
terribly bad. Don't we rather want to go that way?
-- 
Michal Hocko
SUSE Labs

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-03  7:37           ` Michal Hocko
@ 2016-05-03 10:07             ` Kirill A. Shutemov
  0 siblings, 0 replies; 25+ messages in thread
From: Kirill A. Shutemov @ 2016-05-03 10:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, Vlastimil Babka, Odzioba, Lukasz, linux-kernel,
	linux-mm, Shutemov, Kirill, Anaczkowski, Lukasz

On Tue, May 03, 2016 at 09:37:57AM +0200, Michal Hocko wrote:
> On Mon 02-05-16 19:02:50, Kirill A. Shutemov wrote:
> > On Mon, May 02, 2016 at 08:49:03AM -0700, Dave Hansen wrote:
> > > On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> > > > On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> > > >> On 04/27/2016 07:11 PM, Dave Hansen wrote:
> > > >>> 6. Perhaps don't use the LRU pagevecs for large pages.  It limits the
> > > >>>    severity of the problem.
> > > >>
> > > >> I think that makes sense. Being large already amortizes the cost per base
> > > >> page much more than pagevecs do (512 vs ~22 pages?).
> > > > 
> > > > We try to do this already, don't we? Any specific case where we have THPs on
> > > > pagevecs?
> > > 
> > > Lukas was hitting this on a RHEL 7 era kernel.  In his kernel at least,
> > > I'm pretty sure THP's were ending up on pagevecs.  Are you saying you
> > > don't think we're doing that any more?
> > 
> > As Vlastimil pointed out, we do. It needs to be fixed, I think.
> 
> It seems that offloading the draining to the vmstat context doesn't look
> terribly bad. Don't we rather want to go that way?

Maybe. My knowledge of the lru cache is limited.

-- 
 Kirill A. Shutemov

* RE: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-02 13:00     ` Michal Hocko
@ 2016-05-04 19:41       ` Odzioba, Lukasz
  2016-05-04 20:16         ` Dave Hansen
  2016-05-04 20:36         ` Michal Hocko
  0 siblings, 2 replies; 25+ messages in thread
From: Odzioba, Lukasz @ 2016-05-04 19:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hansen, Dave, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On Thu 02-05-16 03:00:00, Michal Hocko wrote:
> So I have given this a try (not tested yet) and it doesn't look terribly
> complicated. It is hijacking vmstat for a purpose it wasn't intended for
> originally, but creating dedicated kernel threads/WQ sounds like
> overkill to me. Does this help, or do we have to be more aggressive and
> wake up the shepherd from the allocator slow path? Could you give it a try
> please?

It seems to work fine, but it takes a rather random time to drain the lists,
sometimes a couple of seconds, sometimes over two minutes. It is acceptable,
I believe.

I have an app which allocates almost all of the memory from a NUMA node, and
with just the second patch, 30-50% of 100 consecutive executions got killed.
After also applying your first patch I haven't seen any OOM kill activity - great.

I was wondering how many lru_add_drain()'s are called: after boot, with the
machine idle, it was a bit over 5k calls during the first 400s, and with some
activity it went up to 15k calls during 700s (including the 5k from the
previous experiment), which sounds fair to me given the big CPU count.

Do you see any advantages of dropping THP from pagevecs over this solution?

Thanks,
Lukas

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-04 19:41       ` Odzioba, Lukasz
@ 2016-05-04 20:16         ` Dave Hansen
  2016-05-04 20:36         ` Michal Hocko
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2016-05-04 20:16 UTC (permalink / raw)
  To: Odzioba, Lukasz, Michal Hocko
  Cc: linux-kernel, linux-mm, Shutemov, Kirill, Anaczkowski, Lukasz

On 05/04/2016 12:41 PM, Odzioba, Lukasz wrote:
> Do you see any advantages of dropping THP from pagevecs over this solution?

It's a more foolproof solution.  Even with this patch, there might still
be some corner cases where the draining doesn't occur.  That "two
minutes" might become 20 or 200 under some circumstances.

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-04 19:41       ` Odzioba, Lukasz
  2016-05-04 20:16         ` Dave Hansen
@ 2016-05-04 20:36         ` Michal Hocko
  2016-05-05  7:21           ` Michal Hocko
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2016-05-04 20:36 UTC (permalink / raw)
  To: Odzioba, Lukasz
  Cc: Hansen, Dave, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On Wed 04-05-16 19:41:59, Odzioba, Lukasz wrote:
> On Thu 02-05-16 03:00:00, Michal Hocko wrote:
> > So I have given this a try (not tested yet) and it doesn't look terribly
> > complicated. It is hijacking vmstat for a purpose it wasn't intended for
> > originally, but creating dedicated kernel threads/WQ sounds like
> > overkill to me. Does this help, or do we have to be more aggressive and
> > wake up the shepherd from the allocator slow path? Could you give it a try
> > please?
> 
> It seems to work fine, but it takes a rather random time to drain the lists,
> sometimes a couple of seconds, sometimes over two minutes. It is acceptable,
> I believe.

I guess you mean that some CPUs are not drained for a few minutes, right?
This might be quite long, and I deliberately did not tie the LRU drain to
the idle entry because I felt it would be too expensive. Maybe it would be
better to kick the vmstat_shepherd from the allocator slow path. It would
still take an unpredictable amount of time, but it would at least be
called when we are getting short on memory.
 
> I have an app which allocates almost all of the memory from a NUMA node, and
> with just the second patch, 30-50% of 100 consecutive executions got killed.

This is still not acceptable. So I guess we need a way to kick
vmstat_shepherd from the reclaim path. I will think about that. Sounds a
bit tricky at first sight.

> After also applying your first patch I haven't seen any OOM kill
> activity - great.

As I've said, the first patch is quite dangerous as it depends on the WQ
to make forward progress, which might itself depend on a memory allocation
to create a new worker.
 
> I was wondering how many lru_add_drain()'s are called: after boot, with the
> machine idle, it was a bit over 5k calls during the first 400s, and with some
> activity it went up to 15k calls during 700s (including the 5k from the
> previous experiment), which sounds fair to me given the big CPU count.
> 
> Do you see any advantages of dropping THP from pagevecs over this
> solution?

Well, the general purpose of pcp pagevecs is to reduce the lru_lock
contention. I have never measured the effect for THP pages. It is true
that THP amortizes the contention by the number of pages handled at once,
so it might be the easiest way (and certainly more acceptable for an old
kernel, which you seem to be running, as mentioned by Dave), but it sounds
too special cased and I would rather see less special casing for THP. So
if the async pcp sync is not too tricky or hard to maintain, and works, I
would rather go that way.

Thanks for testing those patches!
-- 
Michal Hocko
SUSE Labs

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-04 20:36         ` Michal Hocko
@ 2016-05-05  7:21           ` Michal Hocko
  2016-05-05 17:25             ` Odzioba, Lukasz
  2016-05-06 15:10             ` Odzioba, Lukasz
  0 siblings, 2 replies; 25+ messages in thread
From: Michal Hocko @ 2016-05-05  7:21 UTC (permalink / raw)
  To: Odzioba, Lukasz
  Cc: Hansen, Dave, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On Wed 04-05-16 22:36:43, Michal Hocko wrote:
> On Wed 04-05-16 19:41:59, Odzioba, Lukasz wrote:
[...]
> > I have an app which allocates almost all of the memory from a NUMA node, and
> > with just the second patch, 30-50% of 100 consecutive executions got killed.
> 
> This is still not acceptable. So I guess we need a way to kick
> vmstat_shepherd from the reclaim path. I will think about that. Sounds a
> bit tricky at first sight.

OK, it wasn't that tricky after all. Maybe I have missed something but
the following should work. Or maybe the async nature of flushing turns
out to be just impractical and unreliable and we will end up skipping
THP (or all compound pages) for pcp LRU add cache. Let's see...
---
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 0aa613df463e..7f2c1aef6a09 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -274,4 +274,5 @@ static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
 
 extern const char * const vmstat_text[];
 
+extern void kick_vmstat_update(void);
 #endif /* _LINUX_VMSTAT_H */
diff --git a/mm/internal.h b/mm/internal.h
index b6ead95a0184..876125bd11f4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -488,4 +488,5 @@ extern const struct trace_print_flags pageflag_names[];
 extern const struct trace_print_flags vmaflag_names[];
 extern const struct trace_print_flags gfpflag_names[];
 
+extern bool pcp_lru_add_need_drain(int cpu);
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 056baf55a88d..5ca829e707f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3556,6 +3556,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
+	bool vmstat_updated = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3658,6 +3659,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (order && compaction_made_progress(compact_result))
 		compaction_retries++;
 
+	if (!vmstat_updated) {
+		vmstat_updated = true;
+		kick_vmstat_update();
+	}
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
 							&did_some_progress);
diff --git a/mm/swap.c b/mm/swap.c
index 95916142fc46..3937e6caef96 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -667,6 +667,15 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
 
 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
 
+bool pcp_lru_add_need_drain(int cpu)
+{
+	return pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		    need_activate_page_drain(cpu);
+}
+
 void lru_add_drain_all(void)
 {
 	static DEFINE_MUTEX(lock);
@@ -680,11 +689,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (pcp_lru_add_need_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
 			cpumask_set_cpu(cpu, &has_work);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7397d9548f21..cf4b095ace1c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -479,6 +479,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
+	/*
+	 * Only drain the LRU pcp caches when refreshing pagesets and when
+	 * needed - the drain might be expensive, we take locks there etc.
+	 */
+	if (do_pagesets && pcp_lru_add_need_drain(smp_processor_id()))
+		lru_add_drain();
+
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset __percpu *p = zone->pageset;
 
@@ -1477,7 +1484,8 @@ static bool need_update(int cpu)
 			return true;
 
 	}
-	return false;
+
+	return pcp_lru_add_need_drain(cpu);
 }
 
 void quiet_vmstat(void)
@@ -1542,6 +1550,16 @@ static void vmstat_shepherd(struct work_struct *w)
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
+void kick_vmstat_update(void)
+{
+#ifdef CONFIG_SMP
+	might_sleep();
+
+	if (cancel_delayed_work(&shepherd))
+		vmstat_shepherd(&shepherd.work);
+#endif
+}
+
 static void __init start_shepherd_timer(void)
 {
 	int cpu;
-- 
Michal Hocko
SUSE Labs

* RE: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-05  7:21           ` Michal Hocko
@ 2016-05-05 17:25             ` Odzioba, Lukasz
  2016-05-11  7:38               ` Michal Hocko
  2016-05-06 15:10             ` Odzioba, Lukasz
  1 sibling, 1 reply; 25+ messages in thread
From: Odzioba, Lukasz @ 2016-05-05 17:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hansen, Dave, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On Thu 05-05-16 09:21:00, Michal Hocko wrote: 
> OK, it wasn't that tricky after all. Maybe I have missed something but
> the following should work. Or maybe the async nature of flushing turns
> out to be just impractical and unreliable and we will end up skipping
> THP (or all compound pages) for pcp LRU add cache. Let's see...

Initially this issue was found on RH's 3.10.x kernel, but now I am using
4.6-rc6.

Overall it does help, and under heavy load it is slightly better than the
second patch. Unfortunately I am still able to hit 10-20% OOM kills with it
(down from 30-50%), partially due to the earlier vmstat_update call
 - it went up to 25-25% with this patch below:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b4359f8..7a5ab0d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3264,17 +3264,17 @@ retry:
        if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
                migration_mode = MIGRATE_SYNC_LIGHT;

-       if(!vmstat_updated) {
-               vmstat_updated = true;
-               kick_vmstat_update();
-       }
-
        /* Try direct reclaim and then allocating */
        page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                                        &did_some_progress);
        if (page)
                goto got_pg;

+       if(!vmstat_updated) {
+               vmstat_updated = true;
+               kick_vmstat_update();
+       }

I don't quite see a non-invasive way to make sure that we drain all pvecs
before failing an allocation, and doing it asynchronously will race with
allocations anyway - I guess.

Thanks,
Lukas

* RE: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-05  7:21           ` Michal Hocko
  2016-05-05 17:25             ` Odzioba, Lukasz
@ 2016-05-06 15:10             ` Odzioba, Lukasz
  2016-05-06 16:04               ` Dave Hansen
  1 sibling, 1 reply; 25+ messages in thread
From: Odzioba, Lukasz @ 2016-05-06 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hansen, Dave, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On Thu 05-05-16 09:21:00, Michal Hocko wrote: 
> Or maybe the async nature of flushing turns
> out to be just impractical and unreliable and we will end up skipping
> THP (or all compound pages) for pcp LRU add cache. Let's see...

What if we simply skip lru_add pvecs for compound pages?
That way we still have compound pages on the LRUs, but the problem goes
away.  It is not quite what this naïve patch does, but it works nicely for me.

diff --git a/mm/swap.c b/mm/swap.c
index 03aacbc..c75d5e1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
        get_page(page);
        if (!pagevec_space(pvec))
                __pagevec_lru_add(pvec);
        pagevec_add(pvec, page);
+       if (PageCompound(page))
+               __pagevec_lru_add(pvec);
        put_cpu_var(lru_add_pvec);
 }

Do we have any tests that I could use to measure the performance impact
of such changes before I start to tweak it? Or maybe it doesn't make
sense at all?

Thanks,
Lukas

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-06 15:10             ` Odzioba, Lukasz
@ 2016-05-06 16:04               ` Dave Hansen
  2016-05-11  7:53                 ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2016-05-06 16:04 UTC (permalink / raw)
  To: Odzioba, Lukasz, Michal Hocko
  Cc: linux-kernel, linux-mm, Shutemov, Kirill, Anaczkowski, Lukasz,
	Shutemov, Kirill

On 05/06/2016 08:10 AM, Odzioba, Lukasz wrote:
> On Thu 05-05-16 09:21:00, Michal Hocko wrote: 
>> Or maybe the async nature of flushing turns
>> out to be just impractical and unreliable and we will end up skipping
>> THP (or all compound pages) for pcp LRU add cache. Let's see...
> 
> What if we simply skip lru_add pvecs for compound pages?
> That way we still have compound pages on the LRUs, but the problem goes
> away.  It is not quite what this naïve patch does, but it works nicely for me.
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 03aacbc..c75d5e1 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
>         get_page(page);
>         if (!pagevec_space(pvec))
>                 __pagevec_lru_add(pvec);
>         pagevec_add(pvec, page);
> +       if (PageCompound(page))
> +               __pagevec_lru_add(pvec);
>         put_cpu_var(lru_add_pvec);
>  }

That's not _quite_ what I had in mind since that drains the entire pvec
every time a large page is encountered.  But I'm conflicted about what
the right behavior _is_.

We'd be taking the LRU lock for 'page' anyway, so we might as well drain
the pvec.

Or, does the additional work to put the page on to a pvec and then
immediately drain it overwhelm that advantage?

Or does it just not matter?

Kirill, do you have a suggestion for how we should be checking for THP
pages in code like this?  PageCompound() will surely _work_ for anon-THP
and your file-THP, but is it the best way to check?

> Do we have any tests that I could use to measure the performance impact
> of such changes before I start to tweak it? Or maybe it doesn't make
> sense at all?

You probably want to very carefully calculate the time to fault a page,
then separately to free a page.  If we can't manage to detect a delta on
a little microbenchmark like that then we'll probably never see one in
practice.

You'll want to measure the fault time for 4k pages, 2M pages, and then
possibly a mix.

You'll want to do this in a highly parallel test to make sure any
additional LRU lock overhead shows up.
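
For instance, the fault-timing part could look roughly like this
(untested sketch, single-threaded for clarity; the parallel variant
would wrap the same loop in OpenMP like the reproducer does, and the
munmap() side can be timed the same way):

//compile with: gcc -O2 fault_bench.c -o fault_bench (add -lrt on older glibc)
#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define REGION (2UL << 20)      /* one touch per 2MB region */
#define NR 1024UL               /* 2GB in total */

int main(void)
{
        struct timespec a, b;
        long long total_ns = 0;
        unsigned long i;
        char *p = mmap(NULL, NR * REGION, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        madvise(p, NR * REGION, MADV_HUGEPAGE); /* ask for THP backing */
        for (i = 0; i < NR; i++) {
                clock_gettime(CLOCK_MONOTONIC, &a);
                p[i * REGION] = 1;      /* first touch faults the region in */
                clock_gettime(CLOCK_MONOTONIC, &b);
                total_ns += (b.tv_sec - a.tv_sec) * 1000000000LL +
                            (b.tv_nsec - a.tv_nsec);
        }
        printf("avg first-touch fault: %lld ns\n", total_ns / NR);
        munmap(p, NR * REGION);
        return 0;
}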

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-05 17:25             ` Odzioba, Lukasz
@ 2016-05-11  7:38               ` Michal Hocko
  0 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2016-05-11  7:38 UTC (permalink / raw)
  To: Odzioba, Lukasz
  Cc: Hansen, Dave, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On Thu 05-05-16 17:25:07, Odzioba, Lukasz wrote:
> On Thu 05-05-16 09:21:00, Michal Hocko wrote: 
> > OK, it wasn't that tricky after all. Maybe I have missed something but
> > the following should work. Or maybe the async nature of flushing turns
> > out to be just impractical and unreliable and we will end up skipping
> > THP (or all compound pages) for pcp LRU add cache. Let's see...
> 
> Initially this issue was found on RH's 3.10.x kernel, but now I am using 
> 4.6-rc6.
> 
> Overall it does help, and under heavy load it is slightly better than the
> second patch. Unfortunately I am still able to hit 10-20% OOM kills with it
> (down from 30-50%), partially due to the earlier vmstat_update call
>  - it went up to 25-25% with this patch below:

This simply shows that this is not a viable option. So I guess we really
want to skip THP (compound pages) for the LRU add pcp cache. Thanks
for your effort and testing!
-- 
Michal Hocko
SUSE Labs

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-06 16:04               ` Dave Hansen
@ 2016-05-11  7:53                 ` Michal Hocko
  2016-05-13 11:29                   ` Vlastimil Babka
                                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Michal Hocko @ 2016-05-11  7:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Odzioba, Lukasz, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On Fri 06-05-16 09:04:34, Dave Hansen wrote:
> On 05/06/2016 08:10 AM, Odzioba, Lukasz wrote:
> > On Thu 05-05-16 09:21:00, Michal Hocko wrote: 
> >> Or maybe the async nature of flushing turns
> >> out to be just impractical and unreliable and we will end up skipping
> >> THP (or all compound pages) for pcp LRU add cache. Let's see...
> > 
> > What if we simply skip lru_add pvecs for compound pages?
> > That way we still have compound pages on the LRUs, but the problem goes
> > away.  It is not quite what this naïve patch does, but it works nicely for me.
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 03aacbc..c75d5e1 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
> >         get_page(page);
> >         if (!pagevec_space(pvec))
> >                 __pagevec_lru_add(pvec);
> >         pagevec_add(pvec, page);
> > +       if (PageCompound(page))
> > +               __pagevec_lru_add(pvec);
> >         put_cpu_var(lru_add_pvec);
> >  }
> 
> That's not _quite_ what I had in mind since that drains the entire pvec
> every time a large page is encountered.  But I'm conflicted about what
> the right behavior _is_.
> 
> We'd be taking the LRU lock for 'page' anyway, so we might as well drain
> the pvec.

Yes I think this makes sense. The only case where it would be suboptimal
is when the pagevec was already full and then we just created a single
page pvec to drain it. This can be handled better though by:

diff --git a/mm/swap.c b/mm/swap.c
index 95916142fc46..3fe4f180e8bf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
 	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
 
 	get_page(page);
-	if (!pagevec_space(pvec))
+	if (!pagevec_add(pvec, page) || PageCompound(page))
 		__pagevec_lru_add(pvec);
-	pagevec_add(pvec, page);
 	put_cpu_var(lru_add_pvec);
 }
 

> Or, does the additional work to put the page on to a pvec and then
> immediately drain it overwhelm that advantage?

pagevec_add is quite trivial so I would be really surprised if it
mattered.
-- 
Michal Hocko
SUSE Labs

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-11  7:53                 ` Michal Hocko
@ 2016-05-13 11:29                   ` Vlastimil Babka
  2016-05-13 12:05                   ` Odzioba, Lukasz
  2016-06-07  9:02                   ` Odzioba, Lukasz
  2 siblings, 0 replies; 25+ messages in thread
From: Vlastimil Babka @ 2016-05-13 11:29 UTC (permalink / raw)
  To: Michal Hocko, Dave Hansen
  Cc: Odzioba, Lukasz, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On 05/11/2016 09:53 AM, Michal Hocko wrote:
> On Fri 06-05-16 09:04:34, Dave Hansen wrote:
>> On 05/06/2016 08:10 AM, Odzioba, Lukasz wrote:
>>> On Thu 05-05-16 09:21:00, Michal Hocko wrote:
>>>> Or maybe the async nature of flushing turns
>>>> out to be just impractical and unreliable and we will end up skipping
>>>> THP (or all compound pages) for pcp LRU add cache. Let's see...
>>>
>>> What if we simply skip lru_add pvecs for compound pages?
>>> That way we still have compound pages on the LRUs, but the problem goes
>>> away.  It is not quite what this naïve patch does, but it works nicely for me.
>>>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 03aacbc..c75d5e1 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
>>>          get_page(page);
>>>          if (!pagevec_space(pvec))
>>>                  __pagevec_lru_add(pvec);
>>>          pagevec_add(pvec, page);
>>> +       if (PageCompound(page))
>>> +               __pagevec_lru_add(pvec);
>>>          put_cpu_var(lru_add_pvec);
>>>   }
>>
>> That's not _quite_ what I had in mind since that drains the entire pvec
>> every time a large page is encountered.  But I'm conflicted about what
>> the right behavior _is_.
>>
>> We'd be taking the LRU lock for 'page' anyway, so we might as well drain
>> the pvec.

Note that pages in the pagevec can come from different zones, so this is 
not universally true.

>
> Yes I think this makes sense. The only case where it would be suboptimal
> is when the pagevec was already full and then we just created a single
> page pvec to drain it. This can be handled better though by:
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 95916142fc46..3fe4f180e8bf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
>   	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>
>   	get_page(page);
> -	if (!pagevec_space(pvec))
> +	if (!pagevec_add(pvec, page) || PageCompound(page))
>   		__pagevec_lru_add(pvec);
> -	pagevec_add(pvec, page);
>   	put_cpu_var(lru_add_pvec);
>   }

Yeah, that could work. There might be more complex solutions at the level
of lru_cache_add_active_or_unevictable(), where we call it either from
base page code (mm/memory.c) or from functions in mm/huge_memory.c. We could
redirect it at that point, but it's likely not worth the trouble unless this
simple solution shows some performance regression...

>> Or, does the additional work to put the page on to a pvec and then
>> immediately drain it overwhelm that advantage?
>
> pagevec_add is quite trivial so I would be really surprised if it
> mattered.
>

* RE: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-11  7:53                 ` Michal Hocko
  2016-05-13 11:29                   ` Vlastimil Babka
@ 2016-05-13 12:05                   ` Odzioba, Lukasz
  2016-06-07  9:02                   ` Odzioba, Lukasz
  2 siblings, 0 replies; 25+ messages in thread
From: Odzioba, Lukasz @ 2016-05-13 12:05 UTC (permalink / raw)
  To: Michal Hocko, Hansen, Dave
  Cc: linux-kernel, linux-mm, Shutemov, Kirill, Anaczkowski, Lukasz

On Wed 05-11-16 09:53:00, Michal Hocko wrote:
> Yes I think this makes sense. The only case where it would be suboptimal
> is when the pagevec was already full and then we just created a single
> page pvec to drain it. This can be handled better though by:
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 95916142fc46..3fe4f180e8bf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
> 	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>  
> 	get_page(page);
>-	if (!pagevec_space(pvec))
>+	if (!pagevec_add(pvec, page) || PageCompound(page))
> 		__pagevec_lru_add(pvec);
>-	pagevec_add(pvec, page);
> 	put_cpu_var(lru_add_pvec);
> }
 
Oh yeah, that's exactly what I meant; I couldn't find such an elegant way of
handling this special case and didn't want to obscure the idea.

I'll do the tests proposed by Dave and be back here with results next week.

Thank you guys for your involvement,
Lukas

* RE: mm: pages are not freed from lru_add_pvecs after process termination
  2016-05-11  7:53                 ` Michal Hocko
  2016-05-13 11:29                   ` Vlastimil Babka
  2016-05-13 12:05                   ` Odzioba, Lukasz
@ 2016-06-07  9:02                   ` Odzioba, Lukasz
  2016-06-07 11:19                     ` Michal Hocko
  2 siblings, 1 reply; 25+ messages in thread
From: Odzioba, Lukasz @ 2016-06-07  9:02 UTC (permalink / raw)
  To: Michal Hocko, Hansen, Dave
  Cc: linux-kernel, linux-mm, Shutemov, Kirill, Anaczkowski, Lukasz

On Wed 05-11-16 09:53:00, Michal Hocko wrote:
> Yes I think this makes sense. The only case where it would be suboptimal
> is when the pagevec was already full and then we just created a single
> page pvec to drain it. This can be handled better though by:
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 95916142fc46..3fe4f180e8bf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
> 	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
> 
> 	get_page(page);
>-	if (!pagevec_space(pvec))
>+	if (!pagevec_add(pvec, page) || PageCompound(page))
> 		__pagevec_lru_add(pvec);
>-	pagevec_add(pvec, page);
> 	put_cpu_var(lru_add_pvec);
> }

It's been a while, but I am back with some results.
For 2M and 4K pages I wrote a simple app which mmaps and unmaps a lot of
memory (60GB/288CPU) in parallel, and does it ten times to get rid of some
OS/threading overhead. Then I created an app which mixes the two page sizes
in a sort of pseudo-random way. I executed those 10 times under the "time"
command (once with THP=on and once with THP=off) and calculated the sum,
min, max and avg of sys, real and user time, which was necessary due to the
significant variance in the results.

Overall it seems that this change has no negative impact on performance:
4K  THP=on,off -> no significant change
2M  THP=on,off -> it might be a tiny bit slower, but still within measurement error
MIX THP=on,off -> no significant change

If you have any concerns about test correctness please let me know.
Below I added test applications and test results.

Thanks,
Lukas
	
------------------------------------------------------------------

//compile with: gcc bench.c -o bench_2M -fopenmp
//compile with: gcc -D SMALL_PAGES bench.c -o bench_4K -fopenmp
#include <stdio.h>
#include <sys/mman.h>
#include <omp.h>

#define MAP_HUGE_SHIFT  26
#define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)

#ifndef SMALL_PAGES
#define PAGE_SIZE (1024*1024*2)
#define MAP_PARAM (MAP_HUGE_2MB)
#else
#define PAGE_SIZE (1024*4)
#define MAP_PARAM (0)
#endif

int main() {
        size_t size = ((60 * 1000 * 1000) / 288) * 1000; // 60GBs of memory 288 CPUs
        #pragma omp parallel
        {
        unsigned int k;
        for (k = 0; k < 10; k++) {
                void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON | MAP_PARAM, -1, 0);
                        if (p != MAP_FAILED) {
                                char *cp = (char*)p;
                                size_t i;
                                for (i = 0; i < size / PAGE_SIZE; i++) {
                                        *cp = 0;
                                        cp += PAGE_SIZE;
                                }
                                munmap(p, size);
                        }
        }
        }
}

//compile with: gcc bench_mixed.c -o bench_mixed -fopenmp
#include <stdio.h>
#include <sys/mman.h>
#include <omp.h>
#define SMALL_PAGE (1024*4)
#define HUGE_PAGE (1024*4) /* note: the touch stride is 4K even for the 2MB mappings */
#define MAP_HUGE_SHIFT  26
#define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
int main() {
        size_t size = ((60 * 1000 * 1000) / 288) * 1000; // 60GBs of memory 288 CPUs
        #pragma omp parallel
        {
        unsigned int k, MAP_PARAM = 0;
        unsigned int PAGE_SIZE = SMALL_PAGE;
        for (k = 0; k < 10; k++) {
                if ((k + omp_get_thread_num()) % 2) {
                        MAP_PARAM = MAP_HUGE_2MB;
                        PAGE_SIZE = HUGE_PAGE;
                }
                void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON | MAP_PARAM, -1, 0);
                        if (p != MAP_FAILED) {
                                char *cp = (char*)p;
                                size_t i;
                                for (i = 0; i < size / PAGE_SIZE; i++) {
                                        *cp = 0;
                                        cp += PAGE_SIZE;
                                }
                                munmap(p, size);
                        }
        }
        }
}



*******************************

######### 4K THP=ON############
###real  unpatched   patched###
sum = 428.737s sum = 421.339s
min = 41.187s min = 41.492s
max = 44.948s max = 42.822s
avg = 42.874s avg = 42.134s

###user  unpatched   patched###
sum = 145.241s sum = 147.283s
min = 13.760s min = 14.418s
max = 15.532s max = 15.201s
avg = 14.524s avg = 14.728s

###sys  unpatched   patched###
sum = 4882.708s sum = 5020.581s
min = 441.922s min = 490.516s
max = 535.294s max = 532.137s
avg = 488.271s avg = 502.058s

######### 4K THP=OFF###########
###real  unpatched   patched###
sum = 2149.288s sum = 2144.336s
min = 214.589s min = 212.642s
max = 215.937s max = 215.579s
avg = 214.929s avg = 214.434s

###user  unpatched   patched###
sum = 858.659s sum = 858.166s
min = 81.655s min = 82.084s
max = 87.790s max = 88.649s
avg = 85.866s avg = 85.817s

###sys  unpatched   patched###
sum = 32357.867s sum = 31126.183s
min = 2952.685s min = 2783.157s
max = 3442.004s max = 3406.730s
avg = 3235.787s avg = 3112.618s

*******************************

######### 2M THP=ON############
###real  unpatched   patched###
sum = 497.032s sum = 500.115s
min = 48.840s min = 49.529s
max = 50.731s max = 50.698s
avg = 49.703s avg = 50.011s

###user  unpatched   patched###
sum = 56.536s sum = 59.286s
min = 5.021s min = 5.014s
max = 7.465s max = 8.865s
avg = 5.654s avg = 5.929s

###sys  unpatched   patched###
sum = 4187.996s sum = 4450.088s
min = 391.334s min = 406.223s
max = 453.087s max = 530.787s
avg = 418.800s avg = 445.009s

######### 2M THP=OFF###########
###real  unpatched   patched###
sum = 54.698s sum = 53.383s
min = 5.196s min = 4.802s
max = 5.707s max = 5.639s
avg = 5.470s avg = 5.338s

###user  unpatched   patched###
sum = 55.567s sum = 60.980s
min = 4.625s min = 4.745s
max = 6.860s max = 6.727s
avg = 5.557s avg = 6.098s

###sys  unpatched   patched###
sum = 215.267s sum = 215.924s
min = 21.194s min = 20.139s
max = 21.946s max = 22.724s
avg = 21.527s avg = 21.592s

*******************************

#######MIXED THP=OFF###########
###real  unpatched   patched###
sum = 2146.501s sum = 2145.591s
min = 211.727s min = 211.757s
max = 216.011s max = 215.340s
avg = 214.650s avg = 214.559s

###user  unpatched   patched###
sum = 895.243s sum = 909.778s
min = 87.540s min = 87.862s
max = 91.340s max = 94.337s
avg = 89.524s avg = 90.978s

###sys  unpatched   patched###
sum = 31916.377s sum = 30965.023s
min = 2988.592s min = 2878.047s
max = 3581.066s max = 3270.986s
avg = 3191.638s avg = 3096.502s
#######MIXED THP=ON###########
###real  unpatched   patched###
sum = 440.068s sum = 431.539s
min = 41.317s min = 41.860s
max = 58.752s max = 47.080s
avg = 44.007s avg = 43.154s

###user  unpatched   patched###
sum = 153.703s sum = 151.004s
min = 14.395s min = 14.210s
max = 16.778s max = 16.484s
avg = 15.370s avg = 15.100s

###sys  unpatched   patched###
sum = 4945.824s sum = 4957.661s
min = 459.862s min = 469.810s
max = 514.161s max = 526.257s
avg = 494.582s avg = 495.766s

* Re: mm: pages are not freed from lru_add_pvecs after process termination
  2016-06-07  9:02                   ` Odzioba, Lukasz
@ 2016-06-07 11:19                     ` Michal Hocko
  2016-06-08  8:51                       ` Odzioba, Lukasz
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2016-06-07 11:19 UTC (permalink / raw)
  To: Odzioba, Lukasz
  Cc: Hansen, Dave, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On Tue 07-06-16 09:02:02, Odzioba, Lukasz wrote:
[...]
> //compile with: gcc bench.c -o bench_2M -fopenmp
> //compile with: gcc -D SMALL_PAGES bench.c -o bench_4K -fopenmp
> #include <stdio.h>
> #include <sys/mman.h>
> #include <omp.h>
> 
> #define MAP_HUGE_SHIFT  26
> #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
> 
> #ifndef SMALL_PAGES
> #define PAGE_SIZE (1024*1024*2)
> #define MAP_PARAM (MAP_HUGE_2MB)

Isn't MAP_HUGE_2MB ignored for !hugetlb pages?

> #else
> #define PAGE_SIZE (1024*4)
> #define MAP_PARAM (0)
> #endif
> 
> void main() {
>         size_t size = ((60 * 1000 * 1000) / 288) * 1000; // 60GBs of memory 288 CPUs
>         #pragma omp parallel
>         {
>         unsigned int k;
>         for (k = 0; k < 10; k++) {
>                 void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON | MAP_PARAM, -1, 0);

I guess you want something like posix_memalign, or to start faulting in from
an aligned address, to guarantee that you will fault in 2MB pages. Also note
that the default behavior for THP during the fault has changed recently (see
444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
stall-free defrag option")) so you might need MADV_HUGEPAGE.
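
E.g. something along these lines (untested sketch; assumes the requested
size is a multiple of 2MB) to get an aligned region that THP can
actually back:

#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_2M (2UL << 20)

/* Sketch: over-allocate, trim to 2MB alignment, then opt in to THP. */
static void *mmap_aligned_thp(size_t size)
{
        size_t len = size + HPAGE_2M;
        char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *p;

        if (raw == MAP_FAILED)
                return NULL;
        p = (char *)(((uintptr_t)raw + HPAGE_2M - 1) & ~(HPAGE_2M - 1));
        if (p > raw)                    /* drop the unaligned head ... */
                munmap(raw, p - raw);
        if (p + size < raw + len)       /* ... and the tail */
                munmap(p + size, (raw + len) - (p + size));
        madvise(p, size, MADV_HUGEPAGE);        /* see 444eb2a449ef */
        return p;
}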

Besides that I am really suspicious that this will be measurable at all.
I would just go and spin a patch assuming you are still able to trigger
OOM with the vanilla kernel. The bug fix is more important...
-- 
Michal Hocko
SUSE Labs

* RE: mm: pages are not freed from lru_add_pvecs after process termination
  2016-06-07 11:19                     ` Michal Hocko
@ 2016-06-08  8:51                       ` Odzioba, Lukasz
  0 siblings, 0 replies; 25+ messages in thread
From: Odzioba, Lukasz @ 2016-06-08  8:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hansen, Dave, linux-kernel, linux-mm, Shutemov, Kirill,
	Anaczkowski, Lukasz

On Tue 07-06-16 13:20:00, Michal Hocko wrote:
> I guess you want something like posix_memalign or start faulting in from
> an aligned address to guarantee you will fault 2MB pages. 

Good catch.

> Besides that I am really suspicious that this will be measurable at all.
> I would just go and spin a patch assuming you are still able to trigger
> OOM with the vanilla kernel. 

Yes, I am still able to trigger OOM. The tests I did are more sanity
checks than benchmarks; lru_cache_add takes very little time, so they
were rather meant to look for unexpected side effects.

Thanks,
Lukas
