From: Michal Hocko <mhocko@kernel.org>
To: Dave Hansen <dave.hansen@intel.com>
Cc: "Odzioba, Lukasz" <lukasz.odzioba@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"Shutemov, Kirill" <kirill.shutemov@intel.com>,
	"Anaczkowski, Lukasz" <lukasz.anaczkowski@intel.com>
Subject: Re: mm: pages are not freed from lru_add_pvecs after process termination
Date: Thu, 28 Apr 2016 16:37:10 +0200
Message-ID: <20160428143710.GC31496@dhcp22.suse.cz>
In-Reply-To: <5720F2A8.6070406@intel.com>

On Wed 27-04-16 10:11:04, Dave Hansen wrote:
> On 04/27/2016 10:01 AM, Odzioba, Lukasz wrote:
[...]
> > 1. We need some statistics on the number and total *SIZES* of all pages
> >    in the lru pagevecs.  It's too opaque now.
> > 2. We need to make darn sure we drain the lru pagevecs before failing
> >    any kind of allocation.
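
As for 1, something like the following would be enough for debugging
purposes. This is only a sketch (racy without any locking, and it assumes
it sits in mm/swap.c where lru_add_pvec is visible), but it shows what
the accounting could look like:

/*
 * Sum up the pages currently sitting on the per-cpu lru_add pagevecs.
 * Racy without locking, so the result is only approximate. Compound
 * pages are what makes the number interesting, hence hpage_nr_pages.
 */
static unsigned long lru_add_pvecs_pages(void)
{
	unsigned long nr_pages = 0;
	int cpu;

	for_each_online_cpu(cpu) {
		struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);
		int i;

		for (i = 0; i < pagevec_count(pvec); i++)
			nr_pages += hpage_nr_pages(pvec->pages[i]);
	}
	return nr_pages;
}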

As for 2, lru_add_drain_all is unfortunately too costly (especially on
large machines). You are right, though, that failing an allocation
while a lot of pages are still sitting in the pagevecs is less than
optimal. So maybe we can do the drain from the slow path, after the
first round of direct reclaim has failed to allocate anything.
Something like the following:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dd65d9fb76a..0743c58c2e9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3559,6 +3559,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
+	bool drained_lru = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3667,6 +3668,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	/* Direct reclaim failed once; flush whatever sits on the lru pagevecs */
+	if (!drained_lru) {
+		drained_lru = true;
+		lru_add_drain_all();
+	}
+
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;

The downside is that we would then depend on the WQ to make any
progress here. If we are really out of memory then we are screwed, so
we would need a flush_work_timeout() or something else that would
guarantee a maximum timeout. That something else might be to stop using
the WQ altogether and move the flushing into IRQ context. That is not
free either, but at least it does not depend on having some memory
available to make progress.
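
Just to illustrate the timeout idea, a sketch of the mechanism for a
single work item (lru_add_drain_all schedules one work per cpu, so the
real thing would need a barrier per cpu; all the naming here is made
up):

struct drain_barrier {
	struct work_struct work;
	struct completion done;
};

static void drain_barrier_fn(struct work_struct *work)
{
	struct drain_barrier *barrier = container_of(work,
					struct drain_barrier, work);

	complete(&barrier->done);
}

/* Wait for the barrier work with a timeout instead of flush_work. */
static bool flush_drain_timeout(unsigned long timeout)
{
	struct drain_barrier barrier;
	bool done;

	INIT_WORK_ONSTACK(&barrier.work, drain_barrier_fn);
	init_completion(&barrier.done);
	schedule_work(&barrier.work);

	done = wait_for_completion_timeout(&barrier.done, timeout) != 0;
	/*
	 * On timeout the work is still pending and points to our stack,
	 * so it has to be cancelled before we can return - which is
	 * exactly where this gets ugly.
	 */
	if (!done)
		cancel_work_sync(&barrier.work);
	destroy_work_on_stack(&barrier.work);
	return done;
}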

> > 3. We need some way to drain the lru pagevecs directly.  Maybe the buddy
> >    pcp lists too.
> > 4. We need to make sure that a zone_reclaim_mode=0 system still drains
> >    too.
> > 5. The VM stats and their updates are now related to how often
> >    drain_zone_pages() gets run.  That might be interacting here too.
> 
> 6. Perhaps don't use the LRU pagevecs for large pages.  It limits the
>    severity of the problem.
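
For 6, a minimal and completely untested sketch based on the current
__lru_cache_add would be to flush the pagevec as soon as a compound
page is added to it:

static void __lru_cache_add(struct page *page)
{
	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

	get_page(page);
	/*
	 * Flush eagerly when the pagevec fills up or when a compound
	 * page is added, so huge pages never sit on the per-cpu
	 * pagevecs where nothing accounts for or reclaims them.
	 */
	if (!pagevec_add(pvec, page) || PageCompound(page))
		__pagevec_lru_add(pvec);
	put_cpu_var(lru_add_pvec);
}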

7. Hook into vmstat and flush from there? This would drain the
pagevecs periodically, but it would also introduce nondeterministic
interference of its own.
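
The hook itself would be pretty much trivial. Roughly (just a sketch;
everything else in vmstat_update would stay as it is):

static void vmstat_update(struct work_struct *w)
{
	/*
	 * Drain this CPU's lru pagevecs while we are running here
	 * anyway. lru_add_drain only touches local pagevecs, so this
	 * adds no IPIs, just a bit of work every stat interval.
	 */
	lru_add_drain();

	if (refresh_cpu_vm_stats(true)) {
		/* existing rescheduling of the deferred work ... */
	}
}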

-- 
Michal Hocko
SUSE Labs

Thread overview: 50+ messages
2016-04-27 17:01 mm: pages are not freed from lru_add_pvecs after process termination Odzioba, Lukasz
2016-04-27 17:11 ` Dave Hansen
2016-04-28 14:37   ` Michal Hocko [this message]
2016-05-02 13:00     ` Michal Hocko
2016-05-04 19:41       ` Odzioba, Lukasz
2016-05-04 20:16         ` Dave Hansen
2016-05-04 20:36         ` Michal Hocko
2016-05-05  7:21           ` Michal Hocko
2016-05-05 17:25             ` Odzioba, Lukasz
2016-05-11  7:38               ` Michal Hocko
2016-05-06 15:10             ` Odzioba, Lukasz
2016-05-06 16:04               ` Dave Hansen
2016-05-11  7:53                 ` Michal Hocko
2016-05-13 11:29                   ` Vlastimil Babka
2016-05-13 12:05                   ` Odzioba, Lukasz
2016-06-07  9:02                   ` Odzioba, Lukasz
2016-06-07 11:19                     ` Michal Hocko
2016-06-08  8:51                       ` Odzioba, Lukasz
2016-05-02 14:39   ` Vlastimil Babka
2016-05-02 15:01     ` Kirill A. Shutemov
2016-05-02 15:13       ` Vlastimil Babka
2016-05-02 15:49       ` Dave Hansen
2016-05-02 16:02         ` Kirill A. Shutemov
2016-05-03  7:37           ` Michal Hocko
2016-05-03 10:07             ` Kirill A. Shutemov
