From: Michal Hocko <mhocko@kernel.org>
To: "Odzioba, Lukasz" <lukasz.odzioba@intel.com>
Cc: "Hansen, Dave" <dave.hansen@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"Shutemov, Kirill" <kirill.shutemov@intel.com>,
	"Anaczkowski, Lukasz" <lukasz.anaczkowski@intel.com>
Subject: Re: mm: pages are not freed from lru_add_pvecs after process termination
Date: Wed, 4 May 2016 22:36:43 +0200	[thread overview]
Message-ID: <20160504203643.GI21490@dhcp22.suse.cz> (raw)
In-Reply-To: <D6EDEBF1F91015459DB866AC4EE162CC023C182F@IRSMSX103.ger.corp.intel.com>

On Wed 04-05-16 19:41:59, Odzioba, Lukasz wrote:
> On Thu 02-05-16 03:00:00, Michal Hocko wrote:
> > So I have given this a try (not tested yet) and it doesn't look terribly
> > complicated. It is hijacking vmstat for a purpose it wasn't intended for
> > originally, but creating dedicated kernel threads/WQs sounds like
> > overkill to me. Does this help, or do we have to be more aggressive and
> > wake up the shepherd from the allocator slow path? Could you give it a
> > try please?
> 
> It seems to work fine, but it takes a fairly random amount of time to drain
> the lists: sometimes a couple of seconds, sometimes over two minutes. I
> believe that is acceptable.

I guess you mean that some CPUs are not drained for a few minutes, right?
That can be quite long. I deliberately did not hook the LRU drain into the
idle entry path because I felt it would be too expensive. Maybe it would be
better to kick vmstat_shepherd from the allocator slow path. It would still
take an unpredictable amount of time, but it would at least be called when
we are getting short on memory.
 
> I have an app which allocates almost all of the memory of a NUMA node,
> and with just the second patch applied, 30-50% of 100 consecutive
> executions got killed.

This is still not acceptable, so I guess we need a way to kick
vmstat_shepherd from the reclaim path. I will think about that; it sounds a
bit tricky at first sight.

> After applying also your first patch I haven't seen any oom kill
> activity - great.

As I've said, the first patch is quite dangerous because it depends on the
WQ to make forward progress, which might itself depend on a memory
allocation to create a new worker.
 
> I was wondering how many lru_add_drain()s are called: after boot, with
> the machine idle, it was a bit over 5k calls during the first 400s, and
> with some activity it went up to 15k calls during 700s (including the 5k
> from the previous experiment), which sounds fair to me given the big CPU
> count.
> 
> Do you see any advantages of dropping THP from pagevecs over this
> solution?

Well, the general purpose of the pcp pagevecs is to reduce lru_lock
contention. I have never measured the effect of THP pages. It is true that
THP amortizes the contention by the number of base pages handled at once,
so it might be the easiest way out (and certainly more acceptable for an
old kernel, which you seem to be running, as mentioned by Dave), but it
sounds too special-cased and I would rather see less special casing for
THP. So if the async pcp sync is not too tricky or hard to maintain, and it
works, I would rather go that way.

Thanks for testing those patches!
-- 
Michal Hocko
SUSE Labs

  parent reply	other threads:[~2016-05-04 20:36 UTC|newest]

Thread overview: 50+ messages
2016-04-27 17:01 mm: pages are not freed from lru_add_pvecs after process termination Odzioba, Lukasz
2016-04-27 17:11 ` Dave Hansen
2016-04-28 14:37   ` Michal Hocko
2016-05-02 13:00     ` Michal Hocko
2016-05-04 19:41       ` Odzioba, Lukasz
2016-05-04 20:16         ` Dave Hansen
2016-05-04 20:36         ` Michal Hocko [this message]
2016-05-05  7:21           ` Michal Hocko
2016-05-05 17:25             ` Odzioba, Lukasz
2016-05-11  7:38               ` Michal Hocko
2016-05-06 15:10             ` Odzioba, Lukasz
2016-05-06 16:04               ` Dave Hansen
2016-05-11  7:53                 ` Michal Hocko
2016-05-13 11:29                   ` Vlastimil Babka
2016-05-13 12:05                   ` Odzioba, Lukasz
2016-06-07  9:02                   ` Odzioba, Lukasz
2016-06-07 11:19                     ` Michal Hocko
2016-06-08  8:51                       ` Odzioba, Lukasz
2016-05-02 14:39   ` Vlastimil Babka
2016-05-02 15:01     ` Kirill A. Shutemov
2016-05-02 15:13       ` Vlastimil Babka
2016-05-02 15:49       ` Dave Hansen
2016-05-02 16:02         ` Kirill A. Shutemov
2016-05-03  7:37           ` Michal Hocko
2016-05-03 10:07             ` Kirill A. Shutemov
