* Large Pages - Linux Foundation HPC
@ 2009-04-21 16:32 Badari Pulavarty
  2009-04-21 16:57 ` Dave Hansen
  0 siblings, 1 reply; 7+ messages in thread
From: Badari Pulavarty @ 2009-04-21 16:32 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel

Hi Dave,

In the Linux Foundation HPC track summary, I saw:

-- Memory and interface to it - mapping memory into apps, nodes going
down due to memory exhaustion
               - large pages important - current state not good enough


What does this mean? What's not good enough? What do they want?
Did they spell out the actual requirements, and why? What are
your findings?

Thanks,
Badari



* Re: Large Pages - Linux Foundation HPC
  2009-04-21 16:32 Large Pages - Linux Foundation HPC Badari Pulavarty
@ 2009-04-21 16:57 ` Dave Hansen
  2009-04-21 18:25   ` Balbir Singh
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2009-04-21 16:57 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: linux-kernel, Christoph Lameter, Vivek Kashyap, Mel Gorman,
	Balbir Singh1, Robert MacFarlan

On Tue, 2009-04-21 at 09:32 -0700, Badari Pulavarty wrote:
> Hi Dave,
> 
> On the Linux foundation HPC track summary, I saw:
> 
> -- Memory and interface to it - mapping memory into apps
>      - large pages important - current state not good enough

I'm not sure exactly what this means.  But there was continuing concern
about large page interfaces.  hugetlbfs is fine, but it still requires
special tools, planning, and some modification of the app.  We can
modify it with linker tricks or with LD_PRELOAD, but those certainly
don't work everywhere.  I was told over and over again that hugetlbfs
isn't a sufficient interface for large pages, no matter how much
userspace we try to stick in front of it.
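
To make the "special tools, planning, and modification" point concrete,
here is a minimal sketch of what explicit hugetlbfs use asks of an app.
The mount point, file name, and sizes below are placeholder assumptions,
and it only works if the admin has already reserved huge pages
(vm.nr_hugepages) and mounted hugetlbfs somewhere:

/* Minimal hugetlbfs sketch: map a file on an (assumed) hugetlbfs mount.
 * The length must be a multiple of the huge page size. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_FILE "/mnt/huge/myapp_heap"   /* assumed hugetlbfs mount point */
#define LENGTH    (16UL * 1024 * 1024)     /* 16MB, a multiple of 2MB */

int main(void)
{
    int fd = open(HUGE_FILE, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch the memory so the huge pages are actually faulted in. */
    for (size_t i = 0; i < LENGTH; i += 4096)
        ((char *)addr)[i] = 0;

    munmap(addr, LENGTH);
    close(fd);
    unlink(HUGE_FILE);
    return 0;
}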

Some of their apps get a 6-7x speedup from large pages!

Fragmentation also isn't an issue for a big chunk of the users since
they reboot between each job.

> nodes going down due to memory exhaustion

Virtually all the apps in an HPC environment start up and try to use all
the memory they can get their hands on.  With strict overcommit on, that
probably means brk() or mmap() until they fail.  They also usually
mlock() anything they're able to allocate.  Swapping is the devil to
them. :)
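
Just to illustrate the pattern, the core of such a job looks roughly
like the toy below.  The chunk size is arbitrary, and mlock() will stop
early for unprivileged users once RLIMIT_MEMLOCK is hit:

/* Rough sketch of the allocation pattern: grab as much anonymous memory
 * as the kernel will hand out, then mlock() it so it can never be
 * swapped.  Purely illustrative. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define CHUNK (64UL * 1024 * 1024)   /* 64MB at a time */

int main(void)
{
    size_t total = 0;
    void *p;

    /* mmap() until the kernel (or the overcommit policy) says no. */
    while ((p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)) != MAP_FAILED) {
        if (mlock(p, CHUNK) != 0)     /* pin it; swapping is the devil */
            break;
        memset(p, 0, CHUNK);          /* touch it so it is really ours */
        total += CHUNK;
    }
    printf("pinned %zu MB\n", total >> 20);
    return 0;
}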

Basically, what all the apps do is a recipe for stressing the VM and
triggering the OOM killer.  Most of the users simply hack the kernel and
replace the OOM killer with one that fits their needs.  Some have an
attitude that "the user's app should never die" and others that "the
user caused this, so kill their app".  There's no way to make everyone
happy since they have conflicting requirements.  But this is true of the
kernel in general... nothing special here.

The split LRU should help things.  It will at least make our memory
scanning more efficient and ensure we make better reclaim progress.  I'm
not sure that anyone there knew about the oom_adjust and oom_score knobs
in /proc.  They do now. :)
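
For reference, a tiny sketch of poking those knobs.  On these kernels
the files are /proc/<pid>/oom_adj (writable) and /proc/<pid>/oom_score
(read-only); the value written here is just an example:

/* oom_adj ranges from -17 (never OOM-kill this task, needs privilege)
 * to +15 (preferred victim).  Raising it needs no privilege. */
#include <stdio.h>

int main(void)
{
    FILE *f;
    long score = 0;

    /* Make this process a preferred OOM victim. */
    f = fopen("/proc/self/oom_adj", "w");
    if (f) { fprintf(f, "10\n"); fclose(f); }

    /* Read back the badness score the OOM killer would use. */
    f = fopen("/proc/self/oom_score", "r");
    if (f) { fscanf(f, "%ld", &score); fclose(f); }
    printf("oom_score is now %ld\n", score);
    return 0;
}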

One of my suggestions was to use the memory resource controller.  They
could give each app 95% (or whatever) of the system.  This should let
them keep their current "consume all memory" behavior, but stop them at
sane limits.
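
A sketch of how that could look with the (cgroup v1) memory resource
controller.  The mount point, the group name, and the exact 95% figure
are all assumptions for illustration, not a recommendation:

/* Put the current job into its own memory cgroup capped at ~95% of RAM.
 * Assumes the memory controller is mounted at /cgroups/mem. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long psize = sysconf(_SC_PAGE_SIZE);
    unsigned long long limit = (unsigned long long)pages * psize / 100 * 95;
    FILE *f;

    mkdir("/cgroups/mem/hpc_job", 0755);          /* one group per job */

    f = fopen("/cgroups/mem/hpc_job/memory.limit_in_bytes", "w");
    if (f) { fprintf(f, "%llu\n", limit); fclose(f); }

    f = fopen("/cgroups/mem/hpc_job/tasks", "w"); /* move ourselves in */
    if (f) { fprintf(f, "%d\n", (int)getpid()); fclose(f); }

    /* exec() the real job here; its allocations now stop at ~95% of RAM
     * instead of driving the whole node into the OOM killer. */
    return 0;
}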

That leads into another issue, which is the "wedding cake" software
stack.  There are a lot of software dependencies both in and out of the
kernel.  It is hard to change individual components, especially in the
lower levels.  This leads many of the users to use old (think 2.6.9)
kernels.  Nobody runs mainline, of course.

Then, there's Lustre.  Everybody uses it; it's definitely a big hunk of
the "wedding cake".  I haven't seen any LKML postings on it in years and
I really wonder how it interacts with the VM.  No idea.

There's a "Hyperion cluster" which is for testing new HPC software on a
decently sized cluster.  One suggestion of ours was to try and get
mainline tested on this every so often to look for regressions since
we're not able to glean feedback from 2.6.9 kernel users.  We'll see
where that goes. 

> checkpoint/restart

Many of the MPI implementations have mechanisms in userspace for
checkpointing of user jobs.  Most cluster administrators instruct their
users to use these mechanisms.  Some do.  Most don't.

-- Dave



* Re: Large Pages - Linux Foundation HPC
  2009-04-21 16:57 ` Dave Hansen
@ 2009-04-21 18:25   ` Balbir Singh
  2009-04-25  8:48     ` Wu Fengguang
  0 siblings, 1 reply; 7+ messages in thread
From: Balbir Singh @ 2009-04-21 18:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Badari Pulavarty, linux-kernel, Christoph Lameter, Vivek Kashyap,
	Mel Gorman, Balbir Singh, Robert MacFarlan

[Fix my email address to balbir@linux.vnet.ibm.com]

* Dave Hansen <dave@linux.vnet.ibm.com> [2009-04-21 09:57:05]:
> On Tue, 2009-04-21 at 09:32 -0700, Badari Pulavarty wrote:
> > Hi Dave,
> > 
> > On the Linux foundation HPC track summary, I saw:
> > 
> > -- Memory and interface to it - mapping memory into apps
> >      - large pages important - current state not good enough
> 
> I'm not sure exactly what this means.  But, there was continuing concern
> about large page interfaces.  hugetlbfs is fine, but it still requires
> special tools, planning, and requires some modification of the app.  We
> can modify it with linker tricks or with LD_PRELOAD, but those certainly
> don't work everywhere.  I was told over and over again that hugetlbfs
> isn't a sufficient interface for large pages, no matter how much
> userspace we try to stick in front of it.
> 
> Some of their apps get a 6-7x speedup from large pages!
> 
> Fragmentation also isn't an issue for a big chunk of the users since
> they reboot between each job.
> 
> > nodes going down due to memory exhaustion
> 
> Virtually all the apps in an HPC environment start up try to use all the
> memory they can get their hands on.  With strict overcommit on, that
> probably means brk() or mmap() until they fail.  They also usually
> mlock() anything they're able to allocate.  Swapping is the devil to
> them. :)
> 
> Basically, what all the apps do is a recipe for stressing the VM and
> triggering the OOM killer.  Most of the users simply hack the kernel and
> replace the OOM killer with one that fits their needs.  Some have an
> attitude that "the user's app should never die" and others "the user
> caused this, so kill their app".  Basically, there's no way to make
> everyone happy since they have conflicting requirements.  But, this is
> true of the kernel in general... nothing special here.

The OOM killer has been a hot topic.  Have you seen Dan Malek's patches
at http://lkml.org/lkml/2009/4/13/276?

> 
> The split LRU should help things.  It will at least make our memory
> scanning more efficient and ensure we're making more efficient reclaim
> progress.  I'm not sure that anyone there knew about the oom_adjust and
> oom_score knobs in /proc.  They do now. :)

:-)

> 
> One of my suggestions was to use the memory resource controller.  They
> could give each app 95% (or whatever) of the system.  This should let
> them keep their current "consume all memory" behavior, but stop them at
> sane limits.
> 

Soft limits should help as well; basically, we are trying to allow
unrestricted memory use until there is contention.  The patches are
still under development.

> That leads into another issue, which is the "wedding cake" software
> stack.  There are a lot of software dependencies both in and out of the
> kernel.  It is hard to change individual components, especially in the
> lower levels.  This leads many of the users to use old (think 2.6.9)
> kernels.  Nobody runs mainline, of course.
> 
> Then, there's Lustre.  Everybody uses it, it's definitely a big hunk of
> the "wedding cake".  I haven't seen any LKML postings on it in years and
> I really wonder how it interacts with the VM.  No idea.
> 
> There's a "Hyperion cluster" which is for testing new HPC software on a
> decently sized cluster.  One suggestion of ours was to try and get
> mainline tested on this every so often to look for regressions since
> we're not able to glean feedback from 2.6.9 kernel users.  We'll see
> where that goes. 
> 
> > checkpoint/restart
> 
> Many of the MPI implementations have mechanisms in userspace for
> checkpointing of user jobs.  Most cluster administrators instruct their
> users to use these mechanisms.  Some do.  Most don't.
>

Good inputs and summary. Thanks! 

-- 
	Balbir


* Re: Large Pages - Linux Foundation HPC
  2009-04-21 18:25   ` Balbir Singh
@ 2009-04-25  8:48     ` Wu Fengguang
  2009-04-26  6:54       ` Dave Hansen
  0 siblings, 1 reply; 7+ messages in thread
From: Wu Fengguang @ 2009-04-25  8:48 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Dave Hansen, Badari Pulavarty, linux-kernel, Christoph Lameter,
	Vivek Kashyap, Mel Gorman, Robert MacFarlan, Fu, Michael

On Tue, Apr 21, 2009 at 11:55:55PM +0530, Balbir Singh wrote:
> [Fix my email address to balbir@linux.vnet.ibm.com]
> 
> * Dave Hansen <dave@linux.vnet.ibm.com> [2009-04-21 09:57:05]:
> > On Tue, 2009-04-21 at 09:32 -0700, Badari Pulavarty wrote:
> > > Hi Dave,
> > > 
> > > On the Linux foundation HPC track summary, I saw:
> > > 
> > > -- Memory and interface to it - mapping memory into apps
> > >      - large pages important - current state not good enough
> > 
> > I'm not sure exactly what this means.  But, there was continuing concern
> > about large page interfaces.  hugetlbfs is fine, but it still requires
> > special tools, planning, and requires some modification of the app.  We
> > can modify it with linker tricks or with LD_PRELOAD, but those certainly
> > don't work everywhere.  I was told over and over again that hugetlbfs
> > isn't a sufficient interface for large pages, no matter how much
> > userspace we try to stick in front of it.
> > 
> > Some of their apps get a 6-7x speedup from large pages!
> > 
> > Fragmentation also isn't an issue for a big chunk of the users since
> > they reboot between each job.

Perhaps this policy?

In mlock(), populate huge pages if (1) the mlock range is large enough
to hold some huge pages, and (2) there are more than enough free
high-order pages.  (A rough userspace sketch of the two checks is below.)

This is based on Dave's description that HPC apps typically
- do mlock(), to pre-populate memory and pin it in memory
- run at fresh boot, with loads of high-order pages available
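
Roughly, as a userspace approximation only (the real check would live in
the kernel and ask the buddy allocator directly; the order-9 / 2MB
assumption and the "more than enough" threshold are arbitrary):

/* Userspace sketch of the two checks above -- NOT kernel code.  Condition
 * (2) is approximated by parsing /proc/buddyinfo. */
#include <stdio.h>

#define HUGE_ORDER    9      /* 2MB / 4KB = 512 = 2^9, x86_64 assumption */
#define ENOUGH_BLOCKS 64     /* arbitrary "plenty of high-order pages" */

/* Sum free blocks of order >= HUGE_ORDER over all nodes and zones. */
static long free_huge_blocks(void)
{
    FILE *f = fopen("/proc/buddyinfo", "r");
    char line[512];
    long total = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        int node, pos = 0, order = 0, adv;
        char zone[32];
        long count;

        if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &pos) != 2)
            continue;
        for (char *p = line + pos;
             sscanf(p, " %ld%n", &count, &adv) == 1; p += adv, order++)
            if (order >= HUGE_ORDER)
                total += count;
    }
    fclose(f);
    return total;
}

/* Would an mlock() of 'len' bytes be a candidate for huge-page backing? */
static int mlock_wants_huge_pages(unsigned long len)
{
    return len >= (2UL << 20) &&                  /* (1) room for a 2M page   */
           free_huge_blocks() >= ENOUGH_BLOCKS;   /* (2) plenty of free blocks */
}

int main(void)
{
    printf("64MB mlock would %suse huge pages\n",
           mlock_wants_huge_pages(64UL << 20) ? "" : "not ");
    return 0;
}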

Thanks,
Fengguang


* Re: Large Pages - Linux Foundation HPC
  2009-04-25  8:48     ` Wu Fengguang
@ 2009-04-26  6:54       ` Dave Hansen
  2009-04-27 14:12         ` Christoph Lameter
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2009-04-26  6:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Balbir Singh, Badari Pulavarty, linux-kernel, Christoph Lameter,
	Vivek Kashyap, Mel Gorman, Robert MacFarlan, Fu, Michael

On Sat, 2009-04-25 at 16:48 +0800, Wu Fengguang wrote:
> Based on Dave's descriptions that HPC apps typically
> - do mlock(), to pre-populate memory and pin them in memory
> - run at fresh boot, with loads of high order pages available

There are definitely some of them that do this, but it certainly isn't
all.  It may not even be the norm.  

-- Dave



* Re: Large Pages - Linux Foundation HPC
  2009-04-26  6:54       ` Dave Hansen
@ 2009-04-27 14:12         ` Christoph Lameter
  2009-04-28  3:15           ` Wu Fengguang
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Lameter @ 2009-04-27 14:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Wu Fengguang, Balbir Singh, Badari Pulavarty, linux-kernel,
	Vivek Kashyap, Mel Gorman, Robert MacFarlan, Fu, Michael

On Sat, 25 Apr 2009, Dave Hansen wrote:

> On Sat, 2009-04-25 at 16:48 +0800, Wu Fengguang wrote:
> > Based on Dave's descriptions that HPC apps typically
> > - do mlock(), to pre-populate memory and pin them in memory
> > - run at fresh boot, with loads of high order pages available
>
> There are definitely some of them that do this, but it certainly isn't
> all.  It may not even be the norm.

Some of the machines have so much memory available that 2M allocations
are likely to succeed.  If a machine has a couple of terabytes of memory
available, then it's highly unlikely that a 2M allocation will not
succeed.



* Re: Large Pages - Linux Foundation HPC
  2009-04-27 14:12         ` Christoph Lameter
@ 2009-04-28  3:15           ` Wu Fengguang
  0 siblings, 0 replies; 7+ messages in thread
From: Wu Fengguang @ 2009-04-28  3:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Hansen, Balbir Singh, Badari Pulavarty, linux-kernel,
	Vivek Kashyap, Mel Gorman, Robert MacFarlan, Fu, Michael

On Mon, Apr 27, 2009 at 10:12:26PM +0800, Christoph Lameter wrote:
> On Sat, 25 Apr 2009, Dave Hansen wrote:
> 
> > On Sat, 2009-04-25 at 16:48 +0800, Wu Fengguang wrote:
> > > Based on Dave's descriptions that HPC apps typically
> > > - do mlock(), to pre-populate memory and pin them in memory
> > > - run at fresh boot, with loads of high order pages available
> >
> > There are definitely some of them that do this, but it certainly isn't
> > all.  It may not even be the norm.
> 
> Some of the machine have so much memory available that 2M allocations are
> likely to succeed. If a machine has a couple of terabytes of memory
> available then its highly unlikely that a 2M allocation will not succeed.

Huge pages used to be explicitly managed as scarce resources.  As time
goes by, I'd expect a move to the "huge pages as an optimization" point
of view: optimizations that can benefit unmodified applications and can
fail gracefully.
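
Applications can already get a crude version of that today with
SHM_HUGETLB plus a fallback.  A minimal sketch -- the sizes are
arbitrary, and since SHM_HUGETLB still draws from the reserved hugetlb
pool this only approximates the truly transparent case:

/* "Huge pages as an optimization that fails gracefully": try a
 * huge-page-backed SysV segment, fall back to ordinary pages. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000        /* kernel ABI value; older headers may lack it */
#endif

/* len must be a multiple of the huge page size for the huge-page attempt. */
static void *alloc_maybe_huge(size_t len)
{
    int id = shmget(IPC_PRIVATE, len, SHM_HUGETLB | IPC_CREAT | 0600);
    if (id >= 0) {
        void *p = shmat(id, NULL, 0);
        shmctl(id, IPC_RMID, NULL);  /* destroy segment on last detach */
        if (p != (void *)-1)
            return p;                /* got huge pages */
    }
    /* Graceful fallback to normal pages.  (A real allocator would remember
     * which path it took so it can shmdt() or free() appropriately.) */
    return malloc(len);
}

int main(void)
{
    void *buf = alloc_maybe_huge(4UL << 20);   /* 4MB: a multiple of 2MB */
    printf("allocation %s\n", buf ? "succeeded" : "failed");
    return 0;
}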

