From: "Martin J. Bligh" <mbligh@aracnet.com>
To: Brent Casavant <bcasavan@sgi.com>
Cc: Andi Kleen <ak@suse.de>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hugh@veritas.com
Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
Date: Tue, 02 Nov 2004 14:51:59 -0800	[thread overview]
Message-ID: <239530000.1099435919@flay> (raw)
In-Reply-To: <Pine.SGI.4.58.0411021613300.79056@kzerza.americas.sgi.com>

> I fully agree with Martin's statement "You WANT your data to be local."
> However, it can become non-local in two different manners, each of
> which wants tmpfs to behave differently.  (The next two paragraphs are
> simply to summarize, no position is being taken.)
> 
> The manner I'm concerned about is when a long-lived file (long-lived
> meaning at least for the duration of the run of a large multithreaded app)
> is placed in memory as an accidental artifact of the CPU which happened
> to create the file.

Agreed, I see that point - if it's a globally accessed file that's
created by one CPU, you want it spread around. However ... how the hell
is the VM meant to distinguish that? The correct way is for the application
to tell us that - i.e. use the NUMA API.
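
[Illustration only, not part of Brent's patch: roughly what "use the NUMA
API" means in practice. The process that populates the tmpfs file mmap()s
it and mbind()s the mapping to MPOL_INTERLEAVE before copying the data in,
so the pages fault in spread across nodes. The helper name and the 4-node
mask are made up; assumes libnuma's <numaif.h>, link with -lnuma.]

#include <fcntl.h>
#include <numaif.h>        /* mbind(), MPOL_INTERLEAVE */
#include <sys/mman.h>
#include <unistd.h>

static void *create_interleaved(const char *path, size_t size)
{
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0)
                return NULL;
        if (ftruncate(fd, size) < 0) {
                close(fd);
                return NULL;
        }

        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);
        if (p == MAP_FAILED)
                return NULL;

        /* Interleave across nodes 0-3; adjust the mask for the box.
         * tmpfs supports shared policies, so this should apply to the
         * file's pages, not just to this one mapping. */
        unsigned long nodemask = 0xf;
        if (mbind(p, size, MPOL_INTERLEAVE, &nodemask,
                  8 * sizeof(nodemask), 0) < 0) {
                munmap(p, size);
                return NULL;
        }

        /* Now copy the data in through p; pages fault in interleaved. */
        return p;
}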

> Imagine, for instance, an application startup
> script copying a large file into a tmpfs filesystem before spawning
> the actual computation application itself.  There is a large class of
> HPC applications which are tightly synchronized, and which require nearly
> all of the system memory (i.e. almost all the memory on each node).
> In this case the "victim" node will be forced to go off-node for the
> bulk of its memory accesses, destroying locality, and slowing down
> the entire application.  This problem is alleviated if the tmpfs file
> is fairly well distributed across the nodes.

There are definitely situations where you'd want both, I'd agree.

> But I do understand the opposite situation.  If a lightly-threaded
> application wants to access the file data, or the application does
> not require large amounts of additional memory (i.e. nearly all the
> memory on each node) getting the file itself allocated close to the
> processor is more beneficial.  In this case the distribution of tmpfs
> pages is non-ideal (though I'm not sure it's worst-case).

Well, it's not quite worst case  ... you'll only get (n-1)/n of your pages
off node (where n is number of nodes). I guess we could deliberately
force ALL of them off-node just to make sure ;-)
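
(For example, on a 4-node box, evenly interleaved pages leave a single-node
reader with roughly 3/4 of its accesses remote, versus none if everything
had been allocated locally.)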

>> > But that's a bit ugly to distinguish, that is why I suggested the sysctl.
>> 
>> As long as it defaults to off, I guess I don't really care. Though I'm still
>> wholly unconvinced it makes much sense. I think we're just papering over the
>> underlying problem - that we don't do good balancing between nodes under
>> mem pressure.
> 
> It's a tough situation, as shown above.  The HPC workload I mentioned
> would much prefer the tmpfs file to be distributed.  A non-HPC workload
> would prefer the tmpfs files be local.  Short of a sysctl I'm not sure
> how the system could make an intelligent decision about what to do under
> memory pressure -- it simply isn't knowledge the kernel can have.

It is if you tell it from the app ;-) But otherwise yes, I'd agree.
 
> I've got a new patch including Andi's suggested sysctl ready to go.
> But I've seen one vote for defaulting to on, and one for defaulting
> to off.  I'd vote for on, but then again I'm biased. :)  Who arbitrates
> this one, Hugh Dickins (it's a tmpfs change, after all)?  From SGI's
> perspective we can live with it either way; we'll simply need to
> document our recommendations for our customers.

I guess Hugh or Andrew. But I'm very reluctant to see the current default
changed to benefit one particular set of workloads ... unless there's
overwhelming evidence that those workloads are massively predominant.
A per-arch thing might make sense, I suppose, if you're convinced that
all that ia64 NUMA machines will ever run is HPC apps.

> As soon as I know whether this should default on or off, I'll post
> the new patch, including the sysctl.

Another way might be a tmpfs mount option ... I'd prefer that to a sysctl
personally, but maybe others wouldn't. Hugh, is that nuts?
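
[Illustrative syntax only - the option name and mount point are made up
here, just to show the shape of the idea:

        mount -t tmpfs -o mpol=interleave none /hpc_scratch

i.e. whoever sets up the filesystem for the HPC job opts into interleaving,
and every other tmpfs keeps the local-allocation default.]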

M.



Thread overview: 49+ messages

2004-11-02  1:07 [PATCH] Use MPOL_INTERLEAVE for tmpfs files Brent Casavant
2004-11-02  1:43 ` Dave Hansen
2004-11-02  9:13 ` Andi Kleen
2004-11-02 15:46 ` Martin J. Bligh
2004-11-02 15:55   ` Andi Kleen
2004-11-02 16:55     ` Martin J. Bligh
2004-11-02 22:17       ` Brent Casavant
2004-11-02 22:51         ` Martin J. Bligh [this message]
2004-11-03  1:12           ` Brent Casavant
2004-11-03  1:30             ` Martin J. Bligh
2004-11-03  8:44           ` Hugh Dickins
2004-11-03  9:01             ` Andi Kleen
2004-11-03 16:32               ` Brent Casavant
2004-11-03 21:00                 ` Martin J. Bligh
2004-11-08 19:58                   ` Brent Casavant
2004-11-08 20:57                     ` Martin J. Bligh
2004-11-09 19:04                     ` Hugh Dickins
2004-11-09 20:09                       ` Martin J. Bligh
2004-11-09 21:08                         ` Hugh Dickins
2004-11-09 22:07                           ` Martin J. Bligh
2004-11-10  2:41                       ` Brent Casavant
2004-11-10 14:20                         ` Hugh Dickins
2004-11-11 19:48                       ` Hugh Dickins
2004-11-11 23:10                         ` Brent Casavant
2004-11-15 22:07                         ` Brent Casavant