* How does numa_parse_nodestring +0-1 work / how should it work
@ 2013-05-16 12:00 Andreas Mueller
From: Andreas Mueller @ 2013-05-16 12:00 UTC (permalink / raw)
  To: 'linux-numa@vger.kernel.org'

Hi,

Could someone please explain the behaviour of a nodestring like "+0-1" for numa_parse_nodestring?

According to the manual, "a leading + can be used to indicate that the node numbers in the list are relative to the task's cpuset".

What I expected and want to achieve is that if a process is started on node 3, I get a node mask of 3,4.
What I actually get (on kernel 2.6.32) is always 0,1, independently of the node the process was started on (whether by chance or by assigning a specific node with numactl).
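For reference, this is roughly how I call it (a minimal sketch, error handling mostly omitted; link with -lnuma):

#include <stdio.h>
#include <numa.h>

int main(void)
{
	if (numa_available() < 0)
		return 1;
	/* Relative nodestring: per the manual this should mean
	 * "the first two nodes of the task's cpuset". */
	struct bitmask *nodes = numa_parse_nodestring("+0-1");
	if (!nodes) {
		fprintf(stderr, "bad nodestring\n");
		return 1;
	}
	numa_bind(nodes);	/* restrict CPUs and memory to these nodes */
	numa_free_nodemask(nodes);
	/* ... start worker threads here ... */
	return 0;
}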

Background: I want to limit a multithreaded process to a subset of the available nodes, but I want that subset to include the node it is currently running on, because I assume the memory (stack/heap) already in use by the process lives mainly on that node, so it makes sense to include it. I know I could simply start the process with numactl, but I want to implement a useful default behaviour, and for this kind of process that is achieved neither by limiting it to one node nor by allowing all nodes.
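What I am after could be built by hand instead, roughly like this (again just a sketch against libnuma v2, assuming numa_node_of_cpu() is available; bind_current_plus_next is only an illustrative name, and I am not sure this is the intended way):

#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>

/* Sketch: bind the process to the node it currently runs on,
 * plus the next node if there is one. */
static void bind_current_plus_next(void)
{
	int cpu  = sched_getcpu();		/* CPU we happen to run on */
	int node = numa_node_of_cpu(cpu);	/* its NUMA node */

	struct bitmask *nodes = numa_allocate_nodemask();
	numa_bitmask_setbit(nodes, node);
	if (node < numa_max_node())
		numa_bitmask_setbit(nodes, node + 1);
	numa_bind(nodes);			/* restrict CPUs and memory */
	numa_free_nodemask(nodes);
}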

Thanks in advance for any help.

Best regards,
Andreas Mueller



* Re: How does numa_parse_nodestring +0-1 work / how should it work
@ 2013-05-16 12:11 Andi Kleen
From: Andi Kleen @ 2013-05-16 12:11 UTC (permalink / raw)
  To: Andreas Mueller; +Cc: 'linux-numa@vger.kernel.org'

On Thu, May 16, 2013 at 02:00:40PM +0200, Andreas Mueller wrote:
> Could someone please explain the behaviour of a nodestring like "+0-1" for numa_parse_nodestring?
> 
> According to the manual, "a leading + can be used to indicate that the node numbers in the list are relative to the task's cpuset".
> 
> What I expected and want to achieve is that if a process is started on node 3, I get a node mask of 3,4.

cpuset here refers to the cpusets described in
http://man7.org/linux/man-pages/man7/cpuset.7.html
You seem to think it refers to plain task affinity, which is not the
case.
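You can check which cpuset your task is in, and which memory nodes that
cpuset allows, directly from procfs; "+0-1" is interpreted relative to
that list. A quick sketch:

#include <stdio.h>
#include <string.h>

/* Print this task's cpuset path and its allowed memory nodes;
 * a relative nodestring counts within Mems_allowed_list. */
int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/self/cpuset", "r");
	if (f) {
		if (fgets(line, sizeof(line), f))
			printf("cpuset: %s", line);
		fclose(f);
	}
	f = fopen("/proc/self/status", "r");
	if (f) {
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "Mems_allowed_list:", 18))
				fputs(line, stdout);
		fclose(f);
	}
	return 0;
}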

In general, generic relative NUMA policy descriptions would be nice
(like in OpenMP 4), but currently we only have them with cpusets.

-andi


* NUMA performance, optimal settings for multithreaded, memory-mapped file access
@ 2013-05-22  9:22 Andreas Mueller
From: Andreas Mueller @ 2013-05-22  9:22 UTC (permalink / raw)
  To: 'linux-numa@vger.kernel.org'

Hello,

I have rather strange test results with different NUMA settings. My application uses multiple threads and (simplified) accesses data in a memory-mapped file. I tried to find optimal settings and see the following effects with 40 threads (on NUMA hardware with 8 nodes and 6 CPUs per node).
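Simplified, each thread does something like the following (a rough sketch of the access pattern, not our real code; the file name and the slicing are made up):

#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Sketch: N threads read disjoint slices of one shared mapping. */
#define NTHREADS 40

static char  *map;
static size_t slice;

static void *worker(void *arg)
{
	size_t i, off = (size_t)arg * slice;
	volatile char sum = 0;		/* keep the loop from being optimized away */
	for (i = 0; i < slice; i++)
		sum += map[off + i];	/* touch every byte of the slice */
	return NULL;
}

int main(void)
{
	pthread_t th[NTHREADS];
	struct stat st;
	int i, fd = open("data.bin", O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	slice = st.st_size / NTHREADS;
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, worker, (void *)(size_t)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(th[i], NULL);
	return 0;
}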

Independently of the number of assigned nodes, the runtime is nearly the same (25 +/- 3 seconds). The CPU usage (as shown in top) is nearly 100% times the number of processors (e.g. 600% for one node, up to 4000% for 8 nodes; not 4800% there, because I use only 40 threads).

The user CPU time (measured with time(1)) is around 1m20s (+/- 5s). The system CPU time is 1m20s with 1 node and increases from there (3 min for 2 nodes, 6 min for 3 nodes, and so on, up to 15 min for 8 nodes).
So every gain from additional nodes/CPUs is completely eaten up by system CPU time.

I made some other tests with CPU assignments only. If I assign only 1 CPU, the test takes 47s. The optimum is reached at 4 CPUs (on 2 different nodes) with 16s. This is much better than any of the results where I limited the process to 1, 2, ... or 8 whole nodes.
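For those runs I pin the process roughly like this (a sketch, equivalent to what numactl --physcpubind does; the CPU numbers are examples, assuming CPUs 0-5 sit on node 0 and 6-11 on node 1 on this box):

#define _GNU_SOURCE
#include <sched.h>

/* Sketch: pin the calling thread (and threads it creates afterwards,
 * which inherit the mask) to four CPUs on two different nodes. */
static int pin_to_four_cpus(void)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(0, &set);	/* node 0 */
	CPU_SET(1, &set);	/* node 0 */
	CPU_SET(6, &set);	/* node 1 */
	CPU_SET(7, &set);	/* node 1 */
	return sched_setaffinity(0, sizeof(set), &set);
}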

The system is SUSE Linux Enterprise Server 11 SP1 with kernel 2.6.32.12.

I cannot rule out problems in our application or my test setup, but we have some background in optimizing it for multithreading and saw good scaling on non-NUMA hardware.

It seems to me that either scheduling or memory access (especially to memory-mapped files) does not really scale on this kernel.
Could pthread mutexes be the problem (scaling badly on NUMA), or rather some internal kernel locks taken when accessing memory located on another node?

Has anyone seen similar effects, or hints that could improve the situation? Are there better NUMA-optimized kernels available (for SUSE or Red Hat Enterprise Linux)?

Thank you in advance!

Regards,
Andreas


